Alex Behm has posted comments on this change.

Change subject: IMPALA-3687: Prefer Avro field name during schema reconciliation
......................................................................


Patch Set 6:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/3331/6//COMMIT_MSG
Commit Message:

Line 9: Currently Impala prefers schema definition from schema definition
How about something like the following blurb.

Since it is possible to create an Avro table with both column definitions and 
an Avro schema, Impala attempts to reconcile inconsistencies in the two schema 
definitions, generally preferring the Avro schema. The only exception to this 
rule was with CHAR/VARCHAR/STRING columns, where the column definition was 
preferred in order to support tables with CHAR/VARCHAR columns although Avro 
only supports STRING. This exception is confusing because the name for such a 
column will be taken from the column definition (and not from the Avro schema).


http://gerrit.cloudera.org:8080/#/c/3331/6/fe/src/main/java/com/cloudera/impala/util/AvroSchemaUtils.java
File fe/src/main/java/com/cloudera/impala/util/AvroSchemaUtils.java:

Line 136:       // as a CHAR/VARCHAR/STRING in the reconciled schema. Column 
name and comment
> another reason is that we did this before so it is hard to drop it. ideally
I agree with Juan, and I think we all agree with her. This reconciliation logic 
is very subtle and brittle, so it needs to go away. Ideally, we would only 
accept tables with column definitions xor an Avro schema definition. Allowing 
two schemas simultaneously is crazy (but Hive, for example, does allow it).

Given how often these kinds of Avro issues pop up, I'm very afraid that 
rejecting inconsistencies will cause many users' existing tables to break. 
Further, I think it will require manual intervention at the HMS backend DB 
level to fix such tables that have become unusable due to our "fix" to reject 
tables with inconsistent schemas definitions.

Technically, even the current patch breaks compatibility, but it seems far less 
dangerous than flat out rejecting tables with inconsistencies.

I'm in favor of proceeding with the current solution - but we definitely need 
to keep the correct fix in mind (not allowing two schemas as all) for a future 
release.


Line 143:         reconciledColDef.analyze();
analysis should never fail, might as well try catch with a Preconditions check 
in the catch. Then we don't have to throw in this function


http://gerrit.cloudera.org:8080/#/c/3331/6/testdata/workloads/functional-query/queries/QueryTest/avro-schema-resolution.test
File 
testdata/workloads/functional-query/queries/QueryTest/avro-schema-resolution.test:

Line 95: {"type": ["int", "null"], "name": "id_mismatch", "doc": "int comment 
mismatch"},
Dimitris' request to include the original test case seems reasonable. We can 
just alter the table and change the schema literal.


-- 
To view, visit http://gerrit.cloudera.org:8080/3331
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia3e43b2885853c2b4f207a45a873c9d7f31379cd
Gerrit-PatchSet: 6
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Huaisi Xu <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Bharath Vissapragada <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Dimitris Tsirogiannis <[email protected]>
Gerrit-Reviewer: Huaisi Xu <[email protected]>
Gerrit-Reviewer: Juan Yu <[email protected]>
Gerrit-HasComments: Yes

Reply via email to