Alex Behm has posted comments on this change. Change subject: IMPALA-3687: Prefer Avro field name during schema reconciliation ......................................................................
Patch Set 6: (4 comments) http://gerrit.cloudera.org:8080/#/c/3331/6//COMMIT_MSG Commit Message: Line 9: Currently Impala prefers schema definition from schema definition How about something like the following blurb. Since it is possible to create an Avro table with both column definitions and an Avro schema, Impala attempts to reconcile inconsistencies in the two schema definitions, generally preferring the Avro schema. The only exception to this rule was with CHAR/VARCHAR/STRING columns, where the column definition was preferred in order to support tables with CHAR/VARCHAR columns although Avro only supports STRING. This exception is confusing because the name for such a column will be taken from the column definition (and not from the Avro schema). http://gerrit.cloudera.org:8080/#/c/3331/6/fe/src/main/java/com/cloudera/impala/util/AvroSchemaUtils.java File fe/src/main/java/com/cloudera/impala/util/AvroSchemaUtils.java: Line 136: // as a CHAR/VARCHAR/STRING in the reconciled schema. Column name and comment > another reason is that we did this before so it is hard to drop it. ideally I agree with Juan, and I think we all agree with her. This reconciliation logic is very subtle and brittle, so it needs to go away. Ideally, we would only accept tables with column definitions xor an Avro schema definition. Allowing two schemas simultaneously is crazy (but Hive, for example, does allow it). Given how often these kinds of Avro issues pop up, I'm very afraid that rejecting inconsistencies will cause many users' existing tables to break. Further, I think it will require manual intervention at the HMS backend DB level to fix such tables that have become unusable due to our "fix" to reject tables with inconsistent schemas definitions. Technically, even the current patch breaks compatibility, but it seems far less dangerous than flat out rejecting tables with inconsistencies. I'm in favor of proceeding with the current solution - but we definitely need to keep the correct fix in mind (not allowing two schemas as all) for a future release. Line 143: reconciledColDef.analyze(); analysis should never fail, might as well try catch with a Preconditions check in the catch. Then we don't have to throw in this function http://gerrit.cloudera.org:8080/#/c/3331/6/testdata/workloads/functional-query/queries/QueryTest/avro-schema-resolution.test File testdata/workloads/functional-query/queries/QueryTest/avro-schema-resolution.test: Line 95: {"type": ["int", "null"], "name": "id_mismatch", "doc": "int comment mismatch"}, Dimitris' request to include the original test case seems reasonable. We can just alter the table and change the schema literal. -- To view, visit http://gerrit.cloudera.org:8080/3331 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ia3e43b2885853c2b4f207a45a873c9d7f31379cd Gerrit-PatchSet: 6 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Huaisi Xu <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Bharath Vissapragada <[email protected]> Gerrit-Reviewer: Dan Hecht <[email protected]> Gerrit-Reviewer: Dimitris Tsirogiannis <[email protected]> Gerrit-Reviewer: Huaisi Xu <[email protected]> Gerrit-Reviewer: Juan Yu <[email protected]> Gerrit-HasComments: Yes
