[GitHub] [drill] paul-rogers commented on a change in pull request #1993: DRILL-7601: Shift column conversion to reader from scan framework

GitBox Tue, 10 Mar 2020 11:13:11 -0700

paul-rogers commented on a change in pull request #1993: DRILL-7601: Shift 
column conversion to reader from scan framework
URL: https://github.com/apache/drill/pull/1993#discussion_r390500706


 ##########
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/avro/AvroBatchReader.java
 ##########
 @@ -69,11 +69,15 @@ public boolean open(FileScanFramework.FileSchemaNegotiator negotiator) {
       negotiator.userName(), negotiator.context().getFragmentContext().getQueryUserName());
 
     logger.debug("Avro file schema: {}", reader.getSchema());
-    TupleMetadata schema = AvroSchemaUtil.convert(reader.getSchema());
-    logger.debug("Avro file converted schema: {}", schema);
-    negotiator.setTableSchema(schema, true);
+    TupleMetadata readerSchema = AvroSchemaUtil.convert(reader.getSchema());
+    logger.debug("Avro file converted schema: {}", readerSchema);
+    TupleMetadata providedSchema = negotiator.providedSchema();
+    TupleMetadata tableSchema = StandardConversions.mergeSchemas(providedSchema, readerSchema);
 
 Review comment:
   You have identified a tricky area. The planner is the correct place to 
handle scan output schema: then the reader takes that output schema as gospel. 
Here we have readers that have a strong opinion about what schema they can 
produce; the provided schema is mostly a "suggestion." So, we're in an awkward 
middle area.
   
   In particular, the provided schema can include columns that the reader does 
not know about. Imagine a set of Avro files where some files use schema 
v1, while others use schema v2 with an additional column. The v1 readers can't 
provide that extra column. Filling in missing columns is done via the scan 
framework. (Maybe this does not occur in Avro; maybe the Avro schema handles 
these cases. But, the issue does arise in CSV, JSON and others.)
   
   So, at the very least, the reader is not responsible for any column except 
those it can produce; any "extra" columns are handled by the scan framework.
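   
   To make that split concrete, here is a minimal sketch. It is not Drill 
code: `ColumnSplit` and its fields are hypothetical names, columns are 
reduced to plain strings, and name matching is case-sensitive only to keep 
the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Drill. Splits the provided schema's column
// names into those found in the file (the reader's job) and those that are
// absent (left for the scan framework to fill in).
final class ColumnSplit {
  final List<String> readerColumns = new ArrayList<>();
  final List<String> missingColumns = new ArrayList<>();

  static ColumnSplit split(List<String> providedColumns, Set<String> fileColumns) {
    ColumnSplit result = new ColumnSplit();
    for (String name : providedColumns) {
      if (fileColumns.contains(name)) {
        // The file has this column, so the reader produces it.
        result.readerColumns.add(name);
      } else {
        // The reader knows nothing about this column; it does not try to fake it.
        result.missingColumns.add(name);
      }
    }
    return result;
  }
}
```

   With a provided schema of (a, b, c) and a v1 file that holds only (a, b), 
the reader ends up with a and b, and c falls to the scan framework.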
   
   In the ideal case, the provided schema says to the reader, "of the ways you 
know how to convert a column, use **this** one." So, the reader has to know the 
target type of the column.
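   
   A rough sketch of that "use this one" rule, under heavy assumptions: 
schemas are plain name-to-type maps rather than `TupleMetadata`, and 
`TargetTypes.resolve` is a hypothetical helper, not the 
`StandardConversions.mergeSchemas` method in this PR. It also does not check 
that the reader can actually convert to the requested type, which a real 
version would need to do.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Drill. For each column the reader can
// produce, pick the target type: the provided type if one was given,
// otherwise the reader's own native type.
final class TargetTypes {
  static Map<String, String> resolve(Map<String, String> readerSchema,
                                     Map<String, String> providedSchema) {
    // LinkedHashMap keeps the reader's column order.
    Map<String, String> result = new LinkedHashMap<>();
    for (Map.Entry<String, String> col : readerSchema.entrySet()) {
      String provided = providedSchema.get(col.getKey());
      result.put(col.getKey(), provided != null ? provided : col.getValue());
    }
    // Columns that appear only in providedSchema are deliberately left out:
    // the reader cannot produce them.
    return result;
  }
}
```

   For a reader schema of (id INT, ts VARCHAR) and a provided schema of 
(ts TIMESTAMP, extra INT), this yields (id INT, ts TIMESTAMP); the reader 
then needs a VARCHAR-to-TIMESTAMP conversion for ts.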
   
   All this said, my goal in a later PR is to reshuffle some of this stuff 
again. Now that we have multiple examples, is there some work that can be 
pulled out and reused? For example, can we build an explicit mapping from the 
reader's idea of the schema to the provided schema? Or, does it turn out that 
the reader's idea of types is unique to each reader, so there is no common 
mechanism? For example, the Avro code looks quite specific to Avro and is 
perhaps not easily reused.
   
   Suggestions?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
