paul-rogers commented on a change in pull request #1993: DRILL-7601: Shift
column conversion to reader from scan framework
URL: https://github.com/apache/drill/pull/1993#discussion_r390500706
##########
File path:
exec/java-exec/src/main/java/org/apache/drill/exec/store/avro/AvroBatchReader.java
##########
@@ -69,11 +69,15 @@ public boolean open(FileScanFramework.FileSchemaNegotiator
negotiator) {
negotiator.userName(),
negotiator.context().getFragmentContext().getQueryUserName());
logger.debug("Avro file schema: {}", reader.getSchema());
- TupleMetadata schema = AvroSchemaUtil.convert(reader.getSchema());
- logger.debug("Avro file converted schema: {}", schema);
- negotiator.setTableSchema(schema, true);
+ TupleMetadata readerSchema = AvroSchemaUtil.convert(reader.getSchema());
+ logger.debug("Avro file converted schema: {}", readerSchema);
+ TupleMetadata providedSchema = negotiator.providedSchema();
+ TupleMetadata tableSchema =
StandardConversions.mergeSchemas(providedSchema, readerSchema);
Review comment:
You have identified a tricky area. The planner is the correct place to
handle the scan output schema; the reader then takes that output schema as gospel.
Here we have readers that have a strong opinion about what schema they can
produce; the provided schema is mostly a "suggestion." So, we're in an awkward
middle area.
In particular, the provided schema can include columns that the reader does
not know about. Imagine the case of an Avro table where some files use schema
v1, while others use schema v2 with an additional column. The v1 readers can't
provide that extra column. Filling in missing columns is done via the scan
framework. (Maybe this does not occur in Avro; maybe the Avro schema handles
these cases. But, the issue does arise in CSV, JSON and others.)
So, at the very least, the reader is not responsible for any column except
those it can produce; any "extra" columns are handled by the scan framework.
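To make that split concrete, here is a rough sketch, not Drill code. Plain column-name lists stand in for `TupleMetadata`, and `ColumnSplit`, `split`, `readerColumns` and `missingColumns` are hypothetical names. It also assumes exact name matching, which may not be what we want in practice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: column names stand in for Drill's TupleMetadata.
class ColumnSplit {
  // Provided-schema columns that this file's reader can actually produce.
  final List<String> readerColumns = new ArrayList<>();
  // Provided-schema columns absent from this file; left to the scan framework.
  final List<String> missingColumns = new ArrayList<>();

  // Walk the provided schema in order and sort each column into one bucket.
  // Assumption: names match exactly (no case folding).
  static ColumnSplit split(List<String> providedSchema, List<String> fileSchema) {
    Set<String> inFile = new HashSet<>(fileSchema);
    ColumnSplit result = new ColumnSplit();
    for (String col : providedSchema) {
      if (inFile.contains(col)) {
        result.readerColumns.add(col);
      } else {
        result.missingColumns.add(col);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // A "v1" file lacks the column that "v2" files added.
    List<String> provided = Arrays.asList("id", "name", "added_in_v2");
    List<String> v1File = Arrays.asList("id", "name");
    ColumnSplit s = split(provided, v1File);
    System.out.println("reader handles: " + s.readerColumns);
    System.out.println("framework fills: " + s.missingColumns);
  }
}
```

The point is only that the reader never sees `missingColumns`; whatever fills them in lives outside the reader.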
In the ideal case, the provided schema says to the reader, "of the ways you
know how to convert a column, use **this** one." So, the reader has to know the
target type of the column.
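As a hedged illustration of "use **this** one": the sketch below assumes a reader whose source values are strings and which knows a fixed set of conversions. `ConversionChooser`, `TargetType` and `choose` are made-up names for this example, not existing Drill classes.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: the provided type picks one of the reader's known conversions.
class ConversionChooser {
  enum TargetType { VARCHAR, INT, DOUBLE, DATE }

  // Conversions this (imaginary) reader knows how to do from a string source value.
  private final Map<TargetType, Function<String, Object>> supported =
      new EnumMap<>(TargetType.class);
  // What the reader produces when no provided schema mentions the column.
  private final TargetType defaultType = TargetType.VARCHAR;

  ConversionChooser() {
    supported.put(TargetType.VARCHAR, s -> s);
    supported.put(TargetType.INT, s -> Integer.valueOf(s));
    supported.put(TargetType.DOUBLE, s -> Double.valueOf(s));
    // DATE is deliberately unsupported in this sketch.
  }

  // providedType == null means "no suggestion": fall back to the reader's default.
  Function<String, Object> choose(TargetType providedType) {
    TargetType target = (providedType == null) ? defaultType : providedType;
    Function<String, Object> conversion = supported.get(target);
    if (conversion == null) {
      throw new IllegalArgumentException("Reader cannot convert to " + target);
    }
    return conversion;
  }

  public static void main(String[] args) {
    ConversionChooser chooser = new ConversionChooser();
    Object asInt = chooser.choose(TargetType.INT).apply("42");
    Object asDefault = chooser.choose(null).apply("42");
    System.out.println(asInt.getClass().getSimpleName() + " " + asInt);
    System.out.println(asDefault.getClass().getSimpleName() + " " + asDefault);
  }
}
```

Whether an unsupported target type should fail, or quietly fall back to the reader's default, is exactly the kind of policy question I'd like to settle.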
All this said, my goal in a later PR is to reshuffle some of this stuff
again. Now that we have multiple examples, is there some work that can be
pulled out and reused? For example, can we build an explicit mapping from the
reader's idea of the schema to the provided schema? Or, does it turn out that
each reader's idea of types is unique to that reader, so there is no common
mechanism? For example, the Avro code looks quite specific to Avro and is
perhaps not easily reused.
Suggestions?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services