paul-rogers commented on issue #2026: DRILL-7330: Implement metadata usage for 
all format plugins
URL: https://github.com/apache/drill/pull/2026#issuecomment-599018591
 
 
   Looks like a very cool feature. I've not been following the metadata 
implementation closely. Can you help get me up to speed by providing a bit more 
background information? What is the goal of this PR? Does it enable the format 
plugins to gather metadata if they choose, or does this PR actually add the 
metadata gathering itself?
   
   As I understand it, one of the things the metadata framework does is infer 
schema. Whether we infer schema for metadata or for a scan, we hit the same 
ambiguities. How does this code handle a schema conflict? Or do we just assume 
the schema is whatever we get in the sample?
   
   How do we gather stats? Do we have the reader read all the data and have a 
downstream operator make sense of the data?
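   
   To pin the question down, here is a minimal sketch of the second option, with made-up names (I'm asking whether the PR works roughly this way): the reader just delivers values, and a downstream collector accumulates per-column stats in a single pass.
   
   ```java
   // Minimal sketch of one-pass stats collection downstream of a reader.
   // All names here are hypothetical, not the PR's actual classes.
   public class ColumnStatsCollector {
     private long rowCount;
     private long nullCount;
     private long min = Long.MAX_VALUE;
     private long max = Long.MIN_VALUE;
   
     // Called once per row with the column's value (null allowed).
     public void accept(Long value) {
       rowCount++;
       if (value == null) {
         nullCount++;
         return;
       }
       min = Math.min(min, value);
       max = Math.max(max, value);
     }
   
     @Override
     public String toString() {
       return String.format("rows=%d, nulls=%d, min=%d, max=%d",
           rowCount, nullCount, min, max);
     }
   
     public static void main(String[] args) {
       ColumnStatsCollector stats = new ColumnStatsCollector();
       for (Long v : new Long[] {10L, null, 3L, 42L}) {
         stats.accept(v);
       }
       System.out.println(stats); // rows=4, nulls=1, min=3, max=42
     }
   }
   ```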
   
   For files that need a provided schema (CSV, say), do we compute stats on the 
columns after type conversion, or on the raw text values? That is, does this 
work use the provided schema if available? How does the provided schema relate 
to the metadata schema?
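   
   The distinction matters because min/max over raw text orders lexically, while min/max over converted values orders numerically. A toy illustration (hypothetical column, just to make the question concrete):
   
   ```java
   import java.util.Arrays;
   import java.util.Comparator;
   
   // Toy illustration of why the conversion question matters for CSV stats:
   // lexical min/max over raw text differs from numeric min/max after a
   // provided schema converts the column (say, to INT).
   public class CsvStatsDemo {
     public static void main(String[] args) {
       String[] raw = {"9", "10", "100"};
   
       // Stats on raw text: lexical order puts "10" before "9".
       String textMin = Arrays.stream(raw).min(Comparator.naturalOrder()).get();
       String textMax = Arrays.stream(raw).max(Comparator.naturalOrder()).get();
       System.out.println("raw text:  min=" + textMin + ", max=" + textMax);
   
       // Stats after type conversion: numeric order, min=9, max=100.
       int numMin = Arrays.stream(raw).mapToInt(Integer::parseInt).min().getAsInt();
       int numMax = Arrays.stream(raw).mapToInt(Integer::parseInt).max().getAsInt();
       System.out.println("converted: min=" + numMin + ", max=" + numMax);
     }
   }
   ```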
   
   What stats will we gather for non-Parquet files? How will we use them? Looks 
like there is code for partitions (have not looked in depth, so I may be 
wrong). Are we using stats for partition pruning? If so, how does that differ 
from the existing practice of just walking the directory tree?
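   
   For reference, the existing practice I have in mind looks roughly like the sketch below (hypothetical paths and helper, not the planner's actual code): prune by matching the dir0 component while walking the tree, which is what stats-based pruning would have to improve on.
   
   ```java
   import java.util.Arrays;
   import java.util.List;
   import java.util.stream.Collectors;
   
   // Rough sketch of directory-based pruning as I understand the existing
   // practice: keep only the files whose dir0 component satisfies the
   // predicate. Hypothetical paths; not the planner's implementation.
   public class DirPruningDemo {
     public static void main(String[] args) {
       List<String> files = Arrays.asList(
           "/logs/2019/jan/a.csv",
           "/logs/2019/feb/b.csv",
           "/logs/2020/jan/c.csv");
   
       // Equivalent of WHERE dir0 = '2019' on a scan of /logs.
       List<String> kept = files.stream()
           .filter(f -> f.startsWith("/logs/2019/"))
           .collect(Collectors.toList());
       System.out.println(kept); // [/logs/2019/jan/a.csv, /logs/2019/feb/b.csv]
     }
   }
   ```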
   
   I think that if I understand some of this background I'll be able to do a 
more complete review. Thanks!
   
   Just so we're on the same page, I'm working on a revision to how we handle 
schema. Basically, the EVF-based operators will fully integrate the provided 
schema, and will be ready for a "defined" schema created by the planner (as in 
a classic query engine where the planner does all the schema calculations). The 
idea is to use dynamic schema (what Drill has always done) when sampling the 
first row tells us all we need to know (as in Parquet), but to encourage a 
provided schema when sampling is not reliable (as in JSON).
   
   This means that we have a flow something like this:
   
   ```
   User --> Provided Schema --> Scan <-- Reader <-- Input Source Schema
                                  |
                                  v
                        Scan output schema
   ```
   The scan output schema describes the data a scan will deliver. Hopefully, 
this is also the schema used by stats gathering.
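   
   In code terms, I picture the merge roughly like this (hypothetical types; EVF's real schema classes are richer): provided columns win where they exist, the reader's inferred columns fill the gaps, and the result is the scan output schema.
   
   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;
   
   // Rough sketch of the flow above: the provided schema takes precedence
   // and the reader's input source schema fills in the rest. Hypothetical
   // column-name-to-type maps stand in for EVF's actual schema classes.
   public class ScanSchemaDemo {
     public static void main(String[] args) {
       Map<String, String> provided = new LinkedHashMap<>();
       provided.put("ts", "TIMESTAMP");   // user says this column is a timestamp
   
       Map<String, String> readerSchema = new LinkedHashMap<>();
       readerSchema.put("ts", "VARCHAR"); // a CSV reader sees only text
       readerSchema.put("msg", "VARCHAR");
   
       // Scan output schema: provided columns win; the reader supplies the rest.
       Map<String, String> scanOutput = new LinkedHashMap<>(readerSchema);
       scanOutput.putAll(provided);
       System.out.println(scanOutput); // {ts=TIMESTAMP, msg=VARCHAR}
     }
   }
   ```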
   
   Do you see any potential conflicts between your metadata model and the above 
provided schema model?
   
