paul-rogers commented on issue #2026: DRILL-7330: Implement metadata usage for all format plugins
URL: https://github.com/apache/drill/pull/2026#issuecomment-599018591

Looks like a very cool feature. I've not been following the metadata implementation closely. Can you help get me up to speed by providing a bit more background information?

* What is the goal of this PR? Does it enable the format plugins to gather metadata if they choose, or does this PR actually add the metadata gathering itself?
* As I understand it, one of the things that the metadata framework does is to infer schema. Whether inferring schema for metadata or inferring schema for a scan, we hit the same ambiguities. How does this code handle a schema conflict? Or do we just assume the schema is whatever we get in the sample?
* How do we gather stats? Do we have the reader read all the data and have a downstream operator make sense of the data?
* For files that need a provided schema (CSV, say), do we apply stats to the columns after type conversion, or are stats gathered on the raw text values? That is, does this work use the provided schema if available? How does the provided schema relate to the metadata schema?
* What stats will we gather for non-Parquet files? How will we use them?
* Looks like there is code for partitions (I have not looked in depth, so I may be wrong). Are we using stats for partition pruning? If so, how does that differ from the existing practice of just walking the directory tree?

I think that if I understand some of this background I'll be able to do a more complete review. Thanks!

Just so we're on the same page, I'm working on a revision to how we handle schema. Basically, the EVF-based operators will fully integrate the provided schema, and will be ready for a "defined" schema created by the planner (as in a classic query engine, where the planner does all the schema calculations).
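On the CSV stats question above, the raw-text-vs-converted distinction matters because min/max computed on strings follow lexicographic order, not numeric order. A minimal, Drill-independent Python sketch (the column values are hypothetical, not from the PR):

```python
# Hypothetical CSV column values as read (raw text) vs. after a
# provided-schema conversion to INT. Min/max stats disagree because
# lexicographic order is not numeric order.
raw = ["9", "10", "100"]            # raw text values from a CSV reader
converted = [int(v) for v in raw]   # after provided-schema type conversion

raw_min, raw_max = min(raw), max(raw)              # "10", "9" (lexicographic)
num_min, num_max = min(converted), max(converted)  # 9, 100   (numeric)

print(raw_min, raw_max)  # -> 10 9
print(num_min, num_max)  # -> 9 100
```

So stats gathered before conversion could mislead a planner that later treats the column as numeric; that is why the ordering of conversion and stats gathering is worth pinning down.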
The idea is to use dynamic schema (what Drill has always done) when sampling the first row tells us all we need to know (as in Parquet), but to encourage a provided schema when sampling is not reliable (as in JSON). This means that we have a flow something like this:

```
User --> Provided Schema --> Scan <-- Reader <-- Input Source Schema
                               |
                               v
                      Scan output schema
```

The scan output schema describes the data a scan will deliver. Hopefully, this is also the schema used by stats gathering. Do you see any potential conflicts between your metadata model and the above provided schema model?
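To make the flow concrete, here is a minimal sketch of the resolution step the diagram implies; the function name, dict representation, and merge rule are illustrative assumptions, not Drill's actual API:

```python
# Sketch of the schema resolution implied by the diagram above.
# Assumption for illustration: where the provided schema names a column,
# it wins; the sampled (dynamic) schema fills in everything else.

def resolve_scan_schema(provided: dict, sampled: dict) -> dict:
    """Merge a user-provided schema with a schema inferred by sampling,
    yielding the scan output schema."""
    resolved = dict(sampled)   # start from the input-source (sampled) schema
    resolved.update(provided)  # user-provided column types take precedence
    return resolved

# JSON-style case: sampling guessed VARCHAR for `b`, the user declared DATE.
sampled = {"a": "BIGINT", "b": "VARCHAR"}   # inferred from the first rows
provided = {"b": "DATE"}                    # user-provided schema
print(resolve_scan_schema(provided, sampled))  # {'a': 'BIGINT', 'b': 'DATE'}
```

If stats gathering consumes this resolved scan output schema rather than the raw reader schema, the two models should agree; a conflict would arise only if metadata is collected against the sampled types.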
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: [email protected]
With regards,
Apache Git Services
