[GitHub] [drill] vvysotskyi commented on issue #2026: DRILL-7330: Implement metadata usage for all format plugins

GitBox Sat, 14 Mar 2020 02:27:00 -0700

vvysotskyi commented on issue #2026: DRILL-7330: Implement metadata usage for 
all format plugins
URL: https://github.com/apache/drill/pull/2026#issuecomment-599032547
 
 
   @paul-rogers, this pull request enables the format plugin to gather 
metadata. Metadata gathering logic was added in DRILL-7273.
   
   Regarding the schema: when metadata is being collected, the rules are the 
same as for regular select queries. Drill either tries to infer the table 
schema or uses the user-provided schema.
   
   The metadata collection logic may become clearer after reading this section 
of the docs: 
https://github.com/apache/drill/blob/master/docs/dev/MetastoreAnalyze.md#analyze-operators-description
 or this design doc: 
https://docs.google.com/document/d/14pSIzKqDltjLEEpEebwmKnsDPxyS_6jGrPOjXu6M_NM/edit?usp=sharing
   In short, yes: we use a reader that reads all the data, and downstream 
operators that transform and store its statistics.
   
   > For files that need a provided schema (CSV, say), do we apply stats to the 
columns after type conversion, or are stats gathered on the raw text values? 
That is, does this work use the provided schema if available?
   
   Yes, we apply stats to the columns after schema conversion, so stats such as 
min/max have correct values with respect to the natural ordering of the 
converted types.
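   
   To illustrate why this matters, here is a minimal Python sketch (not Drill 
code; the column values and the INT type are made-up assumptions) comparing 
min/max computed on raw text with min/max computed after conversion:

```python
# Hypothetical raw values of one CSV column, as read from the file.
raw_values = ["9", "10", "100", "25"]

# Stats on raw text follow lexicographic ordering.
text_min, text_max = min(raw_values), max(raw_values)

# Stats after applying an assumed provided schema that declares the column INT.
int_values = [int(v) for v in raw_values]
int_min, int_max = min(int_values), max(int_values)

print(text_min, text_max)  # 10 9
print(int_min, int_max)    # 9 100
```

   With text ordering, "9" sorts after "100", so a filter like `col > 50` 
could not be evaluated safely against the text-based min/max.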
   
   > How does the provided schema relate to the metadata schema?
   
   After the provided schema is applied in the scan, Drill takes the resolved 
schema for the columns and stores it in the Metastore.
   
   > What stats will we gather for non-Parquet files? How will we use them? 
Looks like there is code for partitions (have not looked in depth, so I may be 
wrong). Are we using stats for partition pruning? If so, how does that differ 
from the existing practice of just walking the directory tree?
   
   We collect exactly the same stats for non-Parquet files as for Parquet, and 
we can use them in the same way: to prune files when a filter on specific 
columns is specified, and to prune unneeded files for limit queries. Directory 
pruning still works the same way as it did before these changes (it also works 
for Parquet).
   I think some tests in `TestMetastoreWithEasyFormatPlugin` will help clarify 
which optimizations were added.
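   
   As a rough conceptual sketch of the two kinds of file pruning mentioned 
above (this is not Drill's implementation; `FileStats`, `prune_for_filter` and 
`prune_for_limit` are hypothetical names, and the stats are made up):

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical per-file metadata: row count and min/max of one column."""
    path: str
    row_count: int
    col_min: int
    col_max: int


def prune_for_filter(files, lo, hi):
    """Keep only files whose [col_min, col_max] range overlaps [lo, hi]."""
    return [f for f in files if f.col_max >= lo and f.col_min <= hi]


def prune_for_limit(files, limit):
    """Keep just enough files, in order, to supply `limit` rows."""
    selected, rows = [], 0
    for f in files:
        if rows >= limit:
            break
        selected.append(f)
        rows += f.row_count
    return selected


files = [
    FileStats("a.csv", 100, 1, 10),
    FileStats("b.csv", 100, 11, 20),
    FileStats("c.csv", 100, 21, 30),
]

# Filter: col BETWEEN 12 AND 15 can only match rows in b.csv.
print([f.path for f in prune_for_filter(files, 12, 15)])  # ['b.csv']

# LIMIT 150 needs only the first two files.
print([f.path for f in prune_for_limit(files, 150)])  # ['a.csv', 'b.csv']
```

   Both functions rely only on stored statistics, so the decision can be made 
at planning time without opening the files.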
   
   > Do you see any potential conflicts between your metadata model and the 
above provided schema model?
   
   Looks like there shouldn't be any conflicts.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
