Miroslav Holubec created DRILL-4601:
---------------------------------------
Summary: Partitioning based on the parquet statistics
Key: DRILL-4601
URL: https://issues.apache.org/jira/browse/DRILL-4601
Project: Apache Drill
Issue Type: Improvement
Components: Query Planning & Optimization
Reporter: Miroslav Holubec
Extending the partitioning idea implemented in DRILL-3333 even further could
significantly improve performance.
Currently, partitioning relies on statistics only when the min value equals
the max value for the whole file. Based on this, files are removed from the
scan in the planning phase. The problem is that this leads to many small
Parquet files, which is undesirable on HDFS. Also, only a few columns are
partitioned.
I would like to extend this idea to use the statistics of all columns. If a
filter requires a column to equal a constant, every file whose min/max
statistics for that column exclude the constant can be removed from the plan.
This should noticeably improve performance for scans over many Parquet files.
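
To illustrate the idea, here is a minimal sketch in Python. It is not the
patch and not Drill planner code: FileStats, is_single_valued, may_contain
and prune are hypothetical names, and it assumes the per-column min/max
statistics have already been read from each file's Parquet footer.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class FileStats:
    """Hypothetical holder for one file's per-column (min, max) statistics."""
    path: str
    ranges: Dict[str, Tuple[int, int]]


def is_single_valued(stats: FileStats, column: str) -> bool:
    # Condition described for the current approach: min equals max,
    # so the whole file holds a single value for this column.
    rng = stats.ranges.get(column)
    return rng is not None and rng[0] == rng[1]


def may_contain(stats: FileStats, column: str, constant: int) -> bool:
    # Proposed extension: a file can match `column = constant` only if
    # the constant lies inside the file's [min, max] range.
    rng = stats.ranges.get(column)
    if rng is None:
        return True  # no statistics available, so the file must be kept
    lo, hi = rng
    return lo <= constant <= hi


def prune(files: List[FileStats], column: str, constant: int) -> List[FileStats]:
    # Keep only the files that could still contain a matching row.
    return [f for f in files if may_contain(f, column, constant)]
```

With this check, a file does not need to hold a single value per column to be
pruned, so larger files can still be skipped at planning time. Files without
statistics for the filtered column are always kept.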
I have an initial patch ready; for now it is only meant to illustrate the idea.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)