[ https://issues.apache.org/jira/browse/DRILL-4601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Miroslav Holubec updated DRILL-4601:
------------------------------------
    Attachment: DRILL-4601.1.patch

> Partitioning based on the parquet statistics
> --------------------------------------------
>
>                 Key: DRILL-4601
>                 URL: https://issues.apache.org/jira/browse/DRILL-4601
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Query Planning & Optimization
>            Reporter: Miroslav Holubec
>              Labels: parquet, partitioning, planning, statistics
>         Attachments: DRILL-4601.1.patch
>
>
> Extending the partitioning idea implemented in DRILL-3333 even further
> could really help performance.
> Currently, partitioning is based on statistics and applies when the min
> value equals the max value for the whole file. Based on this, files are
> removed from the scan in the planning phase. The problem is that this leads
> to many small parquet files, which is not good in the HDFS world. Also, only
> a few columns are partitioned.
> I would like to extend this idea to use all statistics for all columns. So if
> a filter requires a column value to equal a constant, remove from the plan
> all files whose statistics rule that value out. This will really help
> performance for scans over many parquet files.
> I have an initial patch ready, currently just to give an idea (it reuses
> metadata v2).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
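
The pruning rule proposed in the issue can be illustrated with a minimal sketch. This is not the attached patch and not Drill's planner code. The names `FileStats`, `may_contain`, and `prune_files` are hypothetical, and the sketch assumes each file exposes per-column (min, max) statistics and that the filter is a simple equality against a constant. A file with no statistics for the column is kept, since it cannot be ruled out.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class FileStats:
    """Hypothetical per-file summary: column name -> (min, max)."""
    path: str
    column_ranges: Dict[str, Tuple[int, int]]


def may_contain(stats: FileStats, column: str, constant: int) -> bool:
    """Return False only when the statistics prove the value is absent."""
    value_range = stats.column_ranges.get(column)
    if value_range is None:
        # No statistics for this column: the file cannot be pruned.
        return True
    lo, hi = value_range
    return lo <= constant <= hi


def prune_files(files: List[FileStats], column: str, constant: int) -> List[FileStats]:
    """Keep only the files that might hold rows where column == constant."""
    return [f for f in files if may_contain(f, column, constant)]


files = [
    FileStats("a.parquet", {"year": (2010, 2012)}),
    FileStats("b.parquet", {"year": (2014, 2016)}),
    FileStats("c.parquet", {"month": (1, 12)}),      # no stats for "year"
    FileStats("d.parquet", {"year": (2015, 2015)}),  # min == max case
]

kept = prune_files(files, "year", 2015)
print([f.path for f in kept])  # ['b.parquet', 'c.parquet', 'd.parquet']
```

The min == max case that the existing partitioning relies on falls out as a special case of the range check, which is why larger files with wider ranges can still be pruned.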