[
https://issues.apache.org/jira/browse/DRILL-4601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Miroslav Holubec updated DRILL-4601:
------------------------------------
Attachment: DRILL-4601.1.patch
> Partitioning based on the parquet statistics
> --------------------------------------------
>
> Key: DRILL-4601
> URL: https://issues.apache.org/jira/browse/DRILL-4601
> Project: Apache Drill
> Issue Type: Improvement
> Components: Query Planning & Optimization
> Reporter: Miroslav Holubec
> Labels: parquet, partitioning, planning, statistics
> Attachments: DRILL-4601.1.patch
>
>
> Extending the partitioning idea implemented in DRILL-3333 even further
> could significantly improve performance.
> Currently, partitioning is based on statistics and applies only when a
> column's min value equals its max value for the whole file. Based on this,
> files are removed from the scan in the planning phase. The problem is that
> this leads to many small Parquet files, which is undesirable on HDFS. Also,
> only a few columns are partitioned.
> I would like to extend this idea to use all statistics for all columns: if a
> filter requires a column to equal a constant, remove from the plan all files
> whose min/max statistics exclude that constant. This should noticeably help
> performance for scans over many Parquet files.
> I have an initial patch ready, currently just to give an idea (it reuses
> metadata v2).
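
The pruning rule described above can be illustrated with a minimal sketch. It is written in Python rather than Drill's Java, and it is not taken from the attached patch: the names FileStats, may_contain and prune_files are hypothetical, and the per-file (min, max) ranges stand in for whatever the Parquet metadata cache actually stores. It assumes an equality filter on a single column and keeps any file that has no statistics for that column.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class FileStats:
    """Hypothetical per-file summary: column name -> (min, max)."""
    path: str
    column_ranges: Dict[str, Tuple[int, int]]


def may_contain(stats: FileStats, column: str, constant: int) -> bool:
    """Return True if the file could hold a row where column == constant."""
    value_range = stats.column_ranges.get(column)
    if value_range is None:
        # No statistics for this column: the file cannot be ruled out.
        return True
    low, high = value_range
    return low <= constant <= high


def prune_files(files: List[FileStats], column: str, constant: int) -> List[FileStats]:
    """Keep only the files whose min/max range admits the constant."""
    return [f for f in files if may_contain(f, column, constant)]


files = [
    FileStats("part-0.parquet", {"id": (1, 100)}),
    FileStats("part-1.parquet", {"id": (101, 200)}),
    FileStats("part-2.parquet", {}),  # statistics missing
]

# Equality filter: id = 150
kept = [f.path for f in prune_files(files, "id", 150)]
print(kept)
```

Unlike the min == max case, a file that survives this check may still contain no matching row, so the filter itself still has to be evaluated during the scan; the check only removes files that cannot match.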