[ https://issues.apache.org/jira/browse/DRILL-4601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Miroslav Holubec updated DRILL-4601:
------------------------------------
    Attachment: DRILL-4601.1.patch

> Partitioning based on the parquet statistics
> --------------------------------------------
>
>                 Key: DRILL-4601
>                 URL: https://issues.apache.org/jira/browse/DRILL-4601
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Query Planning & Optimization
>            Reporter: Miroslav Holubec
>              Labels: parquet, partitioning, planning, statistics
>         Attachments: DRILL-4601.1.patch
>
>
> Extending the partitioning idea implemented in DRILL-3333 even further could 
> significantly improve performance.
> Currently, partitioning is based on statistics only when the min value equals 
> the max value for the whole file. Based on this, files are removed from the scan 
> in the planning phase. The problem is that this leads to many small parquet 
> files, which does not work well in the HDFS world. Also, only a few columns are 
> partitioned.
> I would like to extend this idea to use all statistics for all columns. So if 
> a filter requires a column to equal a constant, remove from the plan all files 
> whose statistics rule that value out (for example, the constant lies outside 
> the file's min/max range). This should noticeably help performance for scans 
> over many parquet files.
> I have an initial patch ready, for now just to give an idea (it reuses 
> metadata v2).
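
The pruning idea described above can be illustrated with a minimal sketch. This is not Drill's code and not the attached patch; the names ColumnStats, may_contain and prune_files are hypothetical, and the sketch assumes integer columns with per-file min/max statistics and a single equality filter. A file without statistics for the column is kept, since it cannot be ruled out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical per-file min/max statistics for one column."""
    min_value: int
    max_value: int


def may_contain(stats: Optional[ColumnStats], constant: int) -> bool:
    """Return True if a file could hold rows where the column equals constant."""
    # Without statistics the file cannot be ruled out, so it stays in the plan.
    if stats is None:
        return True
    return stats.min_value <= constant <= stats.max_value


def prune_files(file_stats: Dict[str, Dict[str, ColumnStats]],
                column: str, constant: int) -> List[str]:
    """Keep only the files whose statistics do not exclude column == constant."""
    return [name for name, columns in file_stats.items()
            if may_contain(columns.get(column), constant)]


# Made-up example data: two files with statistics on "id", one without.
files = {
    "a.parquet": {"id": ColumnStats(1, 100)},
    "b.parquet": {"id": ColumnStats(101, 200)},
    "c.parquet": {},
}

# A filter "id = 150" lets the planner drop a.parquet before the scan.
remaining = prune_files(files, "id", 150)
```

Unlike the min == max case, this check keeps working when each file covers a range of values, so files can stay large while still being skipped at planning time.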



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
