[jira] [Updated] (DRILL-4601) Partitioning based on the parquet statistics

Miroslav Holubec (JIRA) Wed, 13 Apr 2016 02:04:24 -0700

     [ 
https://issues.apache.org/jira/browse/DRILL-4601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Miroslav Holubec updated DRILL-4601:
------------------------------------
    Description: 
Extending the partitioning idea implemented in DRILL-3333 even further could 
significantly improve performance.
Currently, partitioning is based on statistics and applies only when a 
column's min value equals its max value for the whole file. Based on this, 
files are removed from the scan in the planning phase. The problem is that 
this leads to many small parquet files, which is undesirable in the HDFS 
world. Also, only a few columns are partitioned.

I would like to extend this idea to use the statistics for all columns. If a 
filter requires a column to equal a constant, every file whose statistics 
rule out that constant is removed from the plan. This should noticeably help 
performance for scans over many parquet files.
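
To illustrate the difference, here is a minimal sketch in Python. It is not 
Drill's code and not the attached patch; FileStats, prune_single_value and 
prune_by_range are hypothetical names, and the per-file (min, max) pairs are 
made-up values standing in for parquet column statistics. Nulls and 
non-equality filters are ignored.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class FileStats:
    """Hypothetical per-file metadata: column name -> (min, max)."""
    path: str
    column_ranges: Dict[str, Tuple[Any, Any]]


def prune_single_value(files: List[FileStats], column: str, constant: Any) -> List[FileStats]:
    """Current-style pruning: statistics help only when min == max."""
    kept = []
    for f in files:
        rng = f.column_ranges.get(column)
        if rng is not None and rng[0] == rng[1] and rng[0] != constant:
            continue  # single-valued file that cannot contain the constant
        kept.append(f)
    return kept


def prune_by_range(files: List[FileStats], column: str, constant: Any) -> List[FileStats]:
    """Proposed pruning: drop any file whose [min, max] excludes the constant."""
    kept = []
    for f in files:
        rng = f.column_ranges.get(column)
        if rng is not None and not (rng[0] <= constant <= rng[1]):
            continue  # the constant lies outside the file's value range
        kept.append(f)  # files without statistics must always be scanned
    return kept


files = [
    FileStats("a.parquet", {"year": (2014, 2014)}),
    FileStats("b.parquet", {"year": (2015, 2016)}),
    FileStats("c.parquet", {"year": (2010, 2013)}),
    FileStats("d.parquet", {}),  # no statistics available
]

# Filter: year = 2016
single = [f.path for f in prune_single_value(files, "year", 2016)]
ranged = [f.path for f in prune_by_range(files, "year", 2016)]
print(single)  # ['b.parquet', 'c.parquet', 'd.parquet']
print(ranged)  # ['b.parquet', 'd.parquet']
```

In this sketch the range check also removes c.parquet, which the min == max 
rule has to keep because that file holds more than one value of the column.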

I have an initial patch ready, currently just to give an idea. (It changes 
metadata v2, which is not ideal, and it currently supports only the equality 
operation.)

  was:
Extending the partitioning idea implemented in DRILL-3333 even further could 
significantly improve performance.
Currently, partitioning is based on statistics and applies only when a 
column's min value equals its max value for the whole file. Based on this, 
files are removed from the scan in the planning phase. The problem is that 
this leads to many small parquet files, which is undesirable in the HDFS 
world. Also, only a few columns are partitioned.

I would like to extend this idea to use the statistics for all columns. If a 
filter requires a column to equal a constant, every file whose statistics 
rule out that constant is removed from the plan. This should noticeably help 
performance for scans over many parquet files.

I have an initial patch ready, currently just to give an idea (it reuses 
metadata v2).


> Partitioning based on the parquet statistics
> --------------------------------------------
>
>                 Key: DRILL-4601
>                 URL: https://issues.apache.org/jira/browse/DRILL-4601
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Query Planning & Optimization
>            Reporter: Miroslav Holubec
>              Labels: parquet, partitioning, planning, statistics
>         Attachments: DRILL-4601.1.patch
>
>
> Extending the partitioning idea implemented in DRILL-3333 even further 
> could significantly improve performance.
> Currently, partitioning is based on statistics and applies only when a 
> column's min value equals its max value for the whole file. Based on this, 
> files are removed from the scan in the planning phase. The problem is that 
> this leads to many small parquet files, which is undesirable in the HDFS 
> world. Also, only a few columns are partitioned.
> I would like to extend this idea to use the statistics for all columns. If a 
> filter requires a column to equal a constant, every file whose statistics 
> rule out that constant is removed from the plan. This should noticeably help 
> performance for scans over many parquet files.
> I have an initial patch ready, currently just to give an idea. (It changes 
> metadata v2, which is not ideal, and it currently supports only the equality 
> operation.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
