[
https://issues.apache.org/jira/browse/DRILL-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172040#comment-16172040
]
ASF GitHub Bot commented on DRILL-5795:
---------------------------------------
GitHub user dprofeta opened a pull request:
https://github.com/apache/drill/pull/949
DRILL-5795: Parquet Filter push down at rowgroup level
Before this commit, the filter could only prune complete files. When a file
is composed of multiple rowgroups, it was not able to prune a single
rowgroup from the file. Now, when the filter finds that a rowgroup
doesn't match, that rowgroup is removed from the scan.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dprofeta/drill drill-5795
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/drill/pull/949.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #949
----
commit eed3395647b10d06edf86ba4378995e9fd8da83d
Author: Damien Profeta <[email protected]>
Date: 2017-09-15T18:01:58Z
Parquet Filter push down now works at rowgroup level
Before this commit, the filter could only prune complete files. When a file
is composed of multiple rowgroups, it was not able to prune a single
rowgroup from the file. Now, when the filter finds that a rowgroup
doesn't match, that rowgroup is removed from the scan.
----
> Filter pushdown for parquet handles multi rowgroup file
> -------------------------------------------------------
>
> Key: DRILL-5795
> URL: https://issues.apache.org/jira/browse/DRILL-5795
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Parquet
> Reporter: Damien Profeta
>
> DRILL-1950 implemented filter pushdown for parquet files, but only for the
> case of one rowgroup per parquet file. With multiple rowgroups per
> file, it detects that a rowgroup can be pruned but then tells the
> drillbit to read the whole file, which leads to performance issues.
> Having multiple rowgroups per file helps to handle partitioned datasets and
> still read only the relevant subset of data without ending up with more files
> than really needed.
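
The difference between file-level and rowgroup-level pruning described above can be illustrated with a minimal sketch. This is not Drill's implementation: the `RowGroup` class, the function names, and the sample data are all hypothetical, and the sketch assumes the filter is a simple equality test evaluated against per-rowgroup min/max statistics.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RowGroup:
    """Hypothetical stand-in for one rowgroup's min/max statistics."""
    file: str
    index: int
    min_value: int
    max_value: int


def may_match_eq(rg: RowGroup, value: int) -> bool:
    """True if a filter `col = value` could match rows in this rowgroup."""
    return rg.min_value <= value <= rg.max_value


def prune_by_file(row_groups: List[RowGroup], value: int) -> List[RowGroup]:
    """File-level pruning: a file is kept whole if any of its rowgroups may match."""
    files = {rg.file for rg in row_groups if may_match_eq(rg, value)}
    return [rg for rg in row_groups if rg.file in files]


def prune_by_row_group(row_groups: List[RowGroup], value: int) -> List[RowGroup]:
    """Rowgroup-level pruning: only rowgroups that may match stay in the scan."""
    return [rg for rg in row_groups if may_match_eq(rg, value)]


# Made-up sample data: two files, one of them with two rowgroups.
ROW_GROUPS = [
    RowGroup("a.parquet", 0, 0, 99),
    RowGroup("a.parquet", 1, 100, 199),
    RowGroup("b.parquet", 0, 200, 299),
]

if __name__ == "__main__":
    for rg in prune_by_file(ROW_GROUPS, 150):
        print("file-level scan:", rg.file, rg.index)
    for rg in prune_by_row_group(ROW_GROUPS, 150):
        print("rowgroup-level scan:", rg.file, rg.index)
```

With the sample data and a filter value of 150, file-level pruning still scans both rowgroups of `a.parquet`, while rowgroup-level pruning scans only the second one.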