[
https://issues.apache.org/jira/browse/DRILL-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173533#comment-16173533
]
ASF GitHub Bot commented on DRILL-5795:
---------------------------------------
Github user parthchandra commented on a diff in the pull request:
https://github.com/apache/drill/pull/949#discussion_r140036046
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
@@ -1095,7 +1104,7 @@ public GroupScan applyFilter(LogicalExpression filterExpr, UdfUtilities udfUtili
     final Set<SchemaPath> schemaPathsInExpr = filterExpr.accept(new ParquetRGFilterEvaluator.FieldReferenceFinder(), null);
-    final List<RowGroupMetadata> qualifiedRGs = new ArrayList<>(parquetTableMetadata.getFiles().size());
+    final List<RowGroupInfo> qualifiedRGs = new ArrayList<>(rowGroupInfos.size());
--- End diff --
Never mind the previous comment. It's probably better to use RowGroupInfos
throughout the code.
> Filter pushdown for parquet handles multi rowgroup file
> -------------------------------------------------------
>
> Key: DRILL-5795
> URL: https://issues.apache.org/jira/browse/DRILL-5795
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Parquet
> Reporter: Damien Profeta
> Assignee: Damien Profeta
> Labels: doc-impacting
>
> DRILL-1950 implemented filter pushdown for parquet files, but only for the
> case of one rowgroup per parquet file. When a file contains multiple
> rowgroups, it detects that a rowgroup can be pruned but then tells the
> drillbit to read the whole file, which leads to performance issues.
> Having multiple rowgroups per file helps to handle partitioned datasets
> while still reading only the relevant subset of data, without ending up with
> more files than really needed.
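
To make the behaviour described above concrete, here is a minimal sketch of pruning at rowgroup granularity rather than file granularity. It is not Drill's implementation: the RowGroup class, the prune method, and the single-column min/max statistics are hypothetical stand-ins chosen for illustration, and the filter is assumed to be a simple equality on one long column.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration; these are NOT Drill's
// RowGroupInfo or ParquetGroupScan classes.
final class RowGroupPruningSketch {

  /** One rowgroup of a parquet file, with min/max statistics for a single long column. */
  static final class RowGroup {
    final String file;
    final int index;
    final long min;
    final long max;

    RowGroup(String file, int index, long min, long max) {
      this.file = file;
      this.index = index;
      this.min = min;
      this.max = max;
    }
  }

  /**
   * Keeps only the rowgroups whose [min, max] range may contain the value,
   * i.e. the assumed filter is "column = value". The result lists individual
   * rowgroups, so a file with one matching rowgroup is not read in full.
   */
  static List<RowGroup> prune(List<RowGroup> rowGroups, long value) {
    List<RowGroup> qualified = new ArrayList<>(rowGroups.size());
    for (RowGroup rg : rowGroups) {
      if (rg.min <= value && value <= rg.max) {
        qualified.add(rg);
      }
    }
    return qualified;
  }

  public static void main(String[] args) {
    List<RowGroup> rowGroups = new ArrayList<>();
    rowGroups.add(new RowGroup("a.parquet", 0, 0, 99));
    rowGroups.add(new RowGroup("a.parquet", 1, 100, 199));
    rowGroups.add(new RowGroup("b.parquet", 0, 200, 299));

    // Only the second rowgroup of a.parquet can contain the value 150.
    for (RowGroup rg : prune(rowGroups, 150)) {
      System.out.println(rg.file + " row group " + rg.index);
    }
  }
}
```

In this sketch the scan keeps a list of qualified rowgroups instead of a list of qualified files, which is the distinction the issue is about: a.parquet contributes one rowgroup to the scan, not both.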