[jira] [Commented] (DRILL-5795) Filter pushdown for parquet handles multi rowgroup file

ASF GitHub Bot (JIRA) Wed, 20 Sep 2017 10:24:38 -0700

    [ https://issues.apache.org/jira/browse/DRILL-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173533#comment-16173533 ]


ASF GitHub Bot commented on DRILL-5795:
---------------------------------------

Github user parthchandra commented on a diff in the pull request:

    https://github.com/apache/drill/pull/949#discussion_r140036046
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -1095,7 +1104,7 @@ public GroupScan applyFilter(LogicalExpression filterExpr, UdfUtilities udfUtili
     
         final Set<SchemaPath> schemaPathsInExpr = filterExpr.accept(new ParquetRGFilterEvaluator.FieldReferenceFinder(), null);
     
    -    final List<RowGroupMetadata> qualifiedRGs = new ArrayList<>(parquetTableMetadata.getFiles().size());
    +    final List<RowGroupInfo> qualifiedRGs = new ArrayList<>(rowGroupInfos.size());
    --- End diff --
    
    Never mind the previous comment. It's probably better to use RowGroupInfos 
throughout the code. 


> Filter pushdown for parquet handles multi rowgroup file
> -------------------------------------------------------
>
>                 Key: DRILL-5795
>                 URL: https://issues.apache.org/jira/browse/DRILL-5795
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>            Reporter: Damien Profeta
>            Assignee: Damien Profeta
>              Labels: doc-impacting
>
> DRILL-1950 implemented filter pushdown for parquet files, but only for the 
> case of one rowgroup per parquet file. When a file contains multiple 
> rowgroups, it detects that a rowgroup can be pruned but then tells the 
> drillbit to read the whole file, which leads to performance issues.
> Having multiple rowgroups per file helps to handle partitioned datasets while 
> still reading only the relevant subset of data, without ending up with more 
> files than really needed.
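
For illustration only, the per-rowgroup pruning described above can be sketched as follows. This is a minimal sketch and not Drill's implementation: RowGroupStats, qualify, and the single long column with a closed range filter are all hypothetical stand-ins for RowGroupInfo and the filter evaluator named in the diff. It assumes each rowgroup carries min/max statistics for the filtered column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RowGroupPruningSketch {

    /** Hypothetical per-rowgroup metadata: owning file, index within the file, column min/max. */
    static final class RowGroupStats {
        final String path;
        final int rowGroupIndex;
        final long min;
        final long max;

        RowGroupStats(String path, int rowGroupIndex, long min, long max) {
            this.path = path;
            this.rowGroupIndex = rowGroupIndex;
            this.min = min;
            this.max = max;
        }
    }

    /**
     * Keeps only the rowgroups whose [min, max] range overlaps the filter range [lo, hi].
     * The result is a list of rowgroups, not files, so a file with one matching
     * rowgroup does not drag its other rowgroups into the scan.
     */
    static List<RowGroupStats> qualify(List<RowGroupStats> rowGroups, long lo, long hi) {
        List<RowGroupStats> qualified = new ArrayList<>(rowGroups.size());
        for (RowGroupStats rg : rowGroups) {
            boolean canMatch = rg.max >= lo && rg.min <= hi;
            if (canMatch) {
                qualified.add(rg);
            }
        }
        return qualified;
    }

    public static void main(String[] args) {
        List<RowGroupStats> all = Arrays.asList(
            new RowGroupStats("a.parquet", 0, 0, 99),
            new RowGroupStats("a.parquet", 1, 100, 199),
            new RowGroupStats("b.parquet", 0, 200, 299));

        // Hypothetical filter: 120 <= col <= 150
        for (RowGroupStats rg : qualify(all, 120, 150)) {
            System.out.println(rg.path + " rowgroup " + rg.rowGroupIndex);
        }
    }
}
```

With the sample data, only rowgroup 1 of a.parquet survives; rowgroup 0 of the same file is pruned instead of being read along with it.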



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
