[jira] [Commented] (DRILL-5795) Filter pushdown for parquet handles multi rowgroup file

ASF GitHub Bot (JIRA) Wed, 20 Sep 2017 10:24:44 -0700

    [ https://issues.apache.org/jira/browse/DRILL-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173534#comment-16173534 ]


ASF GitHub Bot commented on DRILL-5795:
---------------------------------------

Github user parthchandra commented on a diff in the pull request:

    https://github.com/apache/drill/pull/949#discussion_r140033471
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -819,63 +827,64 @@ private void init() throws IOException {
               }
             }
             rowGroupInfo.setEndpointByteMap(endpointByteMap);
    +        rowGroupInfo.setColumns(rg.getColumns());
             rgIndex++;
             rowGroupInfos.add(rowGroupInfo);
           }
         }
     
         this.endpointAffinities = AffinityCreator.getAffinityMap(rowGroupInfos);
    +    updatePartitionColTypeMap();
    +  }
     
    +  private void updatePartitionColTypeMap() {
         columnValueCounts = Maps.newHashMap();
         this.rowCount = 0;
         boolean first = true;
    -    for (ParquetFileMetadata file : parquetTableMetadata.getFiles()) {
    -      for (RowGroupMetadata rowGroup : file.getRowGroups()) {
    -        long rowCount = rowGroup.getRowCount();
    -        for (ColumnMetadata column : rowGroup.getColumns()) {
    -          SchemaPath schemaPath = SchemaPath.getCompoundPath(column.getName());
    -          Long previousCount = columnValueCounts.get(schemaPath);
    -          if (previousCount != null) {
    -            if (previousCount != GroupScan.NO_COLUMN_STATS) {
    -              if (column.getNulls() != null) {
    -                Long newCount = rowCount - column.getNulls();
    -                columnValueCounts.put(schemaPath, columnValueCounts.get(schemaPath) + newCount);
    -              }
    -            }
    -          } else {
    +    for (RowGroupInfo rowGroup : this.rowGroupInfos) {
    --- End diff --
    
    Isn't this doing the same thing as the original code? rowGroupInfos is 
    built from the RowGroupMetadata in the files, isn't it?


> Filter pushdown for parquet handles multi rowgroup file
> -------------------------------------------------------
>
>                 Key: DRILL-5795
>                 URL: https://issues.apache.org/jira/browse/DRILL-5795
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>            Reporter: Damien Profeta
>            Assignee: Damien Profeta
>              Labels: doc-impacting
>
> DRILL-1950 implemented filter pushdown for parquet files, but only for the 
> case of one rowgroup per parquet file. When a file contains multiple 
> rowgroups, the pushdown detects that a rowgroup can be pruned but then tells 
> the drillbit to read the whole file, which leads to performance issues.
> Having multiple rowgroups per file helps to handle partitioned datasets and 
> still read only the relevant subset of data, without ending up with more 
> files than really needed.
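
To illustrate the behaviour the description asks for, here is a minimal sketch of pruning at rowgroup granularity. It is not Drill's implementation: the RowGroupStats class, its min/max fields, and the rowGroupsToRead method are hypothetical names made up for this example, and the filter is assumed to be a simple equality test on one numeric column.

```java
import java.util.ArrayList;
import java.util.List;

public class RowGroupPruningSketch {

  /** Hypothetical stand-in for the min/max statistics of one column in one rowgroup. */
  static final class RowGroupStats {
    final int index;
    final long min;
    final long max;

    RowGroupStats(int index, long min, long max) {
      this.index = index;
      this.min = min;
      this.max = max;
    }
  }

  /**
   * Returns the indexes of the rowgroups that may contain rows matching
   * "column = value". A rowgroup whose [min, max] range excludes the value
   * is pruned; the others are kept individually, so the reader is never
   * told to scan the whole file just because one rowgroup survived.
   */
  static List<Integer> rowGroupsToRead(List<RowGroupStats> rowGroups, long value) {
    List<Integer> selected = new ArrayList<>();
    for (RowGroupStats rg : rowGroups) {
      if (value >= rg.min && value <= rg.max) {
        selected.add(rg.index);
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    // One file with three rowgroups covering disjoint value ranges (made-up data).
    List<RowGroupStats> file = new ArrayList<>();
    file.add(new RowGroupStats(0, 1, 100));
    file.add(new RowGroupStats(1, 101, 200));
    file.add(new RowGroupStats(2, 201, 300));

    // Only the second rowgroup can contain the value 150.
    System.out.println(rowGroupsToRead(file, 150)); // prints [1]
  }
}
```

With one rowgroup per file, pruning a rowgroup and pruning a file are the same decision. With several rowgroups per file, the selection has to be carried per rowgroup, which is the gap this issue describes.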



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
