[ https://issues.apache.org/jira/browse/DRILL-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173534#comment-16173534 ]
ASF GitHub Bot commented on DRILL-5795:
---------------------------------------

Github user parthchandra commented on a diff in the pull request:

    https://github.com/apache/drill/pull/949#discussion_r140033471

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -819,63 +827,64 @@ private void init() throws IOException {
               }
             }
             rowGroupInfo.setEndpointByteMap(endpointByteMap);
    +        rowGroupInfo.setColumns(rg.getColumns());
             rgIndex++;
             rowGroupInfos.add(rowGroupInfo);
           }
         }
         this.endpointAffinities = AffinityCreator.getAffinityMap(rowGroupInfos);
    +    updatePartitionColTypeMap();
    +  }
    +  private void updatePartitionColTypeMap() {
         columnValueCounts = Maps.newHashMap();
         this.rowCount = 0;
         boolean first = true;
    -    for (ParquetFileMetadata file : parquetTableMetadata.getFiles()) {
    -      for (RowGroupMetadata rowGroup : file.getRowGroups()) {
    -        long rowCount = rowGroup.getRowCount();
    -        for (ColumnMetadata column : rowGroup.getColumns()) {
    -          SchemaPath schemaPath = SchemaPath.getCompoundPath(column.getName());
    -          Long previousCount = columnValueCounts.get(schemaPath);
    -          if (previousCount != null) {
    -            if (previousCount != GroupScan.NO_COLUMN_STATS) {
    -              if (column.getNulls() != null) {
    -                Long newCount = rowCount - column.getNulls();
    -                columnValueCounts.put(schemaPath, columnValueCounts.get(schemaPath) + newCount);
    -              }
    -            }
    -          } else {
    +    for (RowGroupInfo rowGroup : this.rowGroupInfos) {
    --- End diff --

    Isn't this doing the same thing as the original code? RowGroupInfos is built from the RowGroupMetadata in the files?

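One way to reason about the question above is a small standalone model of the counting loop. The sketch below is not Drill code: `Column`, `RowGroup`, `countValues` and the `-1` sentinel are hypothetical stand-ins for `ColumnMetadata`, `RowGroupMetadata`/`RowGroupInfo`, the loop body and `GroupScan.NO_COLUMN_STATS`. The `else` branch is an assumption, since the diff is cut off at that point. The sketch suggests that walking files and then row groups, or walking a flat row-group list, gives identical counts as long as both cover the same row groups; the results only diverge if the flat list has been narrowed (for example by pruning, which the issue description hints at but the diff does not show).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ColumnValueCountSketch {
  // Stand-in for GroupScan.NO_COLUMN_STATS; the real constant's value is not shown in the diff.
  static final long NO_COLUMN_STATS = -1L;

  /** Hypothetical stand-in for ColumnMetadata: a column name and an optional null count. */
  static final class Column {
    final String name;
    final Long nulls; // null means the statistic is missing

    Column(String name, Long nulls) {
      this.name = name;
      this.nulls = nulls;
    }
  }

  /** Hypothetical stand-in for RowGroupMetadata / RowGroupInfo. */
  static final class RowGroup {
    final long rowCount;
    final List<Column> columns;

    RowGroup(long rowCount, List<Column> columns) {
      this.rowCount = rowCount;
      this.columns = columns;
    }
  }

  /** Sums non-null value counts per column over exactly the row groups passed in. */
  static Map<String, Long> countValues(List<RowGroup> rowGroups) {
    Map<String, Long> counts = new HashMap<>();
    for (RowGroup rowGroup : rowGroups) {
      for (Column column : rowGroup.columns) {
        Long previous = counts.get(column.name);
        if (previous != null) {
          // Same shape as the removed lines in the diff: only accumulate
          // while the column still has usable statistics.
          if (previous != NO_COLUMN_STATS && column.nulls != null) {
            counts.put(column.name, previous + (rowGroup.rowCount - column.nulls));
          }
        } else {
          // Assumption: the diff is truncated here. This sketch seeds the count,
          // or marks the column as having no statistics.
          counts.put(column.name,
              column.nulls != null ? rowGroup.rowCount - column.nulls : NO_COLUMN_STATS);
        }
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    RowGroup first = new RowGroup(10, Arrays.asList(new Column("a", 2L), new Column("b", null)));
    RowGroup second = new RowGroup(5, Arrays.asList(new Column("a", 1L), new Column("b", 0L)));

    // All row groups: what a files -> row groups traversal would visit.
    System.out.println(countValues(Arrays.asList(first, second)));
    // A narrowed list: only the second row group survives.
    System.out.println(countValues(Arrays.asList(second)));
  }
}
```

In this model the counts depend only on which row groups are in the list, so the traversal order itself is not what changes the result.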
> Filter pushdown for parquet handles multi rowgroup file
> -------------------------------------------------------
>
>                 Key: DRILL-5795
>                 URL: https://issues.apache.org/jira/browse/DRILL-5795
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>            Reporter: Damien Profeta
>            Assignee: Damien Profeta
>              Labels: doc-impacting
>
> DRILL-1950 implemented filter pushdown for parquet files, but only for the
> case of one rowgroup per parquet file. With multiple rowgroups per file, it
> detects that a rowgroup can be pruned but then tells the drillbit to read
> the whole file, which leads to performance issues.
> Having multiple rowgroups per file helps to handle partitioned datasets and
> still read only the relevant subset of data, without ending up with more
> files than really needed.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)