[
https://issues.apache.org/jira/browse/DRILL-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173534#comment-16173534
]
ASF GitHub Bot commented on DRILL-5795:
---------------------------------------
Github user parthchandra commented on a diff in the pull request:
https://github.com/apache/drill/pull/949#discussion_r140033471
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
@@ -819,63 +827,64 @@ private void init() throws IOException {
           }
         }
         rowGroupInfo.setEndpointByteMap(endpointByteMap);
+        rowGroupInfo.setColumns(rg.getColumns());
         rgIndex++;
         rowGroupInfos.add(rowGroupInfo);
       }
     }
     this.endpointAffinities = AffinityCreator.getAffinityMap(rowGroupInfos);
+    updatePartitionColTypeMap();
+  }
+  private void updatePartitionColTypeMap() {
     columnValueCounts = Maps.newHashMap();
     this.rowCount = 0;
     boolean first = true;
-    for (ParquetFileMetadata file : parquetTableMetadata.getFiles()) {
-      for (RowGroupMetadata rowGroup : file.getRowGroups()) {
-        long rowCount = rowGroup.getRowCount();
-        for (ColumnMetadata column : rowGroup.getColumns()) {
-          SchemaPath schemaPath = SchemaPath.getCompoundPath(column.getName());
-          Long previousCount = columnValueCounts.get(schemaPath);
-          if (previousCount != null) {
-            if (previousCount != GroupScan.NO_COLUMN_STATS) {
-              if (column.getNulls() != null) {
-                Long newCount = rowCount - column.getNulls();
-                columnValueCounts.put(schemaPath, columnValueCounts.get(schemaPath) + newCount);
-              }
-            }
-          } else {
+    for (RowGroupInfo rowGroup : this.rowGroupInfos) {
--- End diff --
Isn't this doing the same thing as the original code? Isn't rowGroupInfos
built from the RowGroupMetadata in the files?
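
To make the question concrete, here is a minimal sketch of the per-column value count that the removed loop in the diff accumulates, run over a flattened list of row groups instead of the nested file/row-group loop. The types `RowGroup` and `ColumnCountSketch`, the `flatten` helper, and the value used for `NO_COLUMN_STATS` are hypothetical stand-ins, not Drill's classes. The handling of a column seen for the first time is also an assumption, since the diff is cut off before that branch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ColumnCountSketch {
    // Hypothetical stand-in for GroupScan.NO_COLUMN_STATS; the real value is not shown in the diff.
    static final long NO_COLUMN_STATS = -1L;

    // Hypothetical, simplified row group: a row count plus per-column null counts.
    // A null map value means the null count for that column is unknown.
    static final class RowGroup {
        final long rowCount;
        final Map<String, Long> nullsByColumn;

        RowGroup(long rowCount, Map<String, Long> nullsByColumn) {
            this.rowCount = rowCount;
            this.nullsByColumn = nullsByColumn;
        }
    }

    // Flattens files -> row groups into one list, which is roughly what a
    // list of row group infos built from the file metadata would contain.
    static List<RowGroup> flatten(List<List<RowGroup>> files) {
        List<RowGroup> flat = new ArrayList<>();
        for (List<RowGroup> file : files) {
            flat.addAll(file);
        }
        return flat;
    }

    // Sums (rowCount - nulls) per column across all row groups.
    static Map<String, Long> countValues(List<RowGroup> rowGroups) {
        Map<String, Long> counts = new HashMap<>();
        for (RowGroup rg : rowGroups) {
            for (Map.Entry<String, Long> e : rg.nullsByColumn.entrySet()) {
                String column = e.getKey();
                Long nulls = e.getValue();
                Long previous = counts.get(column);
                if (previous == null) {
                    // Assumption: a column first seen without a null count is marked as having no stats.
                    counts.put(column, nulls == null ? NO_COLUMN_STATS : rg.rowCount - nulls);
                } else if (previous != NO_COLUMN_STATS && nulls != null) {
                    // Mirrors the removed branch: only add when both the previous count and the null count are known.
                    counts.put(column, previous + (rg.rowCount - nulls));
                }
            }
        }
        return counts;
    }
}
```

In this sketch, as long as the flattened list holds every row group of every file, `countValues(flatten(files))` visits the same row groups as the nested loop would. The two would only differ if the list had been narrowed beforehand, for example after pruning.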
> Filter pushdown for parquet handles multi rowgroup file
> -------------------------------------------------------
>
> Key: DRILL-5795
> URL: https://issues.apache.org/jira/browse/DRILL-5795
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Parquet
> Reporter: Damien Profeta
> Assignee: Damien Profeta
> Labels: doc-impacting
>
> DRILL-1950 implemented filter pushdown for parquet files, but only for the
> case of one rowgroup per parquet file. In the case of multiple rowgroups per
> file, it detects that a rowgroup can be pruned but then tells the
> drillbit to read the whole file, which leads to performance issues.
> Having multiple rowgroups per file helps to handle partitioned datasets and
> still read only the relevant subset of data, without ending up with more files
> than really needed.
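
The pruning described in the issue can be illustrated with a minimal sketch, assuming each row group carries min/max statistics for the filtered column. `RowGroupPruningSketch`, `RowGroupStats`, and `prune` are hypothetical names for illustration only and do not reflect Drill's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class RowGroupPruningSketch {
    // Hypothetical per-row-group statistics for a single numeric column.
    static final class RowGroupStats {
        final int index; // position of the row group inside its file
        final long min;  // smallest column value in the row group
        final long max;  // largest column value in the row group

        RowGroupStats(int index, long min, long max) {
            this.index = index;
            this.min = min;
            this.max = max;
        }
    }

    // Keeps only the row groups whose [min, max] range overlaps the filter
    // range [lo, hi]; every other row group cannot contain a matching row.
    static List<RowGroupStats> prune(List<RowGroupStats> rowGroups, long lo, long hi) {
        List<RowGroupStats> kept = new ArrayList<>();
        for (RowGroupStats rg : rowGroups) {
            if (rg.max >= lo && rg.min <= hi) {
                kept.add(rg);
            }
        }
        return kept;
    }
}
```

The point of the issue is that the scan should then read only the row groups that survive this step, rather than falling back to the whole file once any of its row groups is kept.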