Xuebin Su has uploaded a new patch set (#12). ( http://gerrit.cloudera.org:8080/23012 )
Change subject: IMPALA-9874: Skip IO for late materialized columns
......................................................................

IMPALA-9874: Skip IO for late materialized columns

Previously, IO skipping only worked with statistics and dictionaries.
Specifically,
- RowGroups in which the column chunk statistics or dictionaries do not
  pass the predicates can be skipped, and
- Pages whose statistics in the ColumnIndex do not pass the predicates
  can be skipped.

However, we could not skip IO when all the statistics and dictionaries
passed the predicates.

This patch mitigates the issue by implementing IO skipping based on late
materialization. Specifically, for each late materialized column,
`StartScan()` will not be called until after filtering the scratch
batch, and will be skipped if no row in the current row group is
selected. When `StartScan()` is skipped, no IO occurs for the column
chunk. As a result, IO bound queries with low selectivity can run
significantly faster.

Testing:
- Added e2e tests in test_parquet_late_materialization.py to make sure
  that TotalBytesRead is reduced with late materialization.
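
The deferred `StartScan()` idea described above can be illustrated with a small, self-contained C++ sketch. This is not the code in the patch: `ColumnChunk`, `LazyColumnReader`, `ScanRowGroup` and the byte accounting are hypothetical stand-ins, and the toy `StartScan()` only models "IO happens here" by charging the chunk size once. The example uses an equality predicate whose target lies inside the chunk's min/max range, so statistics alone could not prune the row group.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one column chunk of a row group:
// the decoded values plus the number of bytes it occupies on disk.
struct ColumnChunk {
  std::vector<int64_t> values;
  int64_t size_bytes;
};

// Toy reader: "IO" is charged only when StartScan() is called.
class LazyColumnReader {
 public:
  explicit LazyColumnReader(const ColumnChunk* chunk) : chunk_(chunk) {}

  // Models issuing the read for the whole column chunk (at most once).
  void StartScan() {
    if (started_) return;
    started_ = true;
    bytes_read_ += chunk_->size_bytes;
  }

  // Values may only be read after the scan has been started.
  int64_t ReadValue(std::size_t row) const {
    assert(started_);
    return chunk_->values[row];
  }

  int64_t bytes_read() const { return bytes_read_; }

 private:
  const ColumnChunk* chunk_;
  bool started_ = false;
  int64_t bytes_read_ = 0;
};

struct ScanResult {
  std::vector<int64_t> late_values;  // late column values of selected rows
  int64_t total_bytes_read;          // bytes charged across both columns
};

// Scans one row group. The predicate column is read eagerly; the late
// materialized column is started only if at least one row survives.
ScanResult ScanRowGroup(const ColumnChunk& pred_col,
                        const ColumnChunk& late_col, int64_t target) {
  assert(pred_col.values.size() == late_col.values.size());
  LazyColumnReader pred_reader(&pred_col);
  LazyColumnReader late_reader(&late_col);

  // Step 1: read the predicate column and filter the batch.
  pred_reader.StartScan();
  std::vector<std::size_t> selected;
  for (std::size_t row = 0; row < pred_col.values.size(); ++row) {
    if (pred_reader.ReadValue(row) == target) selected.push_back(row);
  }

  // Step 2: start the late column only when some row was selected.
  ScanResult result{{}, 0};
  if (!selected.empty()) {
    late_reader.StartScan();
    for (std::size_t row : selected) {
      result.late_values.push_back(late_reader.ReadValue(row));
    }
  }
  result.total_bytes_read = pred_reader.bytes_read() + late_reader.bytes_read();
  return result;
}
```

In this toy model, a predicate column holding {1, 3, 5} with target 2 passes a min/max check but selects no row, so only the predicate column's bytes are charged; with target 3 the late column is started and its bytes are charged as well.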
Change-Id: I4a052b028220517503e634e3f916d1fbd60eb65d
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/orc/hdfs-orc-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-chunk-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
M be/src/exec/parquet/parquet-complex-column-reader.h
M be/src/exec/scratch-tuple-batch.h
M testdata/workloads/functional-query/queries/QueryTest/parquet-late-materialization.test
M tests/query_test/test_parquet_late_materialization.py
12 files changed, 259 insertions(+), 52 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/12/23012/12
--
To view, visit http://gerrit.cloudera.org:8080/23012
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4a052b028220517503e634e3f916d1fbd60eb65d
Gerrit-Change-Number: 23012
Gerrit-PatchSet: 12
Gerrit-Owner: Xuebin Su <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Xuebin Su <[email protected]>
