Xuebin Su has uploaded a new patch set (#5). ( http://gerrit.cloudera.org:8080/23012 )
Change subject: IMPALA-9874: Skip IO for late materialized columns
......................................................................

IMPALA-9874: Skip IO for late materialized columns

This patch implements IO skipping at column chunk level for Parquet tables. Specifically, for late materialized columns, `StartScan()` will not be called until after evaluating the predicates, and will be skipped if no row in the current row group is selected. As a result, IO bound queries with low selectivity can run significantly faster.

Testing:
- Added e2e tests in test_parquet_late_materialization.py to make sure that TotalBytesRead is reduced with late materialization.

Change-Id: I4a052b028220517503e634e3f916d1fbd60eb65d
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/orc/hdfs-orc-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-chunk-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
M be/src/exec/parquet/parquet-complex-column-reader.h
M be/src/exec/parquet/parquet-page-reader.cc
M be/src/exec/parquet/parquet-page-reader.h
M be/src/exec/scratch-tuple-batch.h
M testdata/workloads/functional-query/queries/QueryTest/parquet-late-materialization.test
M tests/query_test/test_parquet_late_materialization.py
14 files changed, 239 insertions(+), 41 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/12/23012/5
--
To view, visit http://gerrit.cloudera.org:8080/23012
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4a052b028220517503e634e3f916d1fbd60eb65d
Gerrit-Change-Number: 23012
Gerrit-PatchSet: 5
Gerrit-Owner: Xuebin Su <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
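
For readers unfamiliar with the idea, the deferred `StartScan()` behavior described in the commit message can be illustrated with a small standalone sketch. This is not the code from the patch: `SketchColumnReader`, `ScanRowGroup`, and the byte counter are hypothetical names invented here, and the real scanner classes are considerably more involved. The sketch only assumes what the commit message states, namely that the IO for late materialized columns is issued after predicate evaluation and is skipped when no row in the row group is selected.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a column reader; not Impala's actual class.
struct SketchColumnReader {
  explicit SketchColumnReader(int64_t bytes) : chunk_bytes(bytes) {}

  // Models issuing the IO for this column chunk.
  void StartScan() {
    scan_started = true;
    bytes_read += chunk_bytes;
  }

  int64_t chunk_bytes;        // size of the column chunk on disk
  bool scan_started = false;  // whether IO was ever issued
  int64_t bytes_read = 0;     // bytes read so far
};

// Processes one row group and returns the total number of bytes read.
// `selected` holds one flag per row: the outcome of evaluating the
// predicates on the filter columns (assumed to be computed elsewhere).
int64_t ScanRowGroup(std::vector<SketchColumnReader>& filter_cols,
                     std::vector<SketchColumnReader>& late_cols,
                     const std::vector<bool>& selected) {
  // Filter columns are always read, since the predicates need them.
  for (SketchColumnReader& col : filter_cols) col.StartScan();

  // Late materialized columns are read only if some row survived.
  const bool any_selected =
      std::any_of(selected.begin(), selected.end(), [](bool s) { return s; });
  if (any_selected) {
    for (SketchColumnReader& col : late_cols) col.StartScan();
  }

  int64_t total = 0;
  for (const SketchColumnReader& col : filter_cols) total += col.bytes_read;
  for (const SketchColumnReader& col : late_cols) total += col.bytes_read;
  return total;
}
```

In this model, a row group in which every row fails the predicates costs only the bytes of the filter columns, which is the kind of reduction the e2e tests check through TotalBytesRead.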
