Xuebin Su has uploaded a new patch set (#12). ( http://gerrit.cloudera.org:8080/23012 )

Change subject: IMPALA-9874: Skip IO for late materialized columns
......................................................................

IMPALA-9874: Skip IO for late materialized columns

Previously, IO skipping only worked with statistics and dictionaries.
Specifically,
- RowGroups in which the column chunk statistics or dictionaries do not
  pass the predicates can be skipped, and
- Pages whose statistics in the ColumnIndex do not pass the predicates
  can be skipped.
However, we could not skip IO when all the statistics and dictionaries
passed the predicates, even if few or no rows actually matched.
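
To illustrate the statistics-based skipping described above, here is a
minimal sketch. It assumes integer min/max statistics and two simple
predicate shapes; `MinMaxStats`, `CanSkipForEquals()` and
`CanSkipForGreaterThan()` are hypothetical names for illustration, not
Impala's actual code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical min/max statistics for one column chunk (or one page).
struct MinMaxStats {
  int64_t min;
  int64_t max;
};

// Returns true if no value in [stats.min, stats.max] can satisfy
// `col = value`, so the row group (or page) can be skipped unread.
bool CanSkipForEquals(const MinMaxStats& stats, int64_t value) {
  assert(stats.min <= stats.max);
  return value < stats.min || value > stats.max;
}

// Returns true if no value in [stats.min, stats.max] can satisfy
// `col > bound`.
bool CanSkipForGreaterThan(const MinMaxStats& stats, int64_t bound) {
  assert(stats.min <= stats.max);
  return stats.max <= bound;
}
```

With statistics of [10, 20], `col = 15` cannot be ruled out, so the chunk
is read even when no row actually holds the value 15. Statistics alone
cannot help in that case.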

This patch mitigates the issue by implementing IO skipping based on late
materialization. Specifically, for each late materialized column,
`StartScan()` will not be called until after filtering the scratch
batch, and will be skipped if no row in the current row group is
selected. When `StartScan()` is skipped, no IO occurs for the column
chunk. As a result, IO-bound queries in which few rows pass the
predicates can run significantly faster.
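
The deferred `StartScan()` idea can be sketched roughly as follows. This
is a simplified model, not the code in this patch: `ColumnReaderSketch`
and `ScanRowGroup()` are hypothetical, IO is modeled as a byte counter,
and the whole row group is treated as a single scratch batch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a column chunk reader.
struct ColumnReaderSketch {
  int64_t chunk_bytes = 0;  // Size of the column chunk on disk.
  int64_t bytes_read = 0;   // IO issued so far.
  bool started = false;

  // Models StartScan(): the point at which IO for the chunk is issued.
  void StartScan() {
    assert(!started);
    started = true;
    bytes_read += chunk_bytes;
  }
};

// Scans one row group. `selected` holds, per row, whether the row passed
// the predicates evaluated on the predicate columns. Returns the total
// number of bytes read by all readers.
int64_t ScanRowGroup(std::vector<ColumnReaderSketch>& predicate_cols,
                     std::vector<ColumnReaderSketch>& late_cols,
                     const std::vector<bool>& selected) {
  // Predicate columns are always needed to evaluate the filter.
  for (auto& col : predicate_cols) col.StartScan();

  bool any_selected = false;
  for (bool s : selected) any_selected = any_selected || s;

  // Late materialized columns start only if some row survived filtering.
  if (any_selected) {
    for (auto& col : late_cols) col.StartScan();
  }

  int64_t total = 0;
  for (const auto& col : predicate_cols) total += col.bytes_read;
  for (const auto& col : late_cols) total += col.bytes_read;
  return total;
}
```

In this model, a row group in which every row is filtered out costs only
the bytes of the predicate columns, which is the effect the e2e tests
check through TotalBytesRead.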

Testing:
- Added e2e tests in test_parquet_late_materialization.py to make sure
  that TotalBytesRead is reduced with late materialization.

Change-Id: I4a052b028220517503e634e3f916d1fbd60eb65d
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/orc/hdfs-orc-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-chunk-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
M be/src/exec/parquet/parquet-complex-column-reader.h
M be/src/exec/scratch-tuple-batch.h
M testdata/workloads/functional-query/queries/QueryTest/parquet-late-materialization.test
M tests/query_test/test_parquet_late_materialization.py
12 files changed, 259 insertions(+), 52 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/12/23012/12
--
To view, visit http://gerrit.cloudera.org:8080/23012
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4a052b028220517503e634e3f916d1fbd60eb65d
Gerrit-Change-Number: 23012
Gerrit-PatchSet: 12
Gerrit-Owner: Xuebin Su <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Xuebin Su <[email protected]>
