Qifan Chen has uploaded a new patch set (#25). ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics 
to skip pages based on equi-join predicate
......................................................................

[WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages 
based on equi-join predicate

This patch adds the logic to utilize min/max stats for Parquet row
groups or pages to skip these entities when they don't qualify an
equi-join predicate.

A new class of predicates called overlap predicates is introduced to aid
in the determination of whether a Parquet row group or a page overlap
with a range computed from the hash join. If not, then the entire
Parquet row group or the page are skipped.

The new class of predicates co-exist with the existing min/max conjuncts
that are introduced based on the local or transitive scan predicates.
Both classes of predicates can work individually or together with each
other. The overlap predicates are evaluated after the existing min/max
conjuncts.

Two new run-time profile counters are added for the number of row groups
or pages filtered via the overlap predicates respectively:
  1. NumMinMaxFilteredRowGroups
  2. NumMinMaxFilteredPages

Testing:
1. Added data type specific overlap method tests in
   min-max-filter-test.cc (boolean, int, string, date, timestamp and
   decimal);
2. Unit tested on various column types (int, bigint, string
   and decimal) with TPCH tables. Benefits were significant when the
   join column on the outer table is sorted, and somewhat observable
   when the min/max boundary values of the pages or row groups are
   monotonic;
3. Added new tests in min_max_filters.test (invoked from
   test_runtime_filters.py) to demonstrate filtered pages in
   run-time counter NumMinMaxFilteredPage.

TBD:
1. Convert remaining unit tests into query tests;
2. Performance measurement;
3. Checkout the effect of implicit casting in join predicate on
   overlap evaluation;
4. Compute a usefulness score for the overlap predicate and integrate
   it into MAX_NUM_RUNTIME_FILTERS limit;
5. Core testing.

Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
---
M be/src/exec/exec-node.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator.cc
M be/src/runtime/date-value.cc
M be/src/runtime/date-value.h
M be/src/runtime/decimal-value.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/datasets/tpch/tpch_schema_template.sql
M testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
24 files changed, 1,046 insertions(+), 153 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/16720/25
--
To view, visit http://gerrit.cloudera.org:8080/16720
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 25
Gerrit-Owner: Qifan Chen <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>

Reply via email to