Quanlong Huang has uploaded a new patch set (#12) to the change originally created by Norbert Luksa. ( http://gerrit.cloudera.org:8080/15403 )
Change subject: IMPALA-6505: Min-Max predicate push down in ORC scanner
......................................................................

IMPALA-6505: Min-Max predicate push down in ORC scanner

In the planning phase, the planner collects and generates min-max
predicates that can be evaluated on Parquet file statistics. This can
easily be extended to ORC tables.

This commit implements min/max predicate pushdown for the ORC scanner,
leveraging the external ORC library's search arguments. We build the
search arguments when we open the scanner, since we do not need to
modify them later.

Also added a new query option orc_read_statistics, similar to
parquet_read_statistics. If the option is set to true (the default),
predicate pushdown takes effect; otherwise it is skipped. The
predicates are evaluated at ORC row group level, i.e. by default for
every 10,000 rows.

Limitations:
 - Min-max predicates on CHAR/VARCHAR types are not pushed down due to
   inconsistent padding/truncating behavior between Hive and Impala
   (IMPALA-10882).
 - Min-max predicates on TIMESTAMP are not pushed down (IMPALA-10915).
 - Min-max predicates having different arg types are not pushed down
   (IMPALA-10916).
 - Min-max predicates with non-literal const exprs are not pushed down,
   since the SearchArgument interfaces only accept literals. This only
   happens when expr rewrites are disabled, which also disables
   constant folding.

Tests:
 - Add e2e tests similar to test_parquet_stats to verify that
   predicates are pushed down.
 - Run CORE tests.
 - Run TPCH benchmark: no improvement and no regression. On the other
   hand, certain selective queries gained significant speed-up, e.g.
   select count(*) from lineitem where l_orderkey = 1.

Change-Id: I136622413db21e0941d238ab6aeea901a6464845
---
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exprs/scalar-expr.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M bin/impala-config.sh
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
A testdata/workloads/functional-query/queries/QueryTest/orc-stats.test
M tests/query_test/test_nested_types.py
A tests/query_test/test_orc_stats.py
13 files changed, 915 insertions(+), 22 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/15403/12
--
To view, visit http://gerrit.cloudera.org:8080/15403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I136622413db21e0941d238ab6aeea901a6464845
Gerrit-Change-Number: 15403
Gerrit-PatchSet: 12
Gerrit-Owner: Norbert Luksa <[email protected]>
Gerrit-Reviewer: Anonymous Coward (520)
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Norbert Luksa <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
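
For readers unfamiliar with the idea, the row-group-level evaluation described in the commit message can be illustrated with a small standalone sketch. This is not the code in hdfs-orc-scanner.cc and does not use the ORC library's SearchArgument API; the names RowGroupStats, may_contain and row_groups_to_read are hypothetical, the statistics are made up, and NULL handling is ignored. It only shows how min/max statistics allow a reader to skip row groups that cannot satisfy a predicate of the form "column op literal", and how an option like orc_read_statistics could turn the skipping off.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupStats:
    """Hypothetical min/max statistics of one column in one row group."""
    min_value: int
    max_value: int


def may_contain(stats: RowGroupStats, op: str, literal: int) -> bool:
    """Return False only if the [min, max] range proves no row can match."""
    if op == "=":
        return stats.min_value <= literal <= stats.max_value
    if op == "<":
        return stats.min_value < literal
    if op == "<=":
        return stats.min_value <= literal
    if op == ">":
        return stats.max_value > literal
    if op == ">=":
        return stats.max_value >= literal
    # Unsupported operators never cause a row group to be skipped.
    return True


def row_groups_to_read(all_stats: List[RowGroupStats], op: str, literal: int,
                       read_statistics: bool = True) -> List[int]:
    """Return the indexes of the row groups that still have to be scanned."""
    if not read_statistics:
        # Statistics are ignored, so every row group is read.
        return list(range(len(all_stats)))
    return [i for i, s in enumerate(all_stats) if may_contain(s, op, literal)]


# Three row groups of 10,000 rows each over a sorted key column (assumed data).
stats = [RowGroupStats(1, 10000),
         RowGroupStats(10001, 20000),
         RowGroupStats(20001, 30000)]
print(row_groups_to_read(stats, "=", 1))
print(row_groups_to_read(stats, "=", 1, read_statistics=False))
```

With a predicate such as l_orderkey = 1 on data that happens to be clustered on the key, only the first row group survives the check, which is consistent with the speed-up reported above for selective queries.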
