Qifan Chen has uploaded a new patch set (#18). ( http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve min/max filters
......................................................................

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the ability to compute the minimum and maximum
values of integer, float, and double columns in Parquet tables,
and to use the new stats to discard min/max filters whose coverage
is too close to the actual range of the column, in both hash join
builders and Parquet scanners.
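
The discard decision described above can be sketched roughly as follows.
This is a minimal illustration only, not the code in this patch: the
names OverlapRatio and ShouldDiscardFilter and the threshold parameter
are hypothetical, and the sketch assumes integer columns with inclusive
ranges.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the patch): the fraction of the column's
// inclusive value range [col_min, col_max] covered by the filter's
// inclusive range [f_min, f_max].
double OverlapRatio(int64_t col_min, int64_t col_max,
                    int64_t f_min, int64_t f_max) {
  assert(col_min <= col_max);
  if (f_min > f_max) return 0.0;  // Empty filter covers nothing.
  const int64_t lo = std::max(col_min, f_min);
  const int64_t hi = std::min(col_max, f_max);
  if (lo > hi) return 0.0;  // The two ranges are disjoint.
  // Convert to double before subtracting to avoid int64 overflow.
  const double covered =
      static_cast<double>(hi) - static_cast<double>(lo) + 1.0;
  const double total =
      static_cast<double>(col_max) - static_cast<double>(col_min) + 1.0;
  return covered / total;
}

// Hypothetical policy: a filter that covers more than 'threshold' of the
// column's range rejects too few rows to be worth evaluating.
bool ShouldDiscardFilter(double overlap_ratio, double threshold) {
  return overlap_ratio > threshold;
}
```

For example, with a column whose stats are min=1 and max=10, a filter
covering [1, 5] has an overlap ratio of 0.5, while a filter covering
[0, 100] has a ratio of 1.0 and would be discarded under any threshold
below 1.0.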

The computation and display of the new column min/max stats are done
for Parquet tables only and are controlled by two new Boolean query
options (both default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage example:

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+-----------------------+--------------+-...-------+---------+---------+
| Column                | Type         |   #Falses | Min     | Max     |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk       | INT          |   -1      | 28800   | 75599   |
| ss_item_sk            | BIGINT       |   -1      | 1       | 18000   |
| ss_customer_sk        | INT          |   -1      | 1       | 100000  |
| ss_cdemo_sk           | INT          |   -1      | 15      | 1920797 |
| ss_hdemo_sk           | INT          |   -1      | 1       | 7200    |
| ss_addr_sk            | INT          |   -1      | 1       | 50000   |
| ss_store_sk           | INT          |   -1      | 1       | 10      |
| ss_promo_sk           | INT          |   -1      | 1       | 300     |
| ss_ticket_number      | BIGINT       |   -1      | 1       | 240000  |
| ss_quantity           | INT          |   -1      | 1       | 100     |
| ss_wholesale_cost     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_list_price         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sales_price        | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_sales_price    | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_list_price     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_tax            | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_coupon_amt         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid           | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_profit         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sold_date_sk       | INT          |   -1      | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in the coordinator.

The min-max filters, in both C++ class and protobuf form, are augmented
to handle the always-true state better. Once always-true is set, the min
and max in the filter are no longer populated.
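
A minimal sketch of that always-true behavior, under stated assumptions:
SimpleMinMaxFilter is a hypothetical, simplified stand-in for the real
class in be/src/util/min-max-filter.h. It handles only int64 values and
omits the empty (always-false) state and the protobuf conversion.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical simplified filter, for illustration only.
struct SimpleMinMaxFilter {
  bool always_true = false;
  int64_t min = 0;
  int64_t max = 0;

  // Merges 'in' into 'out'. Once 'out' is always-true, its min and max
  // are left untouched because they no longer carry any meaning.
  static void Or(const SimpleMinMaxFilter& in, SimpleMinMaxFilter* out) {
    assert(out != nullptr);
    if (out->always_true) return;
    if (in.always_true) {
      out->always_true = true;
      return;
    }
    out->min = std::min(out->min, in.min);
    out->max = std::max(out->max, in.max);
  }

  // An always-true filter accepts every value without consulting min/max.
  bool MayContain(int64_t v) const {
    return always_true || (min <= v && v <= max);
  }
};
```

Checking the flag first means an always-true filter can be merged or
evaluated without reading its min and max at all.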

Testing:
 - Added new compute/show stats tests for integer, float, and double
   columns in compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate that
   column stats can be used to quickly disable useless filters in
   both the hash join builder and the Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate that Or(),
   ToProtobuf(), and the constructor handle the always-true flag
   correctly;
 - Core tests.

TODO:
 1. Test compute stats for timestamp and date columns.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/service/JniFrontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
A testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
M testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
M tests/metadata/test_compute_stats.py
35 files changed, 1,107 insertions(+), 108 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/18
--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 18
Gerrit-Owner: Qifan Chen <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
