Qifan Chen has uploaded a new patch set (#14). (
http://gerrit.cloudera.org:8080/17075 )
Change subject: IMPALA-10494: Making use of the min/max column stats to improve
min/max filters
......................................................................
IMPALA-10494: Making use of the min/max column stats to improve min/max filters
This patch adds the ability to compute the minimum and maximum
values of integer, float, and double columns in Parquet tables,
and uses the new stats to discard min/max filters whose coverage
is too close to the column's actual range.
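A minimal sketch of the coverage idea, not the code in this patch: it
assumes a filter is described by a closed [filter_min, filter_max]
range and is judged useless once it covers a given fraction of the
column's [col_min, col_max] range. The function name and the 0.9
threshold are hypothetical and chosen only for illustration.

```python
def filter_covers_too_much(filter_min, filter_max, col_min, col_max,
                           threshold=0.9):
    """Return True if the filter range covers at least `threshold` of the
    column range, i.e. the filter is unlikely to reject many rows.

    All names and the default threshold are illustrative assumptions.
    """
    if col_min > col_max:
        raise ValueError("column stats are inverted")
    # Clip the filter range to the column range.
    lo = max(filter_min, col_min)
    hi = min(filter_max, col_max)
    if hi < lo:
        # No overlap at all: the filter rejects everything, so keep it.
        return False
    col_width = col_max - col_min
    if col_width == 0:
        # Single-valued column that the filter overlaps: fully covered.
        return True
    return (hi - lo) / col_width >= threshold


# Range of ss_sold_date_sk taken from the show column stats output.
COL_MIN, COL_MAX = 2450816, 2452642
print(filter_covers_too_much(COL_MIN, COL_MAX, COL_MIN, COL_MAX))
print(filter_covers_too_much(2451000, 2451100, COL_MIN, COL_MAX))
```

With these inputs, a filter spanning the whole column range is flagged
as useless, while a filter spanning 100 out of 1826 values is kept.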
The computation and display of the new column min/max stats can be
controlled by two new Boolean query options (both default to false):
1. compute_column_minmax_stats
2. show_column_minmax_stats
When show_column_minmax_stats is enabled, two new columns, 'Min' and
'Max', are added to the output of the show column stats command, as
shown below.
set show_column_minmax_stats=true;
show column stats tpcds_parquet.store_sales;
+-----------------------+--------------+-...-------+---------+---------+
| Column                | Type         | #Falses   | Min     | Max     |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk       | INT          | -1        | 28800   | 75599   |
| ss_item_sk            | BIGINT       | -1        | 1       | 18000   |
| ss_customer_sk        | INT          | -1        | 1       | 100000  |
| ss_cdemo_sk           | INT          | -1        | 15      | 1920797 |
| ss_hdemo_sk           | INT          | -1        | 1       | 7200    |
| ss_addr_sk            | INT          | -1        | 1       | 50000   |
| ss_store_sk           | INT          | -1        | 1       | 10      |
| ss_promo_sk           | INT          | -1        | 1       | 300     |
| ss_ticket_number      | BIGINT       | -1        | 1       | 240000  |
| ss_quantity           | INT          | -1        | 1       | 100     |
| ss_wholesale_cost     | DECIMAL(7,2) | -1        | -1      | -1      |
| ss_list_price         | DECIMAL(7,2) | -1        | -1      | -1      |
| ss_sales_price        | DECIMAL(7,2) | -1        | -1      | -1      |
| ss_ext_discount_amt   | DECIMAL(7,2) | -1        | -1      | -1      |
| ss_ext_sales_price    | DECIMAL(7,2) | -1        | -1      | -1      |
| ss_ext_wholesale_cost | DECIMAL(7,2) | -1        | -1      | -1      |
| ss_ext_list_price     | DECIMAL(7,2) | -1        | -1      | -1      |
| ss_ext_tax            | DECIMAL(7,2) | -1        | -1      | -1      |
| ss_coupon_amt         | DECIMAL(7,2) | -1        | -1      | -1      |
| ss_net_paid           | DECIMAL(7,2) | -1        | -1      | -1      |
| ss_net_paid_inc_tax   | DECIMAL(7,2) | -1        | -1      | -1      |
| ss_net_profit         | DECIMAL(7,2) | -1        | -1      | -1      |
| ss_sold_date_sk       | INT          | -1        | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+
Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in the
coordinator.
Testing:
- Added TestLowAndHighValueShort and TestLowAndHighValueInt to
IncrStatsUtilTest;
- Added new tests in overlap_min_max_filters.test to demonstrate the
usefulness of column stats to quickly disable useless filters;
- Tested compute/show stats for integer, float and double columns;
- Ran core tests.
TODO:
1. Test compute stats for timestamp and date columns;
2. Add logic to disable min/max filters inside the HJ builder using
the column stats.
Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/service/JniFrontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
A testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
M testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
M tests/metadata/test_compute_stats.py
32 files changed, 1,007 insertions(+), 81 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/14
--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 14
Gerrit-Owner: Qifan Chen <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>