[Impala-CR](cdh5-trunk) IMPALA-2805: Order conjuncts based on selectivity and cost

Thomas Tauber-Marshall (Code Review) Fri, 29 Apr 2016 12:02:17 -0700

Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-2805: Order conjuncts based on selectivity and cost
......................................................................



Patch Set 16:

(9 comments)

TPCDS results with the latest changes:

Plan affected by the change:
 
+----------+------------------------------------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+
 | Workload | Query                              | File Format           | 
Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Num Clients | 
Iters |
 
+----------+------------------------------------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+
 | TPCDS()  | TPCDS-Q52                          | parquet / none / none | 0.49 
  | 0.47        |   +3.68%   |   4.24%   |   0.60%        | 1           | 5     
|
 | TPCDS()  | TPCDS-Q47                          | parquet / none / none | 2.36 
  | 2.29        |   +3.35%   |   3.80%   |   1.09%        | 1           | 5     
|
 | TPCDS()  | TPCDS-Q34                          | parquet / none / none | 0.96 
  | 0.94        |   +2.28%   |   0.96%   |   2.71%        | 1           | 5     
|
 | TPCDS()  | TPCDS-Q19                          | parquet / none / none | 0.65 
  | 0.64        |   +0.74%   |   3.22%   |   3.38%        | 1           | 5     
|
 | TPCDS()  | TPCDS-Q7                           | parquet / none / none | 2.16 
  | 2.15        |   +0.48%   |   1.79%   |   0.85%        | 1           | 5     
|
 | TPCDS()  | TPCDS-Q59                          | parquet / none / none | 1.63 
  | 1.63        |   +0.28%   |   1.51%   |   1.42%        | 1           | 5     
|
 | TPCDS()  | TPCDS-Q27                          | parquet / none / none | 2.18 
  | 2.19        |   -0.32%   |   1.26%   |   1.90%        | 1           | 5     
|
 | TPCDS()  | TPCDS-Q42                          | parquet / none / none | 0.48 
  | 0.49        |   -2.09%   |   0.59%   |   4.87%        | 1           | 5     
|
 | TPCDS()  | TPCDS-Q73                          | parquet / none / none | 0.87 
  | 0.89        |   -2.19%   |   2.07%   |   2.27%        | 1           | 5     
|
 | TPCDS()  | TPCDS-Q63                          | parquet / none / none | 0.56 
  | 0.57        |   -2.27%   |   4.98%   |   2.96%        | 1           | 5     
|
 | TPCDS()  | TPCDS-Q55                          | parquet / none / none | 0.48 
  | 0.50        |   -2.31%   |   4.80%   |   5.10%        | 1           | 5     
|

Plan not affected:
 | TPCDS()  | TPCDS-Q88                          | parquet / none / none | 5.03 
  | 4.82        |   +4.40%   |   1.71%   |   0.72%        | 1           | 5     
|
 | TPCDS()  | TPCDS-Q68                          | parquet / none / none | 1.05 
  | 1.01        |   +4.23%   |   2.15%   |   0.18%        | 1           | 5     
|
 | TPCDS()  | TPCDS-Q89                          | parquet / none / none | 0.62 
  | 0.59        |   +3.80%   |   3.23%   |   3.54%        | 1           | 5     
|
 | TPCDS()  | TPCDS-Q98                          | parquet / none / none | 0.95 
  | 0.93        |   +1.45%   |   3.86%   |   2.12%        | 1           | 5     
|
 | TPCDS()  | TPCDS-Q46                          | parquet / none / none | 1.08 
  | 1.06        |   +1.44%   |   1.24%   |   2.76%        | 1           | 5     
|
 | TPCDS()  | TPCDS-Q6                           | parquet / none / none | 0.94 
  | 0.93        |   +1.05%   |   2.25%   |   1.76%        | 1           | 5     
|
 | TPCDS()  | TPCDS-Q43                          | parquet / none / none | 0.65 
  | 0.65        |   +0.35%   |   3.19%   |   3.23%        | 1           | 5     
|
 | TPCDS()  | TPCDS-Q79                          | parquet / none / none | 0.88 
  | 0.88        |   -0.10%   |   2.65%   |   2.03%        | 1           | 5     
|
 | TPCDS()  | TPCDS-Q8                           | parquet / none / none | 0.91 
  | 0.91        |   -0.60%   |   3.52%   |   1.39%        | 1           | 5     
|
 | TPCDS()  | TPCDS-Q65                          | parquet / none / none | 1.43 
  | 1.44        |   -0.61%   |   1.07%   |   1.81%        | 1           | 5     
|
 | TPCDS()  | TPCDS-Q53                          | parquet / none / none | 0.53 
  | 0.54        |   -1.01%   |   0.72%   |   4.32%        | 1           | 5     
|
 | TPCDS()  | TPCDS-Q61                          | parquet / none / none | 1.68 
  | 1.70        |   -1.21%   |   1.75%   |   3.45%        | 1           | 5     
|
 | TPCDS()  | TPCDS-Q96                          | parquet / none / none | 0.73 
  | 0.74        |   -2.06%   |   0.31%   |   2.99%        | 1           | 5     
|
 | TPCDS()  | TPCDS-Q3                           | parquet / none / none | 0.52 
  | 0.53        |   -2.31%   |   3.25%   |   6.00%        | 1           | 5     
|
 
+----------+------------------------------------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+

Analysis of plan changes posted inline with the tpcds-all planner test.

http://gerrit.cloudera.org:8080/#/c/2598/16/testdata/workloads/functional-planner/queries/PlannerTest/tpcds-all.test
File testdata/workloads/functional-planner/queries/PlannerTest/tpcds-all.test:

Line 159: |     predicates: cd_marital_status = 'W', cd_gender = 'F', 
cd_education_status = 'Primary'
We estimate cd_marital_status and cd_gender as the same cost, but 
cd_marital_status is more selective so we move it to the front.


Line 486: |     predicates: d_year = 1999, d_moy = 11, tpcds.date_dim.d_date_sk 
>= 2451484, tpcds.date_dim.d_date_sk <= 2451513
Since these are all numeric comparisons, we estimate all of their costs as 
being the same. d_year is the most selective, so it gets moved to the front.


Line 794: |  |     predicates: date_dim.d_year IN (1998, 1998 + 1, 1998 + 2), 
(date_dim.d_dom BETWEEN 1 AND 3 OR date_dim.d_dom BETWEEN 25 AND 28)
We're estimating them as similar cost - 4 numeric comparisons for the 'BETWEEN 
OR BETWEEN'  vs. 3 comparisons plus 2 arithmeric ops for the 'IN', but the 'IN' 
is highly selective and we don't have an estimate for the 'OR', so the 'IN' 
goes in front.


Line 802: |  |     predicates: household_demographics.hd_vehicle_count > 0, 
(CASE WHEN household_demographics.hd_vehicle_count > 0 THEN 
household_demographics.hd_dep_count / household_demographics.hd_vehicle_count 
ELSE NULL END) > 1.2, (household_demographics.hd_buy_potential = '>10000' OR 
household_demographics.hd_buy_potential = 'unknown')
The '> 0' is estimated as the cheapest, the CASE is the second cheapest since 
its basically just a numeric comparison, and the 'OR' of string comparisons is 
the most expensive, which seems right.


Line 914: |     predicates: dt.d_year = 1998, dt.d_moy = 12
The costs are the same, but d_year is much more selective so it gets moved to 
the front.


Line 1681: |  hash predicates: d_week_seq = d_week_seq - 52, s_store_id = 
s_store_id
We don't have selectivity for either. s_store_ids are long strings, so we're 
estimating the cost of that comparison as more expensive.


Line 1920: |     predicates: tpcds.date_dim.d_date_sk >= 2451911, 
tpcds.date_dim.d_date_sk <= 2452275, d_month_seq IN (1212, 1212 + 1, 1212 + 2, 
1212 + 3, 1212 + 4, 1212 + 5, 1212 + 6, 1212 + 7, 1212 + 8, 1212 + 9, 1212 + 
10, 1212 + 11)
The simple numeric comparisons are clearly less expensive than the 'IN' and get 
moved to the front.


Line 3274: |  hash predicates: rank() = rank() - 1, s_store_name = 
s_store_name, i_category = i_category, s_company_name = s_company_name, i_brand 
= i_brand
We don't have any selectivities. All of these are string comparisons except 
'rank()' so it goes first. The rest are ordered by their average string length 
from the catalog.


Line 3319: |  |  predicates: d_year = 2000, avg(sum(ss_sales_price)) > 0, CASE 
WHEN avg(sum(ss_sales_price)) > 0 THEN abs(sum(ss_sales_price) - 
avg(sum(ss_sales_price))) / avg(sum(ss_sales_price)) ELSE NULL END > 0.1
The CASE is clearly the most expensive and moves to the end. The other have the 
same estimated cost, but d_year is highly selective so it stays at the front.


-- 
To view, visit http://gerrit.cloudera.org:8080/2598
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I02279a26fbc6308ac5eb819d78345fc010469034
Gerrit-PatchSet: 16
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Thomas Tauber-Marshall <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Henry Robinson <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Matthew Jacobs <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) IMPALA-2805: Order conjuncts based on selectivity and cost

Reply via email to