Thomas Tauber-Marshall has posted comments on this change. Change subject: IMPALA-2805: Order conjuncts based on selectivity and cost ......................................................................
Patch Set 16: (9 comments) TPCDS results with the latest changes: Plan affected by the change: +----------+------------------------------------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+ | Workload | Query | File Format | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Num Clients | Iters | +----------+------------------------------------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+ | TPCDS() | TPCDS-Q52 | parquet / none / none | 0.49 | 0.47 | +3.68% | 4.24% | 0.60% | 1 | 5 | | TPCDS() | TPCDS-Q47 | parquet / none / none | 2.36 | 2.29 | +3.35% | 3.80% | 1.09% | 1 | 5 | | TPCDS() | TPCDS-Q34 | parquet / none / none | 0.96 | 0.94 | +2.28% | 0.96% | 2.71% | 1 | 5 | | TPCDS() | TPCDS-Q19 | parquet / none / none | 0.65 | 0.64 | +0.74% | 3.22% | 3.38% | 1 | 5 | | TPCDS() | TPCDS-Q7 | parquet / none / none | 2.16 | 2.15 | +0.48% | 1.79% | 0.85% | 1 | 5 | | TPCDS() | TPCDS-Q59 | parquet / none / none | 1.63 | 1.63 | +0.28% | 1.51% | 1.42% | 1 | 5 | | TPCDS() | TPCDS-Q27 | parquet / none / none | 2.18 | 2.19 | -0.32% | 1.26% | 1.90% | 1 | 5 | | TPCDS() | TPCDS-Q42 | parquet / none / none | 0.48 | 0.49 | -2.09% | 0.59% | 4.87% | 1 | 5 | | TPCDS() | TPCDS-Q73 | parquet / none / none | 0.87 | 0.89 | -2.19% | 2.07% | 2.27% | 1 | 5 | | TPCDS() | TPCDS-Q63 | parquet / none / none | 0.56 | 0.57 | -2.27% | 4.98% | 2.96% | 1 | 5 | | TPCDS() | TPCDS-Q55 | parquet / none / none | 0.48 | 0.50 | -2.31% | 4.80% | 5.10% | 1 | 5 | Plan not affected: | TPCDS() | TPCDS-Q88 | parquet / none / none | 5.03 | 4.82 | +4.40% | 1.71% | 0.72% | 1 | 5 | | TPCDS() | TPCDS-Q68 | parquet / none / none | 1.05 | 1.01 | +4.23% | 2.15% | 0.18% | 1 | 5 | | TPCDS() | TPCDS-Q89 | parquet / none / none | 0.62 | 0.59 | +3.80% | 3.23% | 3.54% | 1 | 5 | | TPCDS() | TPCDS-Q98 | parquet / none / none | 0.95 | 0.93 | +1.45% | 3.86% | 2.12% | 1 | 5 | | TPCDS() | TPCDS-Q46 | parquet / none / none | 1.08 | 1.06 | +1.44% | 1.24% | 2.76% | 1 | 5 | | TPCDS() | TPCDS-Q6 | parquet / none / none | 0.94 | 0.93 | +1.05% | 2.25% | 1.76% | 1 | 5 | | TPCDS() | TPCDS-Q43 | parquet / none / none | 0.65 | 0.65 | +0.35% | 3.19% | 3.23% | 1 | 5 | | TPCDS() | TPCDS-Q79 | parquet / none / none | 0.88 | 0.88 | -0.10% | 2.65% | 2.03% | 1 | 5 | | TPCDS() | TPCDS-Q8 | parquet / none / none | 0.91 | 0.91 | -0.60% | 3.52% | 1.39% | 1 | 5 | | TPCDS() | TPCDS-Q65 | parquet / none / none | 1.43 | 1.44 | -0.61% | 1.07% | 1.81% | 1 | 5 | | TPCDS() | TPCDS-Q53 | parquet / none / none | 0.53 | 0.54 | -1.01% | 0.72% | 4.32% | 1 | 5 | | TPCDS() | TPCDS-Q61 | parquet / none / none | 1.68 | 1.70 | -1.21% | 1.75% | 3.45% | 1 | 5 | | TPCDS() | TPCDS-Q96 | parquet / none / none | 0.73 | 0.74 | -2.06% | 0.31% | 2.99% | 1 | 5 | | TPCDS() | TPCDS-Q3 | parquet / none / none | 0.52 | 0.53 | -2.31% | 3.25% | 6.00% | 1 | 5 | +----------+------------------------------------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+ Analysis of plan changes posted inline with the tpcds-all planner test. http://gerrit.cloudera.org:8080/#/c/2598/16/testdata/workloads/functional-planner/queries/PlannerTest/tpcds-all.test File testdata/workloads/functional-planner/queries/PlannerTest/tpcds-all.test: Line 159: | predicates: cd_marital_status = 'W', cd_gender = 'F', cd_education_status = 'Primary' We estimate cd_marital_status and cd_gender as the same cost, but cd_marital_status is more selective so we move it to the front. Line 486: | predicates: d_year = 1999, d_moy = 11, tpcds.date_dim.d_date_sk >= 2451484, tpcds.date_dim.d_date_sk <= 2451513 Since these are all numeric comparisons, we estimate all of their costs as being the same. d_year is the most selective, so it gets moved to the front. Line 794: | | predicates: date_dim.d_year IN (1998, 1998 + 1, 1998 + 2), (date_dim.d_dom BETWEEN 1 AND 3 OR date_dim.d_dom BETWEEN 25 AND 28) We're estimating them as similar cost - 4 numeric comparisons for the 'BETWEEN OR BETWEEN' vs. 3 comparisons plus 2 arithmeric ops for the 'IN', but the 'IN' is highly selective and we don't have an estimate for the 'OR', so the 'IN' goes in front. Line 802: | | predicates: household_demographics.hd_vehicle_count > 0, (CASE WHEN household_demographics.hd_vehicle_count > 0 THEN household_demographics.hd_dep_count / household_demographics.hd_vehicle_count ELSE NULL END) > 1.2, (household_demographics.hd_buy_potential = '>10000' OR household_demographics.hd_buy_potential = 'unknown') The '> 0' is estimated as the cheapest, the CASE is the second cheapest since its basically just a numeric comparison, and the 'OR' of string comparisons is the most expensive, which seems right. Line 914: | predicates: dt.d_year = 1998, dt.d_moy = 12 The costs are the same, but d_year is much more selective so it gets moved to the front. Line 1681: | hash predicates: d_week_seq = d_week_seq - 52, s_store_id = s_store_id We don't have selectivity for either. s_store_ids are long strings, so we're estimating the cost of that comparison as more expensive. Line 1920: | predicates: tpcds.date_dim.d_date_sk >= 2451911, tpcds.date_dim.d_date_sk <= 2452275, d_month_seq IN (1212, 1212 + 1, 1212 + 2, 1212 + 3, 1212 + 4, 1212 + 5, 1212 + 6, 1212 + 7, 1212 + 8, 1212 + 9, 1212 + 10, 1212 + 11) The simple numeric comparisons are clearly less expensive than the 'IN' and get moved to the front. Line 3274: | hash predicates: rank() = rank() - 1, s_store_name = s_store_name, i_category = i_category, s_company_name = s_company_name, i_brand = i_brand We don't have any selectivities. All of these are string comparisons except 'rank()' so it goes first. The rest are ordered by their average string length from the catalog. Line 3319: | | predicates: d_year = 2000, avg(sum(ss_sales_price)) > 0, CASE WHEN avg(sum(ss_sales_price)) > 0 THEN abs(sum(ss_sales_price) - avg(sum(ss_sales_price))) / avg(sum(ss_sales_price)) ELSE NULL END > 0.1 The CASE is clearly the most expensive and moves to the end. The other have the same estimated cost, but d_year is highly selective so it stays at the front. -- To view, visit http://gerrit.cloudera.org:8080/2598 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I02279a26fbc6308ac5eb819d78345fc010469034 Gerrit-PatchSet: 16 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Thomas Tauber-Marshall <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Henry Robinson <[email protected]> Gerrit-Reviewer: Marcel Kornacker <[email protected]> Gerrit-Reviewer: Matthew Jacobs <[email protected]> Gerrit-Reviewer: Mostafa Mokhtar <[email protected]> Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]> Gerrit-HasComments: Yes
