Thomas Tauber-Marshall has posted comments on this change.
Change subject: IMPALA-2805: Order conjuncts based on selectivity and cost
......................................................................
Patch Set 15:
(1 comment)
File Edit Options Buffers Tools Help
TPCH results from the latest version of the code:
+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format | Avg (s) | Delta(Avg) | GeoMean(s) |
Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH() | parquet / none / none | 3.15 | -0.46% | 2.65 |
-0.75% |
+----------+-----------------------+---------+------------+------------+----------------+
+----------+----------------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+
| Workload | Query | File Format | Avg(s) | Base Avg(s) |
Delta(Avg) | StdDev(%) | Base StdDev(%) | Num Clients | Iters |
+----------+----------------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+
| TPCH() | TPCH-Q14 | parquet / none / none | 2.63 | 2.55 |
+3.19% | 1.68% | 0.82% | 1 | 5 |
| TPCH() | TPCH-Q22 | parquet / none / none | 1.18 | 1.15 |
+2.92% | 4.60% | 2.52% | 1 | 5 |
| TPCH() | TPCH-Q15 | parquet / none / none | 4.04 | 3.94 |
+2.60% | 3.02% | 1.81% | 1 | 5 |
| TPCH() | TPCH-Q3 | parquet / none / none | 4.13 | 4.09 |
+1.00% | 1.27% | 1.07% | 1 | 5 |
| TPCH() | TPCH-Q19 | parquet / none / none | 4.23 | 4.19 |
+0.94% | 1.03% | 1.77% | 1 | 5 |
| TPCH() | TPCH-Q18 | parquet / none / none | 5.40 | 5.37 |
+0.58% | 0.80% | 1.37% | 1 | 5 |
| TPCH() | TPCH-Q2 | parquet / none / none | 1.24 | 1.23 |
+0.58% | 4.61% | 1.74% | 1 | 5 |
| TPCH() | TPCH-Q17 | parquet / none / none | 3.46 | 3.44 |
+0.53% | 1.34% | 1.89% | 1 | 5 |
| TPCH() | TPCH-Q21 | parquet / none / none | 9.05 | 9.02 |
+0.40% | 0.50% | 0.68% | 1 | 5 |
| TPCH() | TPCH-Q20 | parquet / none / none | 3.09 | 3.10 |
-0.43% | 1.72% | 1.99% | 1 | 5 |
| TPCH() | TPCH-Q5 | parquet / none / none | 3.55 | 3.57 |
-0.51% | 0.11% | 1.11% | 1 | 5 |
| TPCH() | TPCH-Q7 | parquet / none / none | 4.07 | 4.10 |
-0.60% | 0.83% | 1.55% | 1 | 5 |
| TPCH() | TPCH-Q9 | parquet / none / none | 4.80 | 4.84 |
-0.82% | 0.76% | 1.85% | 1 | 5 |
| TPCH() | TPCH-Q4 | parquet / none / none | 3.77 | 3.80 |
-0.88% | 1.50% | 3.12% | 1 | 5 |
| TPCH() | TPCH-Q8 | parquet / none / none | 3.75 | 3.80 |
-1.27% | 1.46% | 1.15% | 1 | 5 |
| TPCH() | TPCH-Q13 | parquet / none / none | 1.83 | 1.86 |
-1.63% | 0.16% | 1.08% | 1 | 5 |
| TPCH() | TPCH-Q12 | parquet / none / none | 3.43 | 3.49 |
-1.73% | 2.01% | 1.86% | 1 | 5 |
| TPCH() | TPCH-Q10 | parquet / none / none | 3.70 | 3.79 |
-2.32% | 0.58% | 1.33% | 1 | 5 |
| TPCH() | TPCH-Q6 | parquet / none / none | 2.41 | 2.49 |
-3.00% | 0.96% | 0.91% | 1 | 5 |
| TPCH() | TPCH-Q16 | parquet / none / none | 1.17 | 1.21 |
-3.15% | 1.87% | 4.64% | 1 | 5 |
| TPCH() | TPCH-Q1 | parquet / none / none | 3.67 | 3.81 |
-3.84% | 1.13% | 3.97% | 1 | 5 |
| TPCH() | TPCH-Q11 | parquet / none / none | 1.29 | 1.35 |
-4.75% | 1.72% | 3.07% | 1 | 5 |
+----------+----------------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+
I did 5 iterations this time, so there's a little less noise. The only query
that shows an increase of > 1 std is Q14, which wasn't affected by this change
so that probably reflects random variation.
The tpch queries that did have their plans change:
Q6:
- predicates: l_shipdate >= '1994-01-01', l_shipdate < '1995-01-01',
l_discount >= 0.05, l_discount <= 0.07, l_quantity < 24
+ predicates: l_discount >= 0.05, l_discount <= 0.07, l_quantity < 24,
l_shipdate >= '1994-01-01', l_shipdate < '1995-01-01'
All we did here is more numeric comparison in front of string comparisons. We
don't have selectivity estimates for any of these, so this seems to be clearly
the best that we can do under our current model.
Q16:
-| predicates: p_brand != 'Brand#45', NOT p_type LIKE 'MEDIUM POLISHED%',
p_size IN (49, 14, 23, 45, 19, 3, 36, 9)
+| predicates: p_size IN (49, 14, 23, 45, 19, 3, 36, 9), p_brand !=
'Brand#45', NOT p_type LIKE 'MEDIUM POLISHED%'
We've discussed this one a lot, but to sum up: we know that the IN is highly
selective and less expensive, so putting it in front is what we want.
Q17:
-| | predicates: p_brand = 'Brand#23', p_container = 'MED BOX'
+| | predicates: p_container = 'MED BOX', p_brand = 'Brand#23'
As two string comparisons of similar length, we're estimating their costs as
about the same, but we (correctly) estimate that "MED BOX" is more selective,
so we want that one first.
There are a few other tpch plans that have multiple conjuncts under the same
plan node where we're not making any changes to the order, but none that look
obviously like they would benefit from being reordered (most have just two
conjuncts that are eit\
her both string comparisons or both numeric comparisons), so I'm reasonably
confident this shows we're making good ordering decisions, at least for tpch.
I'll have new results up for tpcds soon.
http://gerrit.cloudera.org:8080/#/c/2598/15/testdata/workloads/functional-planner/queries/PlannerTest/joins.test
File testdata/workloads/functional-planner/queries/PlannerTest/joins.test:
Line 1648: | hash predicates: string_col = string_col, bigint_col = bigint_col
> what happened?
Originally, we had just hard coded the idea that string comparisons are more
expensive than numeric comparisons (with VAR_LENGTH_BINARY_PRED_COST), and we
don't have selectivity for either, so we had reordered these.
All of the strings in both string_cols here are of length 1, though, so when I
moved to basing string comparison cost off of string length, the cost of the
string_col comparison went down a lot, which is reasonable - comparing two
strings of length 1 is roughly the same cost as comparing two integers.
However, we're actually estimating the cost of the string comparison here as 3x
less than the int comparison, which is probably not right and points to a flaw
in my string comparison cost calculation - I'm only using the string lengths,
and not adding in the child costs as I do everywhere else, which is wrong (eg.
because a string in a comparison might actually be produced by a complicated
subexpression and we shouldn't drop the cost of that subexpression), so I'll
fix this and submit a new version.
--
To view, visit http://gerrit.cloudera.org:8080/2598
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: comment
Gerrit-Change-Id: I02279a26fbc6308ac5eb819d78345fc010469034
Gerrit-PatchSet: 15
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Thomas Tauber-Marshall <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Henry Robinson <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Matthew Jacobs <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-HasComments: Yes