[Impala-CR](cdh5-trunk) IMPALA-2805: Order conjuncts based on selectivity and cost

Thomas Tauber-Marshall (Code Review) Fri, 29 Apr 2016 09:57:42 -0700

Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-2805: Order conjuncts based on selectivity and cost
......................................................................



Patch Set 15:

(1 comment)

File Edit Options Buffers Tools Help                                            
                                                                                
                                                                                
             
TPCH results from the latest version of the code:

 
+----------+-----------------------+---------+------------+------------+----------------+
 | Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | 
Delta(GeoMean) |
 
+----------+-----------------------+---------+------------+------------+----------------+
 | TPCH()   | parquet / none / none | 3.15    | -0.46%     | 2.65       | 
-0.75%         |
 
+----------+-----------------------+---------+------------+------------+----------------+

 
+----------+----------------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+
 | Workload | Query          | File Format           | Avg(s) | Base Avg(s) | 
Delta(Avg) | StdDev(%) | Base StdDev(%) | Num Clients | Iters |
 
+----------+----------------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+
 | TPCH()   | TPCH-Q14       | parquet / none / none | 2.63   | 2.55        |   
+3.19%   |   1.68%   |   0.82%        | 1           | 5     |
 | TPCH()   | TPCH-Q22       | parquet / none / none | 1.18   | 1.15        |   
+2.92%   |   4.60%   |   2.52%        | 1           | 5     |
 | TPCH()   | TPCH-Q15       | parquet / none / none | 4.04   | 3.94        |   
+2.60%   |   3.02%   |   1.81%        | 1           | 5     |
 | TPCH()   | TPCH-Q3        | parquet / none / none | 4.13   | 4.09        |   
+1.00%   |   1.27%   |   1.07%        | 1           | 5     |
 | TPCH()   | TPCH-Q19       | parquet / none / none | 4.23   | 4.19        |   
+0.94%   |   1.03%   |   1.77%        | 1           | 5     |
 | TPCH()   | TPCH-Q18       | parquet / none / none | 5.40   | 5.37        |   
+0.58%   |   0.80%   |   1.37%        | 1           | 5     |
 | TPCH()   | TPCH-Q2        | parquet / none / none | 1.24   | 1.23        |   
+0.58%   |   4.61%   |   1.74%        | 1           | 5     |
 | TPCH()   | TPCH-Q17       | parquet / none / none | 3.46   | 3.44        |   
+0.53%   |   1.34%   |   1.89%        | 1           | 5     |
 | TPCH()   | TPCH-Q21       | parquet / none / none | 9.05   | 9.02        |   
+0.40%   |   0.50%   |   0.68%        | 1           | 5     |
 | TPCH()   | TPCH-Q20       | parquet / none / none | 3.09   | 3.10        |   
-0.43%   |   1.72%   |   1.99%        | 1           | 5     |
 | TPCH()   | TPCH-Q5        | parquet / none / none | 3.55   | 3.57        |   
-0.51%   |   0.11%   |   1.11%        | 1           | 5     |
 | TPCH()   | TPCH-Q7        | parquet / none / none | 4.07   | 4.10        |   
-0.60%   |   0.83%   |   1.55%        | 1           | 5     |
 | TPCH()   | TPCH-Q9        | parquet / none / none | 4.80   | 4.84        |   
-0.82%   |   0.76%   |   1.85%        | 1           | 5     |
 | TPCH()   | TPCH-Q4        | parquet / none / none | 3.77   | 3.80        |   
-0.88%   |   1.50%   |   3.12%        | 1           | 5     |
 | TPCH()   | TPCH-Q8        | parquet / none / none | 3.75   | 3.80        |   
-1.27%   |   1.46%   |   1.15%        | 1           | 5     |
 | TPCH()   | TPCH-Q13       | parquet / none / none | 1.83   | 1.86        |   
-1.63%   |   0.16%   |   1.08%        | 1           | 5     |
 | TPCH()   | TPCH-Q12       | parquet / none / none | 3.43   | 3.49        |   
-1.73%   |   2.01%   |   1.86%        | 1           | 5     |
 | TPCH()   | TPCH-Q10       | parquet / none / none | 3.70   | 3.79        |   
-2.32%   |   0.58%   |   1.33%        | 1           | 5     |
 | TPCH()   | TPCH-Q6        | parquet / none / none | 2.41   | 2.49        |   
-3.00%   |   0.96%   |   0.91%        | 1           | 5     |
 | TPCH()   | TPCH-Q16       | parquet / none / none | 1.17   | 1.21        |   
-3.15%   |   1.87%   |   4.64%        | 1           | 5     |
 | TPCH()   | TPCH-Q1        | parquet / none / none | 3.67   | 3.81        |   
-3.84%   |   1.13%   |   3.97%        | 1           | 5     |
 | TPCH()   | TPCH-Q11       | parquet / none / none | 1.29   | 1.35        |   
-4.75%   |   1.72%   |   3.07%        | 1           | 5     |
 
+----------+----------------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+

I did 5 iterations this time, so there's a little less noise. The only query 
that shows an increase of > 1 std is Q14, which wasn't affected by this change 
so that probably reflects random variation.

The tpch queries that did have their plans change:

Q6:

-   predicates: l_shipdate >= '1994-01-01', l_shipdate < '1995-01-01', 
l_discount >= 0.05, l_discount <= 0.07, l_quantity < 24
+   predicates: l_discount >= 0.05, l_discount <= 0.07, l_quantity < 24, 
l_shipdate >= '1994-01-01', l_shipdate < '1995-01-01'

All we did here is more numeric comparison in front of string comparisons. We 
don't have selectivity estimates for any of these, so this seems to be clearly 
the best that we can do under our current model.

Q16:

-|     predicates: p_brand != 'Brand#45', NOT p_type LIKE 'MEDIUM POLISHED%', 
p_size IN (49, 14, 23, 45, 19, 3, 36, 9)
+|     predicates: p_size IN (49, 14, 23, 45, 19, 3, 36, 9), p_brand != 
'Brand#45', NOT p_type LIKE 'MEDIUM POLISHED%'

We've discussed this one a lot, but to sum up: we know that the IN is highly 
selective and less expensive, so putting it in front is what we want.

Q17:

-|  |     predicates: p_brand = 'Brand#23', p_container = 'MED BOX'
+|  |     predicates: p_container = 'MED BOX', p_brand = 'Brand#23'

As two string comparisons of similar length, we're estimating their costs as 
about the same, but we (correctly) estimate that "MED BOX" is more selective, 
so we want that one first.

There are a few other tpch plans that have multiple conjuncts under the same 
plan node where we're not making any changes to the order, but none that look 
obviously like they would benefit from being reordered (most have just two 
conjuncts that are eit\
her both string comparisons or both numeric comparisons), so I'm reasonably 
confident this shows we're making good ordering decisions, at least for tpch. 
I'll have new results up for tpcds soon.

http://gerrit.cloudera.org:8080/#/c/2598/15/testdata/workloads/functional-planner/queries/PlannerTest/joins.test
File testdata/workloads/functional-planner/queries/PlannerTest/joins.test:

Line 1648: |  hash predicates: string_col = string_col, bigint_col = bigint_col
> what happened?
Originally, we had just hard coded the idea that string comparisons are more 
expensive than numeric comparisons (with VAR_LENGTH_BINARY_PRED_COST), and we 
don't have selectivity for either, so we had reordered these.

All of the strings in both string_cols here are of length 1, though, so when I 
moved to basing string comparison cost off of string length, the cost of the 
string_col comparison went down a lot, which is reasonable - comparing two 
strings of length 1 is roughly the same cost as comparing two integers.

However, we're actually estimating the cost of the string comparison here as 3x 
less than the int comparison, which is probably not right and points to a flaw 
in my string comparison cost calculation - I'm only using the string lengths, 
and not adding in the child costs as I do everywhere else, which is wrong (eg. 
because a string in a comparison might actually be produced by a complicated 
subexpression and we shouldn't drop the cost of that subexpression), so I'll 
fix this and submit a new version.


-- 
To view, visit http://gerrit.cloudera.org:8080/2598
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I02279a26fbc6308ac5eb819d78345fc010469034
Gerrit-PatchSet: 15
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Thomas Tauber-Marshall <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Henry Robinson <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Matthew Jacobs <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) IMPALA-2805: Order conjuncts based on selectivity and cost

Reply via email to