[GitHub] [spark] tanelk commented on a change in pull request #29871: [SPARK-32995][SQL] CostBasedJoinReorder optimizer rule should be idempotent

GitBox Mon, 28 Sep 2020 10:01:21 -0700


tanelk commented on a change in pull request #29871:
URL: https://github.com/apache/spark/pull/29871#discussion_r496097705




##########
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
##########
@@ -206,13 +206,15 @@ object JoinReorderDP extends PredicateHelper with Logging 
{
       filters: Option[JoinGraphInfo]): JoinPlanMap = {
 
     val nextLevel = new JoinPlanMap
-    var k = 0
     val lev = existingLevels.length - 1
+    var k = lev
     // Build plans for the next level from plans at level k (one side of the 
join) and level
     // lev - k (the other side of the join).
-    // For the lower level k, we only need to search from 0 to lev - k, 
because when building
+    // For the higher level k, we only need to search from lev to lev - k, 
because when building
     // a join from A and B, both A J B and B J A are handled.
-    while (k <= lev - k) {
+    // Start searching from highest level to make sure that optimally ordered 
input doesn't get
+    // reordered into another plan with the same cost.

Review comment:
       This was also my first idea, but I couldn't figure it out.
   
   I'll try to give two examples, where all pairwise joins have the same cost 
and we can join only plans with consecutive letters (AB, but bot AC)
   Keep in mind, that we try build left-deep trees:
   
https://github.com/apache/spark/blob/4619acc3ce72a6df9ac849d045ecdbed56a562be/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala#L301
   
   Let the inital join be ((AB)C) - firstly join A and B and then join with C.
   Lets look at the candidates generated at each iteration: 
   
   Firstly if we kept the first candidate (current code)
   
   1. iteration (the initial state):
   (A)  (B)  (C)   
   
   2. iteration (nested loop over the single elements):
   (AB)  (BC)   
   Disgarded (BA) and (CB), because they were late
   
   3. iteration (outer loop is single element, inner is two element):
   ((BC)A)
   Disgarded ((AB)C) because it was late
   
   Secondly if we kept the last one (possible change, that I considered)
   1. iteration (the initial state):
   (A)  (B)  (C)   
   
   2. iteration (nested loop over the single elements):
   (BA)  (CB)   
   Disgarded (AB) and (BC), because they were overwritten
   
   3. iteration (outer loop is single element, inner is two element):
   ((BA)C)
   Disgarded ((CB)A), because it was overwritten
   
   Neither of these gave the original result. But lets try iterating from the 
largest (while keeping the first candidate). This is my proposition.
   
   1. iteration (the initial state):
   (A)  (B)  (C)   
   The initial state
   
   2. iteration (nested loop over the single elements):
   (AB)  (BC)   
   Disgarded (BA) and (CB), because they were late
   
   3. iteration (outer loop is two element, inner is single element):
   ((AB)C)
   Disgarded ((BC)A) because it was late
   
   Now, if we would add another plan D, then in the 4. iteration it would be 
appended to the ((AB)C) and we would still have correct result.
   
   Note that if the input would be ((AB)(CD)), then the result would also be 
(((AB)C)D), but we must be stable only on second time this is applied and 
because we output left-deep trees, then we must be stable on left-deep trees, 
to be idempotent. 
   
   Also by left-deep we mean "as left-deep" as possible. 
   
   And finally, yes, I know that this dummy example is no rigorous proof, but 
we have 2 things going for us:
   1. Out of all the test cases we have, none broke this rule
   2. The idempotentsi test is used only in UTs, so there is no chance of 
breaking the user side by marking this as  idempotent. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] tanelk commented on a change in pull request #29871: [SPARK-32995][SQL] CostBasedJoinReorder optimizer rule should be idempotent

Reply via email to