rednaxelafx commented on a change in pull request #23303: [SPARK-26352][SQL] ReorderJoin should not change the order of columns
URL: https://github.com/apache/spark/pull/23303#discussion_r241261717
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala
 ##########
 @@ -99,6 +99,14 @@ object ReorderJoin extends Rule[LogicalPlan] with PredicateHelper {
       } else {
         createOrderedJoin(input, conditions)
       }
+
+      if (p.sameOutput(reordered)) {
+        reordered
+      } else {
+        // Reordering the joins have changed the order of the columns.
+        // Inject a projection to make sure we restore to the expected ordering.
+        Project(p.output, reordered)
 
 Review comment:
   That's right, only the top-level plan really needs to preserve the column
order. But adding the projection here is the easiest to implement: the change
is local to the rule where the order could have changed, which makes it easier
to understand than adding the projection elsewhere. It also doesn't affect the
final result, because other optimizer rules are going to get rid of the extra
intermediate projections.
   
   e.g. if on top of the `df`, we do an extra operation:
   ```
   df.groupBy('a, 'b).agg(first('i), first('j), first('x), first('y))
   ```
   you're going to see that the extra `Project` gets optimized away in:
   ```
   === Result of Batch Operator Optimization before Inferring Filters ===
   Before:
    Aggregate [a#65, b#66], [a#65, b#66, first(i#63, false) AS first(i, false)#121, first(j#64, false) AS first(j, false)#122, first(x#61, false) AS first(x, false)#123, first(y#62, false) AS first(y, false)#124]
    +- Project [x#61, y#62, i#63, j#64, a#65, b#66]
       +- Join Inner, ((a#65 = x#61) && (b#66 = i#63))
          :- Project [x#61, y#62, i#63, j#64]
          :  +- Join Cross
          :     :- Relation[x#61,y#62] parquet
          :     +- Relation[i#63,j#64] parquet
          +- Relation[a#65,b#66] parquet
   After:
    Aggregate [a#65, b#66], [a#65, b#66, first(i#63, false) AS first(i, false)#121, first(j#64, false) AS first(j, false)#122, first(x#61, false) AS first(x, false)#123, first(y#62, false) AS first(y, false)#124]
    +- Join Cross, (b#66 = i#63)
       :- Join Inner, (a#65 = x#61)
       :  :- Relation[x#61,y#62] parquet
       :  +- Relation[a#65,b#66] parquet
       +- Relation[i#63,j#64] parquet
   ```
   (a few other rules may also remove the extra `Project`)
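
To make the argument above concrete, here is a minimal sketch in plain Scala. It is a toy model, not Catalyst's actual classes: plans carry only column names, and `restoreOutput` and `dropReorderingProject` are hypothetical helpers standing in for the guard added in this diff and for the later cleanup rules, respectively.

```scala
// Toy plan model (an assumption for illustration, not Spark's LogicalPlan).
sealed trait Plan { def output: Seq[String] }
case class Relation(output: Seq[String]) extends Plan
case class Join(left: Plan, right: Plan) extends Plan {
  // A join exposes the left columns followed by the right columns.
  def output: Seq[String] = left.output ++ right.output
}
case class Project(columns: Seq[String], child: Plan) extends Plan {
  def output: Seq[String] = columns
}
case class Aggregate(keys: Seq[String], child: Plan) extends Plan {
  def output: Seq[String] = keys
}

object ToyReorder {
  // Hypothetical stand-in for the guard in the diff: if reordering changed
  // the column order, wrap the result in a Project that restores it.
  def restoreOutput(original: Plan, reordered: Plan): Plan =
    if (original.output == reordered.output) reordered
    else Project(original.output, reordered)

  // Hypothetical stand-in for the later cleanup: an Aggregate refers to
  // columns by name, so a Project directly below it that only permutes
  // its child's columns can be dropped.
  def dropReorderingProject(plan: Plan): Plan = plan match {
    case Aggregate(keys, Project(cols, child))
        if cols.sorted == child.output.sorted =>
      Aggregate(keys, child)
    case other => other
  }
}

object ToyExample {
  val xy = Relation(Seq("x", "y"))
  val ij = Relation(Seq("i", "j"))
  val ab = Relation(Seq("a", "b"))

  // Join order as written by the user vs. the order picked by reordering.
  val original: Plan = Join(Join(xy, ij), ab)
  val reordered: Plan = Join(Join(xy, ab), ij)

  // The column order differs, so a Project is injected on top.
  val fixed: Plan = ToyReorder.restoreOutput(original, reordered)

  // Under an Aggregate, the injected Project is no longer needed.
  val optimized: Plan =
    ToyReorder.dropReorderingProject(Aggregate(Seq("a", "b"), fixed))
}
```

In this toy model `ToyExample.fixed` keeps the original column order, while `ToyExample.optimized` contains no `Project` at all, which mirrors the shape of the plan change shown above.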

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
