prakharjain09 opened a new pull request #28881:
URL: https://github.com/apache/spark/pull/28881


   ### What changes were proposed in this pull request?
   Create a single post-order rule for ReuseExchange and ReuseSubquery which 
traverses the plan in a single post-order pass and replaces duplicated nodes 
with ReusedExchangeExec or ReusedSubqueryExec.
   
   This fixes the `ReusedExchangeExec Reference issue`, where a 
ReusedExchangeExec points to an Exchange which doesn't exist anywhere in the 
query plan.
   
   
   ### Why are the changes needed?
   
   Currently Spark makes 3 passes over the plan to identify and replace nodes 
which can become ReusedExchangeExec and ReusedSubqueryExec:
   Phase-1: The first is done in the ReuseExchange rule to replace Exchange 
with ReusedExchangeExec.
   Phase-2: The second was introduced by DPP in the ReuseExchange rule to find 
all the InSubqueryExec, traverse the plans inside them and replace relevant 
Exchange with ReusedExchangeExec.
   Phase-3: The third is done in the ReuseSubquery rule to identify 
ExecSubqueryExpression which are reusable and replace them with 
ReusedSubqueryExec.
   
   When Phase-2/Phase-3 changes anything in the subtree of an Exchange, the 
id of that Exchange changes, and sometimes this leaves another 
ReusedExchangeExec pointing to an Exchange which no longer exists in the plan.
   
   Example: Suppose this is the plan after Phase-1 when we try to do a self 
join of a view.
   
                           SORTMERGEJOIN
       Exchange (id=1234)               ReusedExchangeExec (points-to-id=1234)
               |
          ChildSubtree
   
   Suppose ChildSubtree has DPP applied inside it. Phase-2 will then try to 
convert the plan inside InSubqueryExec to reuse a broadcast exchange, and in 
that process the complete hierarchy of ChildSubtree will also change, i.e.
   
                           SORTMERGEJOIN
       Exchange (id=1878)               ReusedExchangeExec (points-to-id=1234)
               |
         NewChildSubtree
   
   But the `ReusedExchangeExec (points-to-id=1234)` is still pointing to id 
1234, which no longer exists in the plan, so no reuse will happen.
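
The failure mode can be reproduced in a toy model. The sketch below is not Spark code: `Node`, `all_nodes` and `dangling_refs` are hypothetical names, and the only assumption carried over from the description is that rebuilding a subtree gives the enclosing Exchange a fresh id.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Optional

_next_id = count(1)  # hypothetical id source, standing in for plan node ids


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    reuses: Optional[int] = None  # id of the Exchange a reuse node points to
    id: int = field(default_factory=lambda: next(_next_id))


def all_nodes(plan):
    """Yield every node of the plan, parent before children."""
    yield plan
    for child in plan.children:
        yield from all_nodes(child)


def dangling_refs(plan):
    """Return the target ids of reuse nodes whose target is not in the plan."""
    live = {n.id for n in all_nodes(plan)}
    return [n.reuses for n in all_nodes(plan)
            if n.reuses is not None and n.reuses not in live]


# Plan after Phase-1: the right side reuses the left Exchange.
exchange = Node("Exchange", [Node("ChildSubtree")])
reused = Node("ReusedExchangeExec", reuses=exchange.id)
before = Node("SortMergeJoin", [exchange, reused])

# A later phase rewrites the subtree, so the Exchange is rebuilt with a new id,
# while the reuse node on the other side is left untouched.
new_exchange = Node("Exchange", [Node("NewChildSubtree")])
after = Node("SortMergeJoin", [new_exchange, reused])
```

In this model `dangling_refs(before)` is empty, while `dangling_refs(after)` contains the id of the original Exchange.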
   
   This PR fixes this issue by merging Phase-1, Phase-2 and Phase-3 into a 
single post-order traversal.
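
A minimal sketch of the single-pass idea, again as a toy model and not the rule added by this PR: `Plan`, `canonical`, `reuse_all` and the `REUSABLE` table are hypothetical, and subqueries are modelled as ordinary children for brevity. Children are rewritten before their parent is registered, so a reuse node can only point at a node in its final form.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Plan:
    name: str
    children: Tuple["Plan", ...] = ()
    reuses: Optional["Plan"] = None  # set only on reuse wrapper nodes


# Hypothetical mapping from a reusable node kind to its wrapper name.
REUSABLE = {"Exchange": "ReusedExchangeExec", "Subquery": "ReusedSubqueryExec"}


def canonical(plan):
    """Structural key; a reuse wrapper is equivalent to the node it points to."""
    if plan.reuses is not None:
        return canonical(plan.reuses)
    return (plan.name, tuple(canonical(c) for c in plan.children))


def reuse_all(plan, seen=None):
    """Replace duplicate reusable nodes in one post-order traversal."""
    seen = {} if seen is None else seen
    if plan.reuses is not None:
        return plan  # already a reuse wrapper, leave it alone
    # Post-order: children first, so this node is registered in its final form.
    node = Plan(plan.name, tuple(reuse_all(c, seen) for c in plan.children))
    wrapper = REUSABLE.get(plan.name)
    if wrapper is None:
        return node
    key = canonical(node)
    if key in seen:
        return Plan(wrapper, reuses=seen[key])
    seen[key] = node
    return node


def side():
    # Both join sides are built separately but are structurally identical.
    return Plan("Exchange", (Plan("Filter", (Plan("Subquery", (Plan("Scan"),)),)),))


result = reuse_all(Plan("SortMergeJoin", (side(), side())))
```

Here the right-hand Exchange becomes a `ReusedExchangeExec` wrapper whose target is the very object that sits on the left of the join, even though a Subquery inside it was replaced during the same pass.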
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Added UTs.
   

