prakharjain09 opened a new pull request #28881:
URL: https://github.com/apache/spark/pull/28881
### What changes were proposed in this pull request?
Create a single post-order rule for ReuseExchange and ReuseSubquery that
traverses the plan in one post-order pass and replaces duplicated nodes with
ReusedExchangeExec and ReusedSubqueryExec.
This fixes the `ReusedExchangeExec Reference issue`, where a
ReusedExchangeExec points to an Exchange that doesn't exist anywhere in the
query plan.
### Why are the changes needed?
Currently Spark makes 3 passes over the plan to identify and replace nodes
that can become ReusedExchangeExec and ReusedSubqueryExec:
Phase-1: The first is done in the ReuseExchange rule to replace Exchange with
ReusedExchangeExec.
Phase-2: The second was introduced by DPP (dynamic partition pruning) in the
ReuseExchange rule. It finds all the InSubqueryExec expressions, traverses the
plans inside them, and replaces the relevant Exchange nodes with
ReusedExchangeExec.
Phase-3: The third is done in the ReuseSubquery rule to identify
ExecSubqueryExpression nodes that are reusable and replace them with
ReusedSubqueryExec.
When Phase-2 or Phase-3 changes anything in the subtree of an Exchange, the
id of that Exchange changes. Sometimes this leaves another
ReusedExchangeExec pointing to an Exchange that no longer exists in the plan.
Example: Suppose this is the plan after Phase-1 when we self-join a view.
                     SORTMERGEJOIN
    Exchange (id=1234)        ReusedExchangeExec (points-to-id=1234)
            |
      ChildSubtree
Suppose ChildSubtree has DPP applied inside it. Phase-2 will then try to
rewrite the plan inside InSubqueryExec to reuse the broadcast exchange, and in
that process the complete hierarchy of ChildSubtree changes as well, i.e.
                     SORTMERGEJOIN
    Exchange (id=1878)        ReusedExchangeExec (points-to-id=1234)
            |
     NewChildSubtree
But `ReusedExchangeExec (points-to-id=1234)` still points to id 1234, which
is no longer in the plan, so no reuse will happen.
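
The stale reference in the example above can be reproduced with a small toy model. This is only an illustrative sketch in Python, not Spark code: `Node`, `rewrite_child`, and `dangling_references` are hypothetical names, and the assumption that every rewritten node receives a fresh id is a simplification of how plan nodes are copied during a transform.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Optional

_next_id = count(1)  # toy id generator; real plan ids are assigned by Spark


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    points_to: Optional[int] = None  # set only on a toy ReusedExchangeExec
    node_id: int = field(default_factory=lambda: next(_next_id))


def rewrite_child(node: Node, new_child: Node) -> Node:
    """Mimic an immutable tree rewrite: the copy gets a fresh id (assumption)."""
    return Node(node.name, [new_child])


def all_ids(node: Node) -> set:
    """Collect the ids of every node in the tree."""
    ids = {node.node_id}
    for child in node.children:
        ids |= all_ids(child)
    return ids


def dangling_references(root: Node) -> List[Node]:
    """Return reuse nodes whose target id is not present in the tree."""
    ids = all_ids(root)
    found = []

    def visit(n: Node) -> None:
        if n.points_to is not None and n.points_to not in ids:
            found.append(n)
        for child in n.children:
            visit(child)

    visit(root)
    return found


# Plan after Phase-1: the reuse node points at an Exchange that is in the plan.
exchange = Node("Exchange", [Node("ChildSubtree")])
reused = Node("ReusedExchangeExec", points_to=exchange.node_id)
plan = Node("SortMergeJoin", [exchange, reused])

# A later phase rewrites the subtree below the Exchange, so the Exchange
# is copied and its id changes, while the reuse node is left untouched.
new_exchange = rewrite_child(exchange, Node("NewChildSubtree"))
plan_after = Node("SortMergeJoin", [new_exchange, reused])
```

Here `dangling_references(plan)` is empty, while `dangling_references(plan_after)` contains the reuse node, which mirrors the `points-to-id=1234` situation described above.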
This PR fixes this issue by merging Phase-1, Phase-2 and Phase-3 into a
single post-order traversal.
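
The idea of a single post-order pass can be sketched on a toy tree. This is a simplified Python illustration under stated assumptions, not the rule added by this PR: `Plan`, `Reused`, and `reuse_post_order` are hypothetical names, structural equality of the original subtree stands in for Spark's canonicalized plan, and subqueries are modeled as ordinary children rather than as expressions.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple, Union


@dataclass(frozen=True)
class Plan:
    kind: str                    # e.g. "Exchange", "Subquery", "Scan", "Join"
    label: str = ""
    children: Tuple["Plan", ...] = ()


@dataclass(frozen=True)
class Reused:
    kind: str                    # "ReusedExchange" or "ReusedSubquery"
    target: Plan                 # the node that is kept in the final plan


REUSABLE = ("Exchange", "Subquery")


def reuse_post_order(
    plan: Plan, seen: Optional[Dict[Plan, Plan]] = None
) -> Union[Plan, Reused]:
    """Rewrite children first, then decide whether this node can be reused."""
    if seen is None:
        seen = {}
    # Post-order: every child is in its final form before the parent is
    # registered, so a registered node is never rewritten afterwards.
    new_children = tuple(reuse_post_order(c, seen) for c in plan.children)
    node = Plan(plan.kind, plan.label, new_children)
    if plan.kind in REUSABLE:
        # Key on the original subtree (stand-in for the canonicalized plan).
        if plan in seen:
            return Reused("Reused" + plan.kind, seen[plan])
        seen[plan] = node
    return node


# Self-join of the same scan: both sides shuffle identically.
left_in = Plan("Exchange", "hash(k)", (Plan("Scan", "t"),))
right_in = Plan("Exchange", "hash(k)", (Plan("Scan", "t"),))
result = reuse_post_order(Plan("Join", "k", (left_in, right_in)))
left, right = result.children
```

After the pass, `right` is a `Reused` node whose `target` is the very object that sits at `left` in the final tree. Because a node is registered only after its whole subtree has been rewritten, no later step in the same pass can replace it and leave the reference stale.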
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UTs.