wenyuen-db opened a new pull request, #42635:
URL: https://github.com/apache/spark/pull/42635

   
   
   ### What changes were proposed in this pull request?
   
   Previously, when a CTE had duplicate expression IDs in its output, the rule 
PushdownPredicatesAndPruneColumnsForCTEDef wrongly concluded that the CTE's 
columns had been pruned. It compared the size of the attribute set containing 
the union of columns (which is deduplicated) with the length of the CTE's 
original output (which contains the duplicate columns) and found the former to 
be smaller than the latter. This caused incorrect pruning of the CTE output, 
leaving a missing reference and producing the error documented in the ticket.
   
   This PR changes the logic to use the needsPruning function to assess whether 
a CTE has been pruned. That function uses the outputSet, rather than the 
output, to check whether any columns have been pruned.
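
The difference between the two checks can be illustrated with a minimal, Spark-free sketch. `Attr`, `prunedBySizeOfOutput`, and `prunedByOutputSet` are hypothetical stand-ins invented for this illustration, not Catalyst classes or the rule's actual code, and the set-based condition is only an assumption about what a check like needsPruning might look like.

```scala
// Hypothetical stand-in for a Catalyst attribute: a name plus an expression ID.
final case class Attr(name: String, exprId: Long)

object CtePruningSketch {
  // Size-based check as described above: compares the number of distinct
  // needed expression IDs with the length of the output sequence, which may
  // list the same expression ID more than once.
  def prunedBySizeOfOutput(needed: Set[Long], output: Seq[Attr]): Boolean =
    needed.size < output.size

  // Set-based check (assumed shape): deduplicate the output first, then
  // report pruning only if the needed IDs are a strictly smaller subset.
  def prunedByOutputSet(needed: Set[Long], output: Seq[Attr]): Boolean = {
    val outputSet = output.map(_.exprId).toSet
    needed.size < outputSet.size && needed.subsetOf(outputSet)
  }

  def main(args: Array[String]): Unit = {
    // CTE output in which expression ID 1 appears twice.
    val output = Seq(Attr("a", 1L), Attr("a", 1L), Attr("b", 2L))
    // Every distinct column is still needed.
    val needed = Set(1L, 2L)

    println(prunedBySizeOfOutput(needed, output)) // true: 2 < 3, a false positive
    println(prunedByOutputSet(needed, output))    // false: 2 < 2 does not hold
  }
}
```

With duplicates in the output, the size-based check reports pruning even though every distinct column is still needed, while the set-based check does not.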
   
   ### Why are the changes needed?
   
   The incorrect behaviour of PushdownPredicatesAndPruneColumnsForCTEDef on 
CTEs with duplicate expression IDs in their output causes a crash when such a 
query is run.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   A unit test for the crashing case was added.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

