minyyy opened a new pull request #35864:
URL: https://github.com/apache/spark/pull/35864


   ### What changes were proposed in this pull request?
   
   The "prune unrequired references" branch has the condition:
   
   `case p @ Project(_, g: Generate) if p.references != g.outputSet => `
   
   This is wrong: with generators like Inline, the plan always enters this branch 
whenever the project does not use all of the generator output.
   
   Example:
   
   input: <col1: array<struct<a: struct<a: int, b: int>, b: int>>>
   
   Project(a.a as x)
   \- Generate(Inline(col1), ..., a, b)
   
   p.references is [a]
   g.outputSet is [a, b]
   
   Because of this bug, such plans never reach the GeneratorNestedColumnAliasing 
branch below and thus miss some optimization opportunities. This PR changes the 
condition to check whether there is a child output attribute that is not used by 
the project and is either not used by the generator or not already put into 
unrequiredChildOutput.
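
The difference between the two conditions can be sketched with plain string sets. This is a simplified model for illustration only, not Spark's code: `GenerateModel`, `PruneCondition`, `oldCondition`, and `newCondition` are hypothetical names, and the model assumes that a Generate node outputs its still-required child columns followed by the generator output.

```scala
// Hypothetical, simplified stand-in for a Generate node; not Spark's class.
final case class GenerateModel(
    childOutput: Seq[String],            // attributes produced by the child plan
    generatorReferences: Set[String],    // attributes the generator reads
    generatorOutput: Seq[String],        // attributes the generator produces
    unrequiredChildOutput: Set[String]) { // child attributes already pruned from the output

  // Assumption: output = required child output ++ generator output.
  def outputSet: Set[String] =
    (childOutput.filterNot(unrequiredChildOutput) ++ generatorOutput).toSet
}

object PruneCondition {
  // Condition quoted in the description: true whenever the two sets differ.
  def oldCondition(projectRefs: Set[String], g: GenerateModel): Boolean =
    projectRefs != g.outputSet

  // Condition as described in prose: some child attribute is unused by the
  // project and is either unused by the generator or not yet marked unrequired.
  def newCondition(projectRefs: Set[String], g: GenerateModel): Boolean =
    g.childOutput.exists { a =>
      !projectRefs.contains(a) &&
        (!g.generatorReferences.contains(a) || !g.unrequiredChildOutput.contains(a))
    }
}
```

For the example above (child output `col1`, generator output `a, b`, project references `a`), once `col1` has been put into `unrequiredChildOutput` the old condition in this model still holds because `[a]` differs from `[a, b]`, while the new condition no longer holds, so a later branch gets a chance to match.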
   
   ### Why are the changes needed?
   The wrong condition prevents some expressions like Inline and PosExplode from 
being optimized by the rules after it. Before this PR, the test query added in the 
PR is not optimized because the optimization rule cannot be applied to it. 
After this PR, the optimization rule is applied to the query correctly.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Unit tests.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


