maryannxue opened a new pull request #33671:
URL: https://github.com/apache/spark/pull/33671


   ### What changes were proposed in this pull request?
   This PR fixes an existing correctness issue where a non-deterministic 
With-CTE can be executed multiple times producing different results, by 
deferring the inline of With-CTE to after the analysis stage. This fix also 
provides the future opportunity of performance improvement by executing 
deterministic With-CTEs only once in some circumstances.
   
   The major changes include:
   1. Added new With-CTE logical nodes: `CTERelationDef`, `CTERelationRef`, 
`WithCTE`. Each `CTERelationDef` has a unique ID and the mapping between CTE 
def and CTE ref is based on IDs rather than names. `WithCTE` is a resolved 
version of `With`, only that: 1) `WithCTE` is a multi-children logical node so 
that most logical rules can automatically apply to CTE defs; 2) In the main 
query and each subquery, there can only be at most one `WithCTE`, which means 
nested With-CTEs are combined.
   2. Changed `CTESubstitution` rule so that if NOT in legacy mode, CTE defs 
will not be inlined immediately, but rather transformed into a `CTERelationRef` 
per reference.
   3. Added new With-CTE rules: 1) `ResolveWithCTE` - to update 
`CTERelationRef`s with resolved output from corresponding `CTERelationDef`s; 2) 
`InlineCTE` - to inline deterministic CTEs or non-deterministic CTEs with only 
ONE reference; 3) `UpdateCTERelationStats` - to update stats for 
`CTERelationRef`s that are not inlined.
   4. Added a CTE physical planning strategy to plan `CTERelationRef`s as an 
independent shuffle with round-robin partitioning so that such CTEs will only 
be materialized once and different references will later be a shuffle reuse.
   
   A current limitation is that With-CTEs mixed with SQL commands or DMLs will 
still go through the old inline code path because of our non-standard language 
specs and not-unified command/DML interfaces.
   
   ### Why are the changes needed?
   This is a correctness issue. Non-deterministic CTEs should produce the same 
output regardless of how many times it is referenced/used in query, while under 
the current implementation there is no such guarantee and would lead to 
incorrect query results.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Added UTs.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to