[jira] [Updated] (SPARK-51068) CTEs are not canonicalized and resulting in cached result not being used and recomputed

Nimesh Khandelwal (Jira) Tue, 25 Feb 2025 20:36:07 -0800


     [ https://issues.apache.org/jira/browse/SPARK-51068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Nimesh Khandelwal updated SPARK-51068:
--------------------------------------
    Affects Version/s:     (was: 4.0.0)

> CTEs are not canonicalized and resulting in cached result not being used and 
> recomputed
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-51068
>                 URL: https://issues.apache.org/jira/browse/SPARK-51068
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.2, 3.1.3, 3.3.2
>            Reporter: Nimesh Khandelwal
>            Priority: Major
>
> To check whether a plan exists in the cache, CacheManager compares the 
> canonicalized version of the plan. Canonicalization currently does not 
> normalize CTE IDs, which causes unnecessary cache misses when a cached 
> query uses CTEs. The issue was introduced by the commit 
> [Avoid inlining non-deterministic 
> With-CTEs|https://github.com/apache/spark/pull/33671/files], which added 
> CTERelationDef and CTERelationRef without handling their canonicalization. 
> In the example below, the query against cached_cte is planned as the 
> original Union and join instead of a scan of the cached data.
> {code:java}
> >>> spark.sql("""
> ...     CACHE TABLE cached_cte AS
> ...     WITH cte1 AS (SELECT 1 AS id, 'Alice' AS name UNION ALL SELECT 2 AS id, 'Bob' AS name),
> ...          cte2 AS (SELECT 1 AS id, 10 AS score UNION ALL SELECT 2 AS id, 20 AS score)
> ...     SELECT cte1.id, cte1.name, cte2.score FROM cte1 JOIN cte2 ON cte1.id = cte2.id
> ... """)
> DataFrame[]
> >>> spark.sql("select count(*) from cached_cte").explain()
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[], functions=[count(1)])
>    +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=165]
>       +- HashAggregate(keys=[], functions=[partial_count(1)])
>          +- Project
>             +- BroadcastHashJoin [id#120], [id#124], Inner, BuildRight, false
>                :- Union
>                :  :- Project [1 AS id#120]
>                :  :  +- Scan OneRowRelation[]
>                :  +- Project [2 AS id#122]
>                :     +- Scan OneRowRelation[]
>                +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [plan_id=160]
>                   +- Union
>                      :- Project [1 AS id#124]
>                      :  +- Scan OneRowRelation[]
>                      +- Project [2 AS id#126]
>                         +- Scan OneRowRelation[]{code}
>  
>  
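
The cache-miss mechanism described in the report can be illustrated with a small, self-contained Python model. This is only a sketch under stated assumptions: CTEDef, CTERef, WithCTE, parse and both canonicalize functions are hypothetical stand-ins invented for illustration, not Spark classes or APIs, and renumbering ids by position is one plausible normalization, not necessarily the fix that Spark adopts. The model assumes that each parse hands out fresh CTE ids, so two plans built from the same SQL differ only in those ids.

```python
from dataclasses import dataclass, replace
from itertools import count
from typing import Tuple


# Toy stand-ins for plan nodes. These are NOT Spark's real classes.
@dataclass(frozen=True)
class CTEDef:
    cte_id: int  # assumed to be fresh on every parse
    body: str    # query text standing in for the child plan


@dataclass(frozen=True)
class CTERef:
    cte_id: int  # refers to the CTEDef with the same id


@dataclass(frozen=True)
class WithCTE:
    defs: Tuple[CTEDef, ...]
    refs: Tuple[CTERef, ...]


_ids = count()  # hypothetical global id generator


def parse(bodies):
    """Build a toy plan, assigning a fresh id to each CTE definition."""
    defs = tuple(CTEDef(next(_ids), body) for body in bodies)
    refs = tuple(CTERef(d.cte_id) for d in defs)
    return WithCTE(defs, refs)


def canonicalize_naive(plan):
    """Leave CTE ids untouched, as in the behaviour the report describes."""
    return plan


def canonicalize_with_ids(plan):
    """Renumber CTE ids by position so that equal queries compare equal."""
    mapping = {d.cte_id: i for i, d in enumerate(plan.defs)}
    defs = tuple(replace(d, cte_id=mapping[d.cte_id]) for d in plan.defs)
    refs = tuple(replace(r, cte_id=mapping[r.cte_id]) for r in plan.refs)
    return WithCTE(defs, refs)


def cache_hit(canonicalize, cached_plan, lookup_plan):
    """Model a cache lookup keyed on the canonicalized plan."""
    cache = {canonicalize(cached_plan)}
    return canonicalize(lookup_plan) in cache


# The same query parsed twice: once when cached, once when looked up.
bodies = ["SELECT 1 AS id, 'Alice' AS name", "SELECT 1 AS id, 10 AS score"]
first = parse(bodies)
second = parse(bodies)

print(cache_hit(canonicalize_naive, first, second))     # False: ids differ
print(cache_hit(canonicalize_with_ids, first, second))  # True: ids normalized
```

In this model the two plans are structurally identical, yet the naive lookup misses because the raw ids (0, 1 versus 2, 3) take part in the comparison. Once the ids are renumbered by position, both plans canonicalize to the same value and the lookup hits.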



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
