[
https://issues.apache.org/jira/browse/SPARK-51068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-51068:
-----------------------------------
Labels: pull-request-available (was: )
> CTEs are not canonicalized and resulting in cached result not being used and
> recomputed
> ---------------------------------------------------------------------------------------
>
> Key: SPARK-51068
> URL: https://issues.apache.org/jira/browse/SPARK-51068
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.2, 3.1.3, 3.3.2
> Reporter: Nimesh Khandelwal
> Priority: Major
> Labels: pull-request-available
>
> To check whether a plan already exists in the cache, CacheManager compares
> the canonicalized form of the plan. Canonicalization currently does not
> handle CTE IDs, so queries that use CTEs cause unnecessary cache misses even
> when an equivalent plan has been cached. The issue started with the commit
> [Avoid inlining non-deterministic
> With-CTEs|https://github.com/apache/spark/pull/33671/files], which
> introduced CTERelationDef and CTERelationRef without handling their
> canonicalization.
> {code:java}
> >>> spark.sql("CACHE TABLE cached_cte AS WITH cte1 AS ( SELECT 1 AS id, 'Alice' AS name UNION ALL SELECT 2 AS id, 'Bob' AS name ), cte2 AS ( SELECT 1 AS id, 10 AS score UNION ALL SELECT 2 AS id, 20 AS score ) SELECT cte1.id, cte1.name, cte2.score FROM cte1 JOIN cte2 ON cte1.id = cte2.id");
> DataFrame[]
> >>> spark.sql("select count(*) from cached_cte").explain()
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[], functions=[count(1)])
>    +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=165]
>       +- HashAggregate(keys=[], functions=[partial_count(1)])
>          +- Project
>             +- BroadcastHashJoin [id#120], [id#124], Inner, BuildRight, false
>                :- Union
>                :  :- Project [1 AS id#120]
>                :  :  +- Scan OneRowRelation[]
>                :  +- Project [2 AS id#122]
>                :     +- Scan OneRowRelation[]
>                +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [plan_id=160]
>                   +- Union
>                      :- Project [1 AS id#124]
>                      :  +- Scan OneRowRelation[]
>                      +- Project [2 AS id#126]
>                         +- Scan OneRowRelation[]{code}
>
>
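The cache-miss mechanism described above can be illustrated with a small
toy model in plain Python. This is only a sketch and is not Spark code: the
names CteDef, CteRef, Plan, ToyCache, analyze, canonicalize_naive and
canonicalize_normalized are all hypothetical stand-ins, and renumbering CTE
IDs by definition order is just one plausible normalization, not necessarily
what the actual fix does. The sketch assumes only what the report states:
the cache is keyed by the canonicalized plan, and each analysis of a query
assigns fresh CTE IDs.

```python
from dataclasses import dataclass, replace
from itertools import count
from typing import Tuple


# Hypothetical stand-ins for plan nodes; these are NOT Spark's classes.
@dataclass(frozen=True)
class CteDef:
    cte_id: int
    body: str  # query text standing in for the CTE's child plan


@dataclass(frozen=True)
class CteRef:
    cte_id: int


@dataclass(frozen=True)
class Plan:
    defs: Tuple[CteDef, ...]
    refs: Tuple[CteRef, ...]


_next_id = count()


def analyze(bodies):
    """Build a plan, giving every CTE a fresh, globally unique ID."""
    ids = [next(_next_id) for _ in bodies]
    defs = tuple(CteDef(i, b) for i, b in zip(ids, bodies))
    refs = tuple(CteRef(i) for i in ids)
    return Plan(defs, refs)


def canonicalize_naive(plan):
    """Leave CTE IDs untouched, mirroring the behavior in the report."""
    return plan


def canonicalize_normalized(plan):
    """Assumed fix: renumber CTE IDs by their position among the definitions."""
    mapping = {d.cte_id: pos for pos, d in enumerate(plan.defs)}
    defs = tuple(replace(d, cte_id=mapping[d.cte_id]) for d in plan.defs)
    refs = tuple(replace(r, cte_id=mapping[r.cte_id]) for r in plan.refs)
    return Plan(defs, refs)


class ToyCache:
    """Cache keyed by the canonicalized plan, like the lookup described above."""

    def __init__(self, canonicalize):
        self.canonicalize = canonicalize
        self.entries = {}

    def put(self, plan, data):
        self.entries[self.canonicalize(plan)] = data

    def lookup(self, plan):
        return self.entries.get(self.canonicalize(plan))


bodies = ["SELECT 1 AS id, 'Alice' AS name", "SELECT 1 AS id, 10 AS score"]
first = analyze(bodies)   # the plan stored by CACHE TABLE
second = analyze(bodies)  # the same query analyzed again, with new CTE IDs

naive = ToyCache(canonicalize_naive)
naive.put(first, "cached rows")
naive_hit = naive.lookup(second)

fixed = ToyCache(canonicalize_normalized)
fixed.put(first, "cached rows")
fixed_hit = fixed.lookup(second)

print(naive_hit)  # None: the differing CTE IDs make the keys unequal
print(fixed_hit)  # cached rows: the normalized IDs make the keys equal
```

In the toy model the two plans differ only in their CTE IDs, which is
enough to turn a lookup that should hit into a miss unless the IDs are
normalized before comparison.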
--
This message was sent by Atlassian Jira
(v8.20.10#820010)