[
https://issues.apache.org/jira/browse/SPARK-28156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wenchen Fan resolved SPARK-28156.
---------------------------------
Resolution: Fixed
Fix Version/s: 3.0.0
Issue resolved by pull request 24960
[https://github.com/apache/spark/pull/24960]
> Join plan sometimes does not use cached query
> ---------------------------------------------
>
> Key: SPARK-28156
> URL: https://issues.apache.org/jira/browse/SPARK-28156
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.3, 3.0.0, 2.4.3
> Reporter: Bruce Robbins
> Priority: Major
> Fix For: 3.0.0
>
>
> I came across a case where a cached query is referenced on both sides of a
> join, but the InMemoryRelation is inserted on only one side. This case occurs
> only when the cached query uses a (Hive-style) view.
> Consider this example:
> {noformat}
> // create the data
> val df1 = Seq.tabulate(10) { x => (x, x + 1, x + 2, x + 3) }.toDF("a", "b",
> "c", "d")
> df1.write.mode("overwrite").format("orc").saveAsTable("table1")
> sql("drop view if exists table1_vw")
> sql("create view table1_vw as select * from table1")
> // create the cached query
> val cacheddataDf = sql("""
> select a, b, c, d
> from table1_vw
> """)
> import org.apache.spark.storage.StorageLevel.DISK_ONLY
> cacheddataDf.createOrReplaceTempView("cacheddata")
> cacheddataDf.persist(DISK_ONLY)
> // main query
> val queryDf = sql(s"""
> select leftside.a, leftside.b
> from cacheddata leftside
> join cacheddata rightside
> on leftside.a = rightside.a
> """)
> queryDf.explain(true)
> {noformat}
> Note that the optimized plan does not use an InMemoryRelation for the right
> side, but instead just uses a Relation:
> {noformat}
> Project [a#45, b#46]
> +- Join Inner, (a#45 = a#37)
> :- Project [a#45, b#46]
> : +- Filter isnotnull(a#45)
> : +- InMemoryRelation [a#45, b#46, c#47, d#48], StorageLevel(disk, 1
> replicas)
> : +- *(1) FileScan orc default.table1[a#37,b#38,c#39,d#40]
> Batched: true, DataFilters: [], Format: ORC, Location:
> InMemoryFileIndex[file:/Users/brobbins/github/spark_upstream/spark-warehouse/table1],
> PartitionFilters: [], PushedFilters: [], ReadSchema:
> struct<a:int,b:int,c:int,d:int>
> +- Project [a#37]
> +- Filter isnotnull(a#37)
> +- Relation[a#37,b#38,c#39,d#40] orc
> {noformat}
> The fragment does not match the cached query because AliasViewChild adds an
> extra projection under the View on the right side (see #2 below).
> AliasViewChild adds the extra projection because the exprIds in the View's
> output appears to have been renamed by Analyzer$ResolveReferences (#1 below).
> I have not yet looked at why.
> {noformat}
> -
> -
> -
> +- SubqueryAlias `rightside`
> +- SubqueryAlias `cacheddata`
> +- Project [a#73, b#74, c#75, d#76]
> +- SubqueryAlias `default`.`table1_vw`
> (#1) -> +- View (`default`.`table1_vw`, [a#73,b#74,c#75,d#76])
> (#2) -> +- Project [cast(a#45 as int) AS a#73, cast(b#46 as int) AS
> b#74, cast(c#47 as int) AS c#75, cast(d#48 as int) AS d#76]
> +- Project [cast(a#37 as int) AS a#45, cast(b#38 as int)
> AS b#46, cast(c#39 as int) AS c#47, cast(d#40 as int) AS d#48]
> +- Project [a#37, b#38, c#39, d#40]
> +- SubqueryAlias `default`.`table1`
> +- Relation[a#37,b#38,c#39,d#40] orc
> {noformat}
> In a larger query (where cachedata may be referred on either side only
> indirectly), this phenomenon can create certain oddities, as the fragment is
> not replaced with InMemoryRelation, and the fragment is present when the plan
> is optimized as a whole.
> In Spark 2.1.3, Spark uses InMemoryRelation on both sides.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]