[jira] [Resolved] (SPARK-28156) Join plan sometimes does not use cached query

Wenchen Fan (JIRA) Wed, 03 Jul 2019 06:26:11 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-28156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Wenchen Fan resolved SPARK-28156.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 3.0.0

Issue resolved by pull request 24960
[https://github.com/apache/spark/pull/24960]

> Join plan sometimes does not use cached query
> ---------------------------------------------
>
>                 Key: SPARK-28156
>                 URL: https://issues.apache.org/jira/browse/SPARK-28156
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.3, 3.0.0, 2.4.3
>            Reporter: Bruce Robbins
>            Priority: Major
>             Fix For: 3.0.0
>
>
> I came across a case where a cached query is referenced on both sides of a 
> join, but the InMemoryRelation is inserted on only one side. This case occurs 
> only when the cached query uses a (Hive-style) view.
> Consider this example:
> {noformat}
> // create the data
> val df1 = Seq.tabulate(10) { x => (x, x + 1, x + 2, x + 3) }.toDF("a", "b", 
> "c", "d")
> df1.write.mode("overwrite").format("orc").saveAsTable("table1")
> sql("drop view if exists table1_vw")
> sql("create view table1_vw as select * from table1")
> // create the cached query
> val cacheddataDf = sql("""
> select a, b, c, d
> from table1_vw
> """)
> import org.apache.spark.storage.StorageLevel.DISK_ONLY
> cacheddataDf.createOrReplaceTempView("cacheddata")
> cacheddataDf.persist(DISK_ONLY)
> // main query
> val queryDf = sql(s"""
> select leftside.a, leftside.b
> from cacheddata leftside
> join cacheddata rightside
> on leftside.a = rightside.a
> """)
> queryDf.explain(true)
> {noformat}
> Note that the optimized plan does not use an InMemoryRelation for the right 
> side, but instead just uses a Relation:
> {noformat}
> Project [a#45, b#46]
> +- Join Inner, (a#45 = a#37)
>    :- Project [a#45, b#46]
>    :  +- Filter isnotnull(a#45)
>    :     +- InMemoryRelation [a#45, b#46, c#47, d#48], StorageLevel(disk, 1 
> replicas)
>    :           +- *(1) FileScan orc default.table1[a#37,b#38,c#39,d#40] 
> Batched: true, DataFilters: [], Format: ORC, Location: 
> InMemoryFileIndex[file:/Users/brobbins/github/spark_upstream/spark-warehouse/table1],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct<a:int,b:int,c:int,d:int>
>    +- Project [a#37]
>       +- Filter isnotnull(a#37)
>          +- Relation[a#37,b#38,c#39,d#40] orc
> {noformat}
> The fragment does not match the cached query because AliasViewChild adds an 
> extra projection under the View on the right side (see #2 below).
> AliasViewChild adds the extra projection because the exprIds in the View's 
> output appears to have been renamed by Analyzer$ResolveReferences (#1 below). 
> I have not yet looked at why.
> {noformat}
> -
> -
> -
>    +- SubqueryAlias `rightside`
>       +- SubqueryAlias `cacheddata`
>          +- Project [a#73, b#74, c#75, d#76]
>             +- SubqueryAlias `default`.`table1_vw`
> (#1) ->        +- View (`default`.`table1_vw`, [a#73,b#74,c#75,d#76])
> (#2) ->           +- Project [cast(a#45 as int) AS a#73, cast(b#46 as int) AS 
> b#74, cast(c#47 as int) AS c#75, cast(d#48 as int) AS d#76]
>                      +- Project [cast(a#37 as int) AS a#45, cast(b#38 as int) 
> AS b#46, cast(c#39 as int) AS c#47, cast(d#40 as int) AS d#48]
>                         +- Project [a#37, b#38, c#39, d#40]
>                            +- SubqueryAlias `default`.`table1`
>                               +- Relation[a#37,b#38,c#39,d#40] orc
> {noformat}
> In a larger query (where cachedata may be referred on either side only 
> indirectly), this phenomenon can create certain oddities, as the fragment is 
> not replaced with InMemoryRelation, and the fragment is present when the plan 
> is optimized as a whole.
> In Spark 2.1.3, Spark uses InMemoryRelation on both sides.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Resolved] (SPARK-28156) Join plan sometimes does not use cached query

Reply via email to