[ 
https://issues.apache.org/jira/browse/SPARK-40664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-40664.
---------------------------------
    Resolution: Not A Problem

> Union in query can remove cache from the plan
> ---------------------------------------------
>
>                 Key: SPARK-40664
>                 URL: https://issues.apache.org/jira/browse/SPARK-40664
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Tanel Kiis
>            Priority: Major
>
> Failing unit test:
> {code}
>   test("SPARK-40664: Cache with join, union and renames") {
>     val df1 = Seq("1", "2").toDF("a")
>     val df2 = Seq("2", "3").toDF("a")
>       .withColumn("b", lit("b"))
>     val joined = df1.join(broadcast(df2), "a")
>       // Renaming the column through a temporary alias can confuse the cache manager
>       .withColumn("tmp_b", $"b")
>       .drop("b")
>       .withColumnRenamed("tmp_b", "b")
>       .cache()
>     val unioned = joined.union(joined)
>     assertCached(unioned, 2)
>   }
> {code}
> After this PR the test started failing: 
> https://github.com/apache/spark/pull/35214
> Plan before:
> {code}
> == Physical Plan ==
> Union
> :- InMemoryTableScan [a#4, b#23]
> :     +- InMemoryRelation [a#4, b#23], StorageLevel(disk, memory, deserialized, 1 replicas)
> :           +- *(2) Project [a#4, b AS b#23]
> :              +- *(2) BroadcastHashJoin [a#4], [a#10], Inner, BuildRight, false
> :                 :- *(2) Project [value#1 AS a#4]
> :                 :  +- *(2) Filter isnotnull(value#1)
> :                 :     +- *(2) LocalTableScan [value#1]
> :                 +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]),false), [id=#35]
> :                    +- *(1) Project [value#7 AS a#10]
> :                       +- *(1) Filter isnotnull(value#7)
> :                          +- *(1) LocalTableScan [value#7]
> +- InMemoryTableScan [a#4, b#23]
>       +- InMemoryRelation [a#4, b#23], StorageLevel(disk, memory, deserialized, 1 replicas)
>             +- *(2) Project [a#4, b AS b#23]
>                +- *(2) BroadcastHashJoin [a#4], [a#10], Inner, BuildRight, false
>                   :- *(2) Project [value#1 AS a#4]
>                   :  +- *(2) Filter isnotnull(value#1)
>                   :     +- *(2) LocalTableScan [value#1]
>                   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]),false), [id=#35]
>                      +- *(1) Project [value#7 AS a#10]
>                         +- *(1) Filter isnotnull(value#7)
>                            +- *(1) LocalTableScan [value#7]
> {code}
> Plan after:
> {code}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Union
>    :- Project [a#4, b AS b#23]
>    :  +- BroadcastHashJoin [a#4], [a#10], Inner, BuildRight, false
>    :     :- Project [value#1 AS a#4]
>    :     :  +- Filter isnotnull(value#1)
>    :     :     +- LocalTableScan [value#1]
>    :     +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]),false), [id=#115]
>    :        +- Project [value#7 AS a#10]
>    :           +- Filter isnotnull(value#7)
>    :              +- LocalTableScan [value#7]
>    +- Project [a#4, b AS b#39]
>       +- BroadcastHashJoin [a#4], [a#10], Inner, BuildRight, false
>          :- Project [value#36 AS a#4]
>          :  +- Filter isnotnull(value#36)
>          :     +- LocalTableScan [value#36]
>          +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]),false), [id=#118]
>             +- Project [value#37 AS a#10]
>                +- Filter isnotnull(value#37)
>                   +- LocalTableScan [value#37]
> {code}
> (The InMemoryTableScan nodes are missing from the plan after the change.)


