[jira] [Commented] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union

John Zhuge (Jira) Tue, 24 Oct 2023 21:33:10 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779326#comment-17779326
 ]


John Zhuge commented on SPARK-45657:
------------------------------------

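This is fixed in the main branch (4.0.0-SNAPSHOT). The optimized plan for Dataset.union now contains the cached InMemoryRelation: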
{code:java}
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.0.0-SNAPSHOT
      /_/

Using Scala version 2.13.12 (OpenJDK 64-Bit Server VM, Java 17.0.7)
Type in expressions to have them evaluated.
Type :help for more information.
23/10/24 21:30:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://192.168.86.29:4040
Spark context available as 'sc' (master = local[*], app id = local-1698208231783).
Spark session available as 'spark'.

scala> spark.sql("select 1 id union select 's2' id").cache()
val res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: string]

scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 's3'")).queryExecution.optimizedPlan
val res1: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Union false, false
:- InMemoryRelation [id#11], StorageLevel(disk, memory, deserialized, 1 replicas)
:     +- AdaptiveSparkPlan isFinalPlan=false
:        +- HashAggregate(keys=[id#2], functions=[], output=[id#2])
:           +- Exchange hashpartitioning(id#2, 200), ENSURE_REQUIREMENTS, [plan_id=30]
:              +- HashAggregate(keys=[id#2], functions=[], output=[id#2])
:                 +- Union
:                    :- Project [1 AS id#2]
:                    :  +- Scan OneRowRelation[]
:                    +- Project [s2 AS id#1]
:                       +- Scan OneRowRelation[]
+- Project [s3 AS s3#13]
   +- OneRowRelation {code}
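
The same check can be done programmatically instead of reading the plan by eye. The following is a minimal sketch for spark-shell, not part of the original report: the helper name usesCache is hypothetical, and it relies on Catalyst developer APIs (LogicalPlan.collectFirst, InMemoryRelation) that are internal and may change between Spark versions.

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.columnar.InMemoryRelation

// Hypothetical helper: true if any node of the plan is a cached relation.
def usesCache(plan: LogicalPlan): Boolean =
  plan.collectFirst { case r: InMemoryRelation => r }.isDefined

// Cache the SQL UNION whose two sides have different column types.
spark.sql("select 1 id union select 's2' id").cache()

// Rebuild the same UNION and combine it with another query via Dataset.union.
val combined = spark.sql("select 1 id union select 's2' id")
  .union(spark.sql("select 's3'"))

// Report whether the optimized plan picked up the cached relation.
println(usesCache(combined.queryExecution.optimizedPlan))
```

Going by the plans in this thread, this should print true on the main branch and false on 3.3.2, but that has only been inferred from the output above, not run separately.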

> Caching SQL UNION of different column data types does not work inside Dataset.union
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-45657
>                 URL: https://issues.apache.org/jira/browse/SPARK-45657
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.2
>            Reporter: John Zhuge
>            Priority: Major
>
>  
> Cache a SQL UNION whose two sides have different column data types:
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").cache()  {code}
> Dataset.union does not use the cache:
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 's3'")).queryExecution.optimizedPlan
> res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Union false, false
> :- Aggregate [id#109], [id#109]
> :  +- Union false, false
> :     :- Project [1 AS id#109]
> :     :  +- OneRowRelation
> :     +- Project [s2 AS id#108]
> :        +- OneRowRelation
> +- Project [s3 AS s3#111]
>    +- OneRowRelation {code}
> A SQL UNION over the cached SQL UNION does use the cache. Note the `InMemoryRelation` in the plan:
> {code:java}
> scala> spark.sql("(select 1 id union select 's2' id) union select 's3'").queryExecution.optimizedPlan
> res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Aggregate [id#117], [id#117]
> +- Union false, false
>    :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 replicas)
>    :     +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100])
>    :        +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, [plan_id=241]
>    :           +- *(3) HashAggregate(keys=[id#100], functions=[], output=[id#100])
>    :              +- Union
>    :                 :- *(1) Project [1 AS id#100]
>    :                 :  +- *(1) Scan OneRowRelation[]
>    :                 +- *(2) Project [s2 AS id#99]
>    :                    +- *(2) Scan OneRowRelation[]
>    +- Project [s3 AS s3#116]
>       +- OneRowRelation {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
