[ https://issues.apache.org/jira/browse/SPARK-27855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hao Ren updated SPARK-27855:
----------------------------
    Description: 
Two Datasets of the same type, converted from different DataFrames, cannot be unioned.

Here is the code to reproduce the problem. It seems `union` only checks the
schema of the original DataFrame, even when both Datasets have already been
converted to the same type.
{code:java}
case class Entity(key: Int, a: Int, b: String)
val df1 = Seq((2,2,"2")).toDF("key", "a", "b").as[Entity]
val df2 = Seq((1,"1",1)).toDF("key", "b", "a").as[Entity]
df1.printSchema
df2.printSchema
df1 union df2
{code}
Result
{code:java}
defined class Entity
df1: org.apache.spark.sql.Dataset[Entity] = [key: int, a: int ... 1 more field]
df2: org.apache.spark.sql.Dataset[Entity] = [key: int, b: string ... 1 more field]
converted
root
|-- key: integer (nullable = false)
|-- a: integer (nullable = false)
|-- b: string (nullable = true)

root
|-- key: integer (nullable = false)
|-- b: string (nullable = true)
|-- a: integer (nullable = false)

org.apache.spark.sql.AnalysisException: Cannot up cast `a` from string to int as it may truncate
The type path of the target object is:
- field (class: "scala.Int", name: "a")
- root class: "Entity"{code}
The problem is that two Datasets of the same type can end up with different schemas.

The schema of a Dataset does not preserve the field order of the case class
definition; it keeps the column order of the original DataFrame.
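
For reference, a possible workaround (not part of the original report, shown here only as a sketch): align the column order with a select before converting, or union by name rather than by position via unionByName, which exists on Dataset since Spark 2.3.0.
{code:java}
// Workaround sketch, assuming Spark 2.3.0+ (unionByName was added in 2.3.0).

// Option 1: reorder the columns to match df1 before converting to
// Dataset[Entity], so the positional union sees identical schemas.
val df2Aligned = Seq((1, "1", 1)).toDF("key", "b", "a")
  .select("key", "a", "b") // same column order as df1
  .as[Entity]
df1 union df2Aligned

// Option 2: resolve columns by name instead of by position.
df1.unionByName(df2)
{code}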



> Union fails between two Datasets of the same type converted from different DataFrames
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-27855
>                 URL: https://issues.apache.org/jira/browse/SPARK-27855
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.3
>            Reporter: Hao Ren
>            Priority: Major
>


