[jira] [Updated] (SPARK-10967) Incorrect Join behavior in filter conditions

RaviShankar KS (JIRA) Tue, 06 Oct 2015 22:18:13 -0700

     [ https://issues.apache.org/jira/browse/SPARK-10967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


RaviShankar KS updated SPARK-10967:
-----------------------------------
    Description: 
Join conditions do not work as expected when the columns being compared are 
nested columns (structs or arrays of structs).
As long as the leaf column names under the nested columns are the same, should 
their order matter?

Consider the example below for two data frames, d5 and d5_opp:

d5.printSchema
root
 |-- key: integer (nullable = false)
 |-- value: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- col1: string (nullable = true)
 |    |    |-- col2: string (nullable = true)
 |-- value1: struct (nullable = false)
 |    |-- col1: string (nullable = false)
 |    |-- col2: string (nullable = false)

d5_opp.printSchema
root
 |-- key: integer (nullable = false)
 |-- value: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- col2: string (nullable = true)
 |    |    |-- col1: string (nullable = true)
 |-- value1: struct (nullable = false)
 |    |-- col2: string (nullable = false)
 |    |-- col1: string (nullable = false)

The join statements below do not work:

d5.as("d5").join(d5_opp.as("d5_opp"), $"d5.value" === $"d5_opp.value", "inner").show
The exception raised is:
org.apache.spark.sql.AnalysisException: cannot resolve '(value = value)' due to 
data type mismatch: differing types in '(value = value)' 
(array<struct<col1:string,col2:string>> and 
array<struct<col2:string,col1:string>>).;

d5.as("d5").join(d5_opp.as("d5_opp"), $"d5.value1" === $"d5_opp.value1", "inner").show
The exception raised is:
org.apache.spark.sql.AnalysisException: cannot resolve '(value1 = value1)' due 
to data type mismatch: differing types in '(value1 = value1)' 
(struct<col1:string,col2:string> and struct<col2:string,col1:string>).;



  was:
We notice that the join conditions are not working as expected in the case of 
nested columns being compared.

Consider below example for two data frames d5 and d5_opp : 


d5.printSchema
root
 |-- key: integer (nullable = false)
 |-- value: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- col1: string (nullable = true)
 |    |    |-- col2: string (nullable = true)
 |-- value1: struct (nullable = false)
 |    |-- col1: string (nullable = false)
 |    |-- col2: string (nullable = false)

d5_opp.printSchema
root
 |-- key: integer (nullable = false)
 |-- value: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- col2: string (nullable = true)
 |    |    |-- col1: string (nullable = true)
 |-- value1: struct (nullable = false)
 |    |-- col2: string (nullable = false)
 |    |-- col1: string (nullable = false)

 



> Incorrect Join behavior in filter conditions
> --------------------------------------------
>
>                 Key: SPARK-10967
>                 URL: https://issues.apache.org/jira/browse/SPARK-10967
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 1.4.1
>         Environment: Ubuntu on AWS
>            Reporter: RaviShankar KS
>            Assignee: Josh Rosen
>              Labels: sql, union
>             Fix For: 1.5.0
>
>
> Join conditions do not work as expected when the columns being compared are 
> nested columns (structs or arrays of structs).
> As long as the leaf column names under the nested columns are the same, 
> should their order matter?
> Consider the example below for two data frames, d5 and d5_opp:
> d5.printSchema
> root
>  |-- key: integer (nullable = false)
>  |-- value: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- col1: string (nullable = true)
>  |    |    |-- col2: string (nullable = true)
>  |-- value1: struct (nullable = false)
>  |    |-- col1: string (nullable = false)
>  |    |-- col2: string (nullable = false)
> d5_opp.printSchema
> root
>  |-- key: integer (nullable = false)
>  |-- value: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- col2: string (nullable = true)
>  |    |    |-- col1: string (nullable = true)
>  |-- value1: struct (nullable = false)
>  |    |-- col2: string (nullable = false)
>  |    |-- col1: string (nullable = false)
> The join statements below do not work:
> d5.as("d5").join(d5_opp.as("d5_opp"), $"d5.value" === $"d5_opp.value",
> "inner").show
> The exception raised is:
> org.apache.spark.sql.AnalysisException: cannot resolve '(value = value)' due 
> to data type mismatch: differing types in '(value = value)' 
> (array<struct<col1:string,col2:string>> and 
> array<struct<col2:string,col1:string>>).;
> d5.as("d5").join(d5_opp.as("d5_opp"), $"d5.value1" === $"d5_opp.value1",
> "inner").show
> The exception raised is:
> org.apache.spark.sql.AnalysisException: cannot resolve '(value1 = value1)' 
> due to data type mismatch: differing types in '(value1 = value1)' 
> (struct<col1:string,col2:string> and struct<col2:string,col1:string>).;
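
One possible workaround for the struct column value1 is to compare the leaf fields directly instead of the struct as a whole, so the field order inside each struct does not take part in the type check. The sketch below is only an illustration, not a fix from the Spark project. The case classes (Pair, PairOpp, Rec, RecOpp), the object name, and the helper joinOnLeafFields are hypothetical names, and the sample rows are made up. It also assumes a Spark version that provides SparkSession (2.0 or later); on the 1.4.1 release named in this issue a SQLContext would take its place. The array column value is not covered here.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical case classes mirroring the value1 structs from the report.
case class Pair(col1: String, col2: String)    // struct<col1:string,col2:string>
case class PairOpp(col2: String, col1: String) // struct<col2:string,col1:string>
case class Rec(key: Int, value1: Pair)
case class RecOpp(key: Int, value1: PairOpp)

object NestedJoinSketch {
  // Join on the leaf fields of value1 rather than on the struct itself.
  // Each comparison is string-to-string, so the differing field order in
  // the two struct types is never compared.
  def joinOnLeafFields(d5: DataFrame, d5Opp: DataFrame): DataFrame =
    d5.as("d5").join(
      d5Opp.as("d5_opp"),
      col("d5.value1.col1") === col("d5_opp.value1.col1") &&
        col("d5.value1.col2") === col("d5_opp.value1.col2"),
      "inner")

  def main(args: Array[String]): Unit = {
    // Assumption: a local SparkSession, used for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("nested-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows with the same leaf values in opposite field order.
    val d5 = Seq(Rec(1, Pair("a", "b"))).toDF()
    val d5Opp = Seq(RecOpp(1, PairOpp(col2 = "b", col1 = "a"))).toDF()

    joinOnLeafFields(d5, d5Opp).show()
    spark.stop()
  }
}
```

Comparing leaf fields keeps the join condition independent of how each side declares its struct, at the cost of listing every field by hand.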



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
