[jira] [Updated] (SPARK-10967) Incorrect Join behavior in filter conditions

RaviShankar KS (JIRA) Tue, 06 Oct 2015 23:09:54 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-10967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


RaviShankar KS updated SPARK-10967:
-----------------------------------
    Description: 
We notice that the join conditions are not working as expected in the case of 
nested columns being compared.
As long as leaf columns have the same name under a nested column, should order 
matter ??

Consider below example for two data frames d5 and d5_opp : 
d5 and d5_opp have a nested field 'value', but their inner leaf columns do not 
have the same ordering. 

--       d5.printSchema
root
 |-- key: integer (nullable = false)
 |-- value: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- col1: string (nullable = true)
 |    |    |-- col2: string (nullable = true)
 |-- value1: struct (nullable = false)
 |    |-- col1: string (nullable = false)
 |    |-- col2: string (nullable = false)

--        d5_opp.printSchema
root
 |-- key: integer (nullable = false)
 |-- value: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- col2: string (nullable = true)
 |    |    |-- col1: string (nullable = true)
 |-- value1: struct (nullable = false)
 |    |-- col2: string (nullable = false)
 |    |-- col1: string (nullable = false)

The below join statement do not work in spark 1.5, and raises exception. In 
spark 1.4, no exception is raised, but join result is incorrect :

--    d5.as("d5").join( d5_opp.as("d5_opp"),  $"d5.value"  === $"d5_opp.value", 
 "inner").show
Exception raised is :  
org.apache.spark.sql.AnalysisException: cannot resolve '(value = value)' due to 
data type mismatch: differing types in '(value = value)' 
(array<struct<col1:string,col2:string>> and 
array<struct<col2:string,col1:string>>).;

--    d5.as("d5").join( d5_opp.as("d5_opp"),  $"d5.value1"  === 
$"d5_opp.value1",  "inner").show
Exception raised is :
org.apache.spark.sql.AnalysisException: cannot resolve '(value1 = value1)' due 
to data type mismatch: differing types in '(value1 = value1)' 
(struct<col1:string,col2:string> and struct<col2:string,col1:string>).;

// Code to be used in spark shell to create the data frames is attached.
-------------------------
The only work-around is to explode the conditions for every leaf field. 
In our case, we are generating the conditions and dataframes programmatically, 
and exploding the conditions for every leaf field is additional overhead, and 
may not be always possible.

  was:
We notice that the join conditions are not working as expected in the case of 
nested columns being compared.
As long as leaf columns have the same name under a nested column, should order 
matter ??

Consider below example for two data frames d5 and d5_opp : 
d5 and d5_opp have a nested field 'value', but their inner leaf columns do not 
have the same ordering. 

--       d5.printSchema
root
 |-- key: integer (nullable = false)
 |-- value: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- col1: string (nullable = true)
 |    |    |-- col2: string (nullable = true)
 |-- value1: struct (nullable = false)
 |    |-- col1: string (nullable = false)
 |    |-- col2: string (nullable = false)

--        d5_opp.printSchema
root
 |-- key: integer (nullable = false)
 |-- value: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- col2: string (nullable = true)
 |    |    |-- col1: string (nullable = true)
 |-- value1: struct (nullable = false)
 |    |-- col2: string (nullable = false)
 |    |-- col1: string (nullable = false)

The below join statement do not work in spark 1.5, and raises exception. In 
spark 1.4, no exception is raised, but join result is incorrect :

--    d5.as("d5").join( d5_opp.as("d5_opp"),  $"d5.value"  === $"d5_opp.value", 
 "inner").show
Exception raised is :  
org.apache.spark.sql.AnalysisException: cannot resolve '(value = value)' due to 
data type mismatch: differing types in '(value = value)' 
(array<struct<col1:string,col2:string>> and 
array<struct<col2:string,col1:string>>).;

--    d5.as("d5").join( d5_opp.as("d5_opp"),  $"d5.value1"  === 
$"d5_opp.value1",  "inner").show
Exception raised is :
org.apache.spark.sql.AnalysisException: cannot resolve '(value1 = value1)' due 
to data type mismatch: differing types in '(value1 = value1)' 
(struct<col1:string,col2:string> and struct<col2:string,col1:string>).;

-------------------------
The only work-around is to explode the conditions for every leaf field. 
In our case, we are generating the conditions and dataframes programmatically, 
and exploding the conditions for every leaf field is additional overhead, and 
may not be always possible.


> Incorrect Join behavior in filter conditions
> --------------------------------------------
>
>                 Key: SPARK-10967
>                 URL: https://issues.apache.org/jira/browse/SPARK-10967
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 1.4.1
>         Environment: RHEL
>            Reporter: RaviShankar KS
>            Assignee: Josh Rosen
>              Labels: sql, union
>             Fix For: 1.5.0
>
>
> We notice that the join conditions are not working as expected in the case of 
> nested columns being compared.
> As long as leaf columns have the same name under a nested column, should 
> order matter ??
> Consider below example for two data frames d5 and d5_opp : 
> d5 and d5_opp have a nested field 'value', but their inner leaf columns do 
> not have the same ordering. 
> --       d5.printSchema
> root
>  |-- key: integer (nullable = false)
>  |-- value: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- col1: string (nullable = true)
>  |    |    |-- col2: string (nullable = true)
>  |-- value1: struct (nullable = false)
>  |    |-- col1: string (nullable = false)
>  |    |-- col2: string (nullable = false)
> --        d5_opp.printSchema
> root
>  |-- key: integer (nullable = false)
>  |-- value: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- col2: string (nullable = true)
>  |    |    |-- col1: string (nullable = true)
>  |-- value1: struct (nullable = false)
>  |    |-- col2: string (nullable = false)
>  |    |-- col1: string (nullable = false)
> The below join statement do not work in spark 1.5, and raises exception. In 
> spark 1.4, no exception is raised, but join result is incorrect :
> --    d5.as("d5").join( d5_opp.as("d5_opp"),  $"d5.value"  === 
> $"d5_opp.value",  "inner").show
> Exception raised is :  
> org.apache.spark.sql.AnalysisException: cannot resolve '(value = value)' due 
> to data type mismatch: differing types in '(value = value)' 
> (array<struct<col1:string,col2:string>> and 
> array<struct<col2:string,col1:string>>).;
> --    d5.as("d5").join( d5_opp.as("d5_opp"),  $"d5.value1"  === 
> $"d5_opp.value1",  "inner").show
> Exception raised is :
> org.apache.spark.sql.AnalysisException: cannot resolve '(value1 = value1)' 
> due to data type mismatch: differing types in '(value1 = value1)' 
> (struct<col1:string,col2:string> and struct<col2:string,col1:string>).;
> // Code to be used in spark shell to create the data frames is attached.
> -------------------------
> The only work-around is to explode the conditions for every leaf field. 
> In our case, we are generating the conditions and dataframes 
> programmatically, and exploding the conditions for every leaf field is 
> additional overhead, and may not be always possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SPARK-10967) Incorrect Join behavior in filter conditions

Reply via email to