[ https://issues.apache.org/jira/browse/SPARK-19425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liang-Chi Hsieh updated SPARK-19425:
------------------------------------
    Description: 
DataFrame.except doesn't work for UDT columns. This is because 
ExtractEquiJoinKeys runs Literal.default against the UDT. However, 
Literal.default doesn't handle UDTs, so an exception is thrown like:

java.lang.RuntimeException: no default for type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7
  at org.apache.spark.sql.catalyst.expressions.Literal$.default(literals.scala:179)
  at org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys$$anonfun$4.apply(patterns.scala:117)
  at org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys$$anonfun$4.apply(patterns.scala:110)
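
For reference, a minimal reproduction sketch (spark-shell style; the dataset and column names are just illustrative):

import org.apache.spark.ml.linalg.Vectors
import spark.implicits._

// Two Datasets sharing a VectorUDT column; except() goes through
// ExtractEquiJoinKeys, which calls Literal.default on the UDT column's type.
val df1 = Seq((1, Vectors.dense(1.0, 2.0)), (2, Vectors.dense(3.0, 4.0))).toDF("id", "features")
val df2 = Seq((1, Vectors.dense(1.0, 2.0))).toDF("id", "features")

// Currently throws: java.lang.RuntimeException: no default for type ...VectorUDT
df1.except(df2).show()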

An earlier idea was to simply skip columns whose types don't provide a default 
literal when extracting join keys.

A simpler fix is to just let Literal.default handle a UDT by its sql type, so we 
can still use the more efficient join types on UDT columns.
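
A rough sketch of the idea (illustrative only, not the actual patch; in the real change the UDT case would live inside Literal.default in catalyst, where UserDefinedType is visible):

import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.{DataType, UserDefinedType}

// Sketch: ask the UDT for its underlying sqlType and build the default
// literal for that, instead of failing with "no default for type ...".
def defaultLiteral(dataType: DataType): Literal = dataType match {
  case udt: UserDefinedType[_] => defaultLiteral(udt.sqlType)
  case other                   => Literal.default(other)
}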

Besides except, this also fixes other similar scenarios. In summary, this fixes:

* except on two Datasets with UDT columns
* intersect on two Datasets with UDT columns
* joins whose join conditions use <=> on UDT columns (see the sketch after this list)
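
For example, reusing df1 and df2 from the reproduction above, a null-safe equality join on the UDT column would then be matched by ExtractEquiJoinKeys instead of hitting the "no default for type" error (again, just an illustration):

// Null-safe equality (<=>) on the VectorUDT column; with the fix this is
// extracted as an equi-join key and can use a hash- or sort-merge join.
val joined = df1.join(df2, df1("features") <=> df2("features"))
joined.explain()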


  was:
DataFrame.except doesn't work for UDT columns. This is because 
ExtractEquiJoinKeys runs Literal.default against the UDT. However, 
Literal.default doesn't handle UDTs, so an exception is thrown like:

java.lang.RuntimeException: no default for type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7
  at org.apache.spark.sql.catalyst.expressions.Literal$.default(literals.scala:179)
  at org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys$$anonfun$4.apply(patterns.scala:117)
  at org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys$$anonfun$4.apply(patterns.scala:110)

We should simply skip columns whose types don't provide a default literal when 
extracting join keys.


> Make ExtractEquiJoinKeys support UDT columns
> --------------------------------------------
>
>                 Key: SPARK-19425
>                 URL: https://issues.apache.org/jira/browse/SPARK-19425
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Liang-Chi Hsieh
>
> DataFrame.except doesn't work for UDT columns. This is because 
> ExtractEquiJoinKeys runs Literal.default against the UDT. However, 
> Literal.default doesn't handle UDTs, so an exception is thrown like:
> java.lang.RuntimeException: no default for type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7
>   at org.apache.spark.sql.catalyst.expressions.Literal$.default(literals.scala:179)
>   at org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys$$anonfun$4.apply(patterns.scala:117)
>   at org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys$$anonfun$4.apply(patterns.scala:110)
> An earlier idea was to simply skip columns whose types don't provide a default 
> literal when extracting join keys.
> A simpler fix is to just let Literal.default handle a UDT by its sql type, so we 
> can still use the more efficient join types on UDT columns.
> Besides except, this also fixes other similar scenarios. In summary, this fixes:
> * except on two Datasets with UDT columns
> * intersect on two Datasets with UDT columns
> * joins whose join conditions use <=> on UDT columns


