[
https://issues.apache.org/jira/browse/SPARK-20312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Takeshi Yamamuro closed SPARK-20312.
------------------------------------
Resolution: Fixed
> query optimizer calls udf with null values when it doesn't expect them
> ----------------------------------------------------------------------
>
> Key: SPARK-20312
> URL: https://issues.apache.org/jira/browse/SPARK-20312
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Albert Meltzer
> Fix For: 2.1.1, 2.2.0, 2.3.0
>
>
> When optimizing an outer join, Spark passes an all-null row to both sides to
> see whether nulls would be ignored (side note: for left/right outer joins it
> subsequently ignores the assessment on the preserved side).
> For some reason, a condition such as {{xx IS NOT NULL && udf(xx) IS NOT
> NULL}} may evaluate the right side first, which throws an exception when the
> udf does not expect a null input (it assumes the left side is checked first).
> An example SIMILAR to the following (see actual query plans separately):
> {noformat}
> // returns AnyVal, which likely causes an IS NOT NULL filter to be
> // added on the result
> def func(value: Any): Int = ...
>
> val df1 = sparkSession
>   .table(...)
>   .select("col1", "col2") // both LongType
>
> val df11 = df1
>   .filter(df1("col1").isNotNull)
>   .withColumn("col3", functions.udf(func)(df1("col1")))
>   .repartition(df1("col2"))
>   .sortWithinPartitions(df1("col2"))
>
> // load other data containing col2, similarly repartitioned and sorted
> val df2 = ...
>
> val df3 = df11.join(df2, Seq("col2"), "left_outer")
> {noformat}
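> A common workaround, until the fix landed, was to make the udf itself
> null-tolerant, so the optimizer's all-null probe row cannot throw no matter
> which conjunct is evaluated first. A minimal sketch in plain Scala (no Spark
> needed; {{func}} and the sentinel value are stand-ins, not from this report):
>
> ```scala
> // Stand-in for the reporter's udf: throws NPE when given null,
> // mimicking a udf that "doesn't expect a null input".
> def func(value: Any): Int = value.toString.length
>
> // Null-tolerant wrapper: guard inside the udf rather than relying on
> // the ordering of the IS NOT NULL filter in the query plan.
> def nullSafeFunc(value: Any): Int =
>   if (value == null) -1 // sentinel for null; the choice is illustrative
>   else func(value)
> ```
>
> Registering {{nullSafeFunc}} instead of {{func}} keeps the query correct
> because the rows it fires on spuriously are already excluded by the
> {{isNotNull}} filter.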
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)