[
https://issues.apache.org/jira/browse/SPARK-20359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Apache Spark reassigned SPARK-20359:
------------------------------------
Assignee: (was: Apache Spark)
> Catalyst EliminateOuterJoin optimization can cause NPE
> ------------------------------------------------------
>
> Key: SPARK-20359
> URL: https://issues.apache.org/jira/browse/SPARK-20359
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Environment: spark master at commit
> 35e5ae4f81176af52569c465520a703529893b50 (Sun Apr 16)
> Reporter: koert kuipers
> Fix For: 2.2.0
>
>
> We were running into an NPE in one of our UDFs for Spark SQL.
>
> This particular function indeed could not handle nulls, but that was by
> design, since null input was never allowed (and we would want it to blow up
> if a null ever came in).
> We realized the issue was not in our data: after we added filters for nulls,
> the NPE still happened. We then also saw the NPE when merely calling
> dataframe.explain instead of running our job.
> It turns out the issue is in EliminateOuterJoin.canFilterOutNull, where a row
> of all nulls is fed into the expression as a test. It's the line:
> val v = boundE.eval(emptyRow)
> I believe it is a bug to assume the expression can always handle nulls.
> For example, this fails:
> {noformat}
> import org.apache.spark.sql.functions.udf
> import spark.implicits._  // assumes a SparkSession named spark, as in spark-shell
>
> val df1 = Seq("a", "b", "c").toDF("x")
>   .withColumn("y", udf { (x: String) => x.substring(0, 1) + "!" }.apply($"x"))
> val df2 = Seq("a", "b").toDF("x1")
> df1
>   .join(df2, df1("x") === df2("x1"), "left_outer")
>   .filter($"x1".isNotNull || !$"y".isin("a!"))
>   .count
> {noformat}
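> The core of the problem can be sketched without Spark at all (a minimal
> illustration; survivesNullProbe is a hypothetical stand-in for the
> optimizer's eval(emptyRow) probe, not a Spark API):
> {noformat}
> // The null-intolerant UDF body from the example above.
> val f: String => String = x => x.substring(0, 1) + "!"
>
> // Simulates canFilterOutNull feeding an all-null row into the predicate.
> def survivesNullProbe(g: String => String): Boolean =
>   try { g(null); true } catch { case _: NullPointerException => false }
>
> assert(!survivesNullProbe(f))  // the probe hits the NPE
>
> // A null-tolerant variant survives the same probe.
> val safe: String => String =
>   x => if (x == null) null else x.substring(0, 1) + "!"
> assert(survivesNullProbe(safe))
> {noformat}
> Guarding the UDF body against null input like this is one possible
> workaround, but the underlying assumption in the optimizer still seems
> wrong.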
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]