koert kuipers created SPARK-20359:
-------------------------------------

             Summary: Catalyst EliminateOuterJoin optimization can cause NPE
                 Key: SPARK-20359
                 URL: https://issues.apache.org/jira/browse/SPARK-20359
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.1.0
         Environment: spark master at commit 
35e5ae4f81176af52569c465520a703529893b50 (Sun Apr 16)
            Reporter: koert kuipers
             Fix For: 2.2.0


We were running into an NPE in one of our UDFs for Spark SQL.

This particular function indeed could not handle nulls, but that was by design: null input was never allowed, and we wanted it to blow up if a null ever reached it.

We realized the issue was not in our data when we added filters for nulls and the NPE still happened. We then saw the NPE even when just calling dataframe.explain instead of running the job.

It turns out the issue is in EliminateOuterJoin.canFilterOutNull, where a row of all nulls is fed into the expression as a test. It is this line:
{noformat}
val v = boundE.eval(emptyRow)
{noformat}

I believe it is a bug to assume the expression can always handle nulls.
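The failure mode can be sketched without Spark at all. Below, `udfBody` is a hypothetical stand-in for our UDF; the point is just that evaluating a null-intolerant function against the optimizer's all-null probe row throws instead of returning null:

{noformat}
// Hypothetical stand-in for our null-intolerant UDF body.
val udfBody: String => String = (x: String) => x.substring(0, 1) + "!"

// canFilterOutNull effectively does boundE.eval(emptyRow), i.e. it
// evaluates the expression with every input column set to null:
val probe: String = null
val result =
  try Right(udfBody(probe))
  catch { case e: NullPointerException => Left(e) }

// The optimizer assumes a null input yields a (safe) null result,
// but here evaluation itself throws.
assert(result.isLeft)
{noformat}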

for example this fails:
{noformat}
val df1 = Seq("a", "b", "c").toDF("x")
  .withColumn("y", udf{ (x: String) => x.substring(0, 1) + "!" }.apply($"x"))
val df2 = Seq("a", "b").toDF("x1")
df1
  .join(df2, df1("x") === df2("x1"), "left_outer")
  .filter($"x1".isNotNull || !$"y".isin("a!"))
  .count
{noformat}
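As a workaround on our side (a sketch, not a confirmed fix for the optimizer), the UDF body can be made tolerant of the optimizer's all-null probe row by returning null on null input, at the cost of no longer blowing up on unexpected nulls in real data:

{noformat}
// Hypothetical null-tolerant wrapper around the original UDF body.
val original: String => String = (x: String) => x.substring(0, 1) + "!"
val nullTolerant: String => String = x => if (x == null) null else original(x)

assert(nullTolerant(null) == null)   // survives the optimizer's null probe
assert(nullTolerant("abc") == "a!")  // unchanged on real data
{noformat}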

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
