[
https://issues.apache.org/jira/browse/SPARK-17337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15452415#comment-15452415
]
Nattavut Sutyanyong commented on SPARK-17337:
---------------------------------------------
This problem originated from the Case 1 documented in SPARK-16951. The script
illustrates the problem
{noformat}
Seq(1,2).toDF("c1").createOrReplaceTempView("t1")
Seq(1).toDF("c2").createOrReplaceTempView("t2")
scala> sql("select * from (select t2.c2+1 as c3 from t1 left join t2 on
t1.c1=t2.c2) t3 where c3 not in (select c2 from t2)").show
+----+
| c3|
+----+
| 2|
|null|
+----+
{noformat}
The correct answer is 1 row of (2). From the plan below, the incorrect portion
of the plan is the LeftAnti, rewritten from the NOT IN subquery, is pushed down
below the (T1 LOJ T2) operation. Because LeftAnti predicate is evaluated to
unknown (or null) if any argument of a comparison operator in the predicate is
null, e.g. NULL = <value> is evaluated to unknown (which is equivalent to false
in the context of a predicate), the LeftAnti predicate cannot be pushed down
into a LOJ operation.
{noformat}
scala> sql("select * from (select t2.c2+1 as c3 from t1 left join t2 on
t1.c1=t2.c2) t3 where c3 not in (select c2 from t2)").explain(true)
== Parsed Logical Plan ==
'Project [*]
+- 'Filter NOT 'c3 IN (list#124)
: +- 'SubqueryAlias list#124
: +- 'Project ['c2]
: +- 'UnresolvedRelation `t2`
+- 'SubqueryAlias t3
+- 'Project [('t2.c2 + 1) AS c3#123]
+- 'Join LeftOuter, ('t1.c1 = 't2.c2)
:- 'UnresolvedRelation `t1`
+- 'UnresolvedRelation `t2`
== Analyzed Logical Plan ==
c3: int
Project [c3#123]
+- Filter NOT predicate-subquery#124 [(c3#123 = c2#77)]
: +- SubqueryAlias predicate-subquery#124 [(c3#123 = c2#77)]
: +- Project [c2#77]
: +- SubqueryAlias t2
: +- Project [value#75 AS c2#77]
: +- LocalRelation [value#75]
+- SubqueryAlias t3
+- Project [(c2#77 + 1) AS c3#123]
+- Join LeftOuter, (c1#3 = c2#77)
:- SubqueryAlias t1
: +- Project [value#1 AS c1#3]
: +- LocalRelation [value#1]
+- SubqueryAlias t2
+- Project [value#75 AS c2#77]
+- LocalRelation [value#75]
== Optimized Logical Plan ==
Project [(c2#77 + 1) AS c3#123]
+- Join LeftOuter, (c1#3 = c2#77)
:- Project [value#1 AS c1#3]
: +- Join LeftAnti, (isnull(((c2#77 + 1) = c2#77)) || ((c2#77 + 1) = c2#77))
: :- LocalRelation [value#1]
: +- LocalRelation [c2#77]
+- LocalRelation [c2#77]
{noformat}
> Incomplete algorithm for name resolution in Catalyst paser may lead to
> incorrect result
> ---------------------------------------------------------------------------------------
>
> Key: SPARK-17337
> URL: https://issues.apache.org/jira/browse/SPARK-17337
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Nattavut Sutyanyong
>
> While investigating SPARK-16951, I found an incorrect results case from a NOT
> IN subquery. I thought originally it is an edge case. Further investigation
> found this is a more general problem.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]