[ 
https://issues.apache.org/jira/browse/SPARK-18582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702690#comment-15702690
 ] 

Nattavut Sutyanyong commented on SPARK-18582:
---------------------------------------------

For {{Window}} operator, I propose to place it in the last category noted in my 
previous comment. Here is an example showing it may return incorrect results.

Seq((1,1)).toDF("c1","c2").createOrReplaceTempView("t1")
Seq((1,1),(2,0)).toDF("c1","c2").createOrReplaceTempView("t2")

sql("select * from t1 where c1 in (select sum(c1) over () from t2 where 
t1.c2=t2.c2)")

When pulling up the correlated predicate through the {{Window}} operator, it 
does not add the column T2.C2 into the PARTITION clause of the {{Window}} 
operator and return no row in the above example. The correct result is the row 
of (1, 1).

> Whitelist LogicalPlan operators allowed in correlated subqueries
> ----------------------------------------------------------------
>
>                 Key: SPARK-18582
>                 URL: https://issues.apache.org/jira/browse/SPARK-18582
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Nattavut Sutyanyong
>
> We want to tighten the code that handles correlated subquery to whitelist 
> operators that are allowed in it.
> The current code in {{def pullOutCorrelatedPredicates}} looks like
> {code}
>       // Simplify the predicates before pulling them out.
>       val transformed = BooleanSimplification(sub) transformUp {
>         case f @ Filter(cond, child) => ...
>         case p @ Project(expressions, child) => ...
>         case a @ Aggregate(grouping, expressions, child) => ...
>         case w : Window => ...
>         case j @ Join(left, _, RightOuter, _) => ...
>         case j @ Join(left, right, FullOuter, _) => ...
>         case j @ Join(_, right, jt, _) if !jt.isInstanceOf[InnerLike] => ...
>         case u: Union => ...
>         case s: SetOperation => ...
>         case e: Expand => ...
>         case l : LocalLimit => ...
>         case g : GlobalLimit => ...
>         case s : Sample => ...
>         case p =>
>           failOnOuterReference(p)
>           ...
>       }
> {code}
> The code disallows operators in a sub plan of an operator hosting correlation 
> on a case by case basis. As it is today, it only blocks {{Union}}, 
> {{Intersect}}, {{Except}}, {{Expand}} {{LocalLimit}} {{GlobalLimit}} 
> {{Sample}} {{FullOuter}} and right table of {{LeftOuter}} (and left table of 
> {{RightOuter}}). That means any {{LogicalPlan}} operators that are not in the 
> list above are permitted to be under a correlation point. Is this risky? 
> There are many (30+ at least from browsing the {{LogicalPlan}} type 
> hierarchy) operators derived from {{LogicalPlan}} class.
> For the case of {{ScalarSubquery}}, it explicitly checks that only 
> {{SubqueryAlias}} {{Project}} {{Filter}} {{Aggregate}} are allowed 
> ({{CheckAnalysis.scala}} around line 126-165 in and after {{def 
> cleanQuery}}). We should whitelist which operators are allowed in correlated 
> subqueries. At my first glance, we should allow, in addition to the ones 
> allowed in {{ScalarSubquery}}: {{Join}}, {{Distinct}}, {{Sort}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to