[
https://issues.apache.org/jira/browse/SPARK-18582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702684#comment-15702684
]
Nattavut Sutyanyong commented on SPARK-18582:
---------------------------------------------
We shall classify the operators into 4 categories:
# Operators that are allowed anywhere in a correlated subquery
and, by definition of the operators, they cannot host outer references.
# Operators that are allowed anywhere in a correlated subquery
so long as they do not host outer references.
# Operators that need special treatment. These operators are
Project, Filter, Join, Aggregate, and, Generate.
Any operators that are not in the above list are allowed in a correlated
subquery only if they are not on a correlation path. In other word, these
operators are allowed only under a correlation point.
Note that a correlation path is defined as the sub-tree of all the operators
that are on the path from the operator hosting the correlated expressions up to
the operator producing the correlated values.
> Whitelist LogicalPlan operators allowed in correlated subqueries
> ----------------------------------------------------------------
>
> Key: SPARK-18582
> URL: https://issues.apache.org/jira/browse/SPARK-18582
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Nattavut Sutyanyong
>
> We want to tighten the code that handles correlated subquery to whitelist
> operators that are allowed in it.
> The current code in {{def pullOutCorrelatedPredicates}} looks like
> {code}
> // Simplify the predicates before pulling them out.
> val transformed = BooleanSimplification(sub) transformUp {
> case f @ Filter(cond, child) => ...
> case p @ Project(expressions, child) => ...
> case a @ Aggregate(grouping, expressions, child) => ...
> case w : Window => ...
> case j @ Join(left, _, RightOuter, _) => ...
> case j @ Join(left, right, FullOuter, _) => ...
> case j @ Join(_, right, jt, _) if !jt.isInstanceOf[InnerLike] => ...
> case u: Union => ...
> case s: SetOperation => ...
> case e: Expand => ...
> case l : LocalLimit => ...
> case g : GlobalLimit => ...
> case s : Sample => ...
> case p =>
> failOnOuterReference(p)
> ...
> }
> {code}
> The code disallows operators in a sub plan of an operator hosting correlation
> on a case by case basis. As it is today, it only blocks {{Union}},
> {{Intersect}}, {{Except}}, {{Expand}} {{LocalLimit}} {{GlobalLimit}}
> {{Sample}} {{FullOuter}} and right table of {{LeftOuter}} (and left table of
> {{RightOuter}}). That means any {{LogicalPlan}} operators that are not in the
> list above are permitted to be under a correlation point. Is this risky?
> There are many (30+ at least from browsing the {{LogicalPlan}} type
> hierarchy) operators derived from {{LogicalPlan}} class.
> For the case of {{ScalarSubquery}}, it explicitly checks that only
> {{SubqueryAlias}} {{Project}} {{Filter}} {{Aggregate}} are allowed
> ({{CheckAnalysis.scala}} around line 126-165 in and after {{def
> cleanQuery}}). We should whitelist which operators are allowed in correlated
> subqueries. At my first glance, we should allow, in addition to the ones
> allowed in {{ScalarSubquery}}: {{Join}}, {{Distinct}}, {{Sort}}.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]