[
https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264789#comment-15264789
]
Frederick Reiss commented on SPARK-14781:
-----------------------------------------
[~davies]: I think I have a minimally-invasive plan for covering Q45.
*Existing code:* {{RewritePredicateSubquery.apply}} applies to conjunctions of
predicates in the WHERE clause. When this rule finds an IN predicate with a
subquery, the rule rewrites the IN predicate into a join.
*Proposed change:* Modify {{RewritePredicateSubquery}} so that it also detects
disjunctions (ORs) where exactly one child of the disjunction is an IN
predicate with a non-correlated subquery. Rewrite each such disjunction into a
left outer join, followed by a Filter. The inner (right) operand of the left
outer join should be the subquery with an additional Distinct operator above
it. The Filter will apply the remaining predicates from the disjunction to any
tuples that did not join with the subquery.
*Notes:*
The Distinct here is needed because the in-list could contain duplicates. The
Distinct could be eliminated if there was a join operator that combined the
behavior of LeftOuter and LeftSemijoin. I suppose that's what SemiPlus will do?
This approach could be extended to cover correlated IN/EXISTS subqueries. The
rewrite would need to add unique IDs to the outer query's tuples before the
join + filter, then remove duplicates after the join + filter. I'm *not*
planning to do this extension in the first pass.
The approach could also be extended to cover multiple subqueries inside a
disjunction by chaining together multiple outer joins. I'm *not* planning to do
this extension in the first pass.
*Questions:*
* Do you foresee any problems with this approach?
* There is a second version of the IN/EXISTS subquery rewrite logic in PR
#12720, but that code hasn't been merged yet. Would you prefer a diff against
the current head; or a diff against the logic in PR 12720?
> Support subquery in nested predicates
> -------------------------------------
>
> Key: SPARK-14781
> URL: https://issues.apache.org/jira/browse/SPARK-14781
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Davies Liu
>
> Right now, we does not support nested IN/EXISTS subquery, for example
> EXISTS( x1) OR EXISTS( x2)
> In order to do that, we could use an internal-only join type SemiPlus, which
> will output every row from left, plus additional column as the result of join
> condition. Then we could replace the EXISTS() or IN() by the result column.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]