[
https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264866#comment-15264866
]
Davies Liu commented on SPARK-14781:
------------------------------------
Distinct is slow, so it is better not to use it.
The LeftSemiPlus I have in mind is very close to LeftSemi, but
1) it emits every row from the left side exactly once
2) each row has an additional column, which is the result of the join condition
(it is nullable)
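
To make the two properties concrete, here is a minimal sketch in plain Python (not Spark code). The function name left_semi_plus, the column name "exists", and the way NULL is folded in (true if any right row matches, otherwise NULL if any comparison was NULL, otherwise false) are assumptions made for illustration; the comment above only says that the extra column is nullable.

```python
def left_semi_plus(left, right, condition, flag="exists"):
    """Emit each left row exactly once, plus one extra nullable column.

    `condition(l, r)` returns True, False, or None (SQL NULL).
    The NULL handling here is an assumed three-valued OR over all right rows.
    """
    right_rows = list(right)
    out = []
    for l in left:
        result = False
        for r in right_rows:
            c = condition(l, r)
            if c is True:
                result = True   # one match is enough; stop scanning
                break
            if c is None:
                result = None   # unknown unless a later row matches
        out.append({**l, flag: result})
    return out


def eq(l, r):
    # Equality with SQL-like NULL propagation.
    if l["id"] is None or r["k"] is None:
        return None
    return l["id"] == r["k"]


left = [{"id": 1}, {"id": 2}, {"id": None}]
right = [{"k": 1}, {"k": 1}]  # duplicate match on purpose
rows = left_semi_plus(left, right, eq)
for row in rows:
    print(row)
```

Note that the left row with id 1 appears once even though two right rows match it, which is why no Distinct is needed afterwards.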
For any IN/EXISTS predicate, we do a LeftSemiPlus join on its child, then
replace the predicate with the additional attribute. (Because LeftSemiPlus is
not as efficient as LeftSemi or LeftAnti, we may only do this when the
predicate is not a top-level conjunct.)
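
The rewrite can be sketched for the EXISTS(x1) OR EXISTS(x2) case from the issue description. This is again plain Python rather than the optimizer rule itself; the helper semi_plus_exists, the tables orders/x1/x2, and the flag names e1/e2 are hypothetical, and NULL handling is left out to keep the example short.

```python
def semi_plus_exists(left, right, condition, flag):
    # Simplified LeftSemiPlus for EXISTS: every left row is kept exactly once,
    # and `flag` records whether any right row satisfies the condition.
    return [{**l, flag: any(condition(l, r) for r in right)} for l in left]


orders = [{"id": 1}, {"id": 2}, {"id": 3}]
x1 = [{"order_id": 1}]
x2 = [{"order_id": 3}, {"order_id": 3}]


def matches(l, r):
    return l["id"] == r["order_id"]


# One LeftSemiPlus join per subquery, each adding its own attribute.
step1 = semi_plus_exists(orders, x1, matches, "e1")
step2 = semi_plus_exists(step1, x2, matches, "e2")

# The original predicate EXISTS(x1) OR EXISTS(x2) becomes e1 OR e2,
# and the helper columns are projected away afterwards.
kept = [{"id": row["id"]} for row in step2 if row["e1"] or row["e2"]]
print(kept)
```

A plain LeftSemi join could not be used here, because it would drop the rows that fail the first EXISTS before the second one is evaluated.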
When we create the logical Join with LeftSemiPlus, we could create this
additional attribute and pass it around in the optimizer and planner, because
it will be used by other operators.
We should support LeftSemiPlus in all four join implementations.
> Support subquery in nested predicates
> -------------------------------------
>
> Key: SPARK-14781
> URL: https://issues.apache.org/jira/browse/SPARK-14781
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Davies Liu
>
> Right now, we do not support nested IN/EXISTS subqueries, for example
> EXISTS(x1) OR EXISTS(x2)
> In order to do that, we could use an internal-only join type SemiPlus, which
> will output every row from the left, plus an additional column holding the
> result of the join condition. Then we could replace the EXISTS() or IN() with
> the result column.