[ 
https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264866#comment-15264866
 ] 

Davies Liu commented on SPARK-14781:
------------------------------------

Distinct is slow; it is better not to use it.

The LeftSemiPlus I have in mind is very close to LeftSemi, except that:
1) it emits every row from the left side exactly once
2) each row has an additional column, which holds the result of the join 
condition (it's nullable)
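
The two properties above can be sketched in plain Python. This is only an 
illustration of the intended semantics, not Spark code: the names 
left_semi_plus, sql_eq and the "exists" column are hypothetical, rows are 
modelled as dicts, and None stands in for SQL NULL.

```python
from typing import Callable, Dict, Iterable, List, Optional

Row = Dict[str, object]


def sql_eq(a, b) -> Optional[bool]:
    """SQL-style equality: comparing with NULL (None) yields NULL (None)."""
    if a is None or b is None:
        return None
    return a == b


def left_semi_plus(
    left: Iterable[Row],
    right: Iterable[Row],
    condition: Callable[[Row, Row], Optional[bool]],
    flag: str = "exists",  # hypothetical name for the additional column
) -> List[Row]:
    """Emit every left row exactly once, plus a nullable boolean column."""
    right_rows = list(right)
    out = []
    for l in left:
        result: Optional[bool] = False
        for r in right_rows:
            c = condition(l, r)
            if c is True:
                result = True  # a definite match; stop scanning the right side
                break
            if c is None:
                result = None  # unknown so far; a later row may still match
        out.append({**l, flag: result})
    return out


left = [{"id": 1}, {"id": 2}, {"id": None}]
right = [{"k": 1}, {"k": 1}, {"k": None}]  # duplicate match and a NULL key
rows = left_semi_plus(left, right, lambda l, r: sql_eq(l["id"], r["k"]))
no_nulls = left_semi_plus([{"id": 2}], [{"k": 1}],
                          lambda l, r: sql_eq(l["id"], r["k"]))
```

Even though the right side contains a duplicate match, each left row appears 
once, so no Distinct is needed afterwards.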

For any IN/EXISTS predicate, we do a LeftSemiPlus join on its child, then 
replace the predicate with the additional attribute. (Because LeftSemiPlus is 
not as efficient as LeftSemi or LeftAnti, we may only do this when the 
predicate is not a top-level conjunction.)
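
As a rough sketch of that rewrite, the snippet below evaluates 
EXISTS(x1) OR EXISTS(x2) by joining twice and then filtering on the two 
added columns. Everything here is an assumption made for illustration: the 
helper left_semi_plus, the column names exists1/exists2, and the sample 
tables orders, x1 and x2 do not come from Spark.

```python
def left_semi_plus(left, right, condition, flag):
    """Hypothetical helper: each left row once, plus a nullable match flag."""
    right_rows = list(right)
    out = []
    for l in left:
        results = [condition(l, r) for r in right_rows]
        if any(c is True for c in results):
            matched = True
        elif any(c is None for c in results):
            matched = None  # no definite match, but at least one unknown
        else:
            matched = False
        out.append({**l, flag: matched})
    return out


def sql_or(a, b):
    """Three-valued OR, with None standing in for SQL NULL."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


orders = [{"id": 1}, {"id": 2}, {"id": 3}]
x1 = [{"order_id": 1}]
x2 = [{"order_id": 3}, {"order_id": 3}]


def on_id(l, r):
    return l["id"] == r["order_id"]


# One LeftSemiPlus join per subquery; each adds one flag column.
step1 = left_semi_plus(orders, x1, on_id, "exists1")
step2 = left_semi_plus(step1, x2, on_id, "exists2")

# The predicate EXISTS(x1) OR EXISTS(x2) becomes exists1 OR exists2.
kept = [{"id": row["id"]} for row in step2
        if sql_or(row["exists1"], row["exists2"]) is True]
```

The filter keeps a row only when the OR evaluates to true, which mirrors how 
a WHERE clause drops rows whose predicate is false or NULL.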

When we create the logical Join with LeftSemiPlus, we could create this 
additional attribute and pass it around in the optimizer and planner, because 
it will be used by other operators.
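
One way to picture this is a toy plan node that creates the attribute when 
the join is constructed and exposes it in its output. This is a simplified 
model, not Catalyst: the classes Attribute, Relation and Join, and the 
fields expr_id and exists, are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List

_ids = count()  # hypothetical source of unique attribute ids


@dataclass(frozen=True)
class Attribute:
    name: str
    nullable: bool
    expr_id: int = field(default_factory=lambda: next(_ids))


@dataclass
class Relation:
    output: List[Attribute]


@dataclass
class Join:
    left: Relation
    right: Relation
    join_type: str
    # Created once with the join, so later operators can refer to it.
    exists: Attribute = field(
        default_factory=lambda: Attribute("exists", nullable=True))

    @property
    def output(self) -> List[Attribute]:
        if self.join_type == "LeftSemiPlus":
            return self.left.output + [self.exists]
        if self.join_type in ("LeftSemi", "LeftAnti"):
            return list(self.left.output)
        return self.left.output + self.right.output


left = Relation([Attribute("id", nullable=False)])
right = Relation([Attribute("order_id", nullable=False)])
join = Join(left, right, "LeftSemiPlus")
names = [a.name for a in join.output]
```

Because the attribute object is stored on the node, a filter placed above 
the join can reference join.exists directly instead of the original 
IN/EXISTS predicate.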

We should support LeftSemiPlus in all four join implementations.

> Support subquery in nested predicates
> -------------------------------------
>
>                 Key: SPARK-14781
>                 URL: https://issues.apache.org/jira/browse/SPARK-14781
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Davies Liu
>
> Right now, we do not support nested IN/EXISTS subqueries, for example 
> EXISTS(x1) OR EXISTS(x2)
> In order to do that, we could use an internal-only join type SemiPlus, which 
> will output every row from the left, plus an additional column holding the 
> result of the join condition. Then we could replace the EXISTS() or IN() 
> with the result column.


