[ https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15265000#comment-15265000 ]
Frederick Reiss commented on SPARK-14781:
-----------------------------------------
Yeah, Distinct will impact performance for the uncorrelated case if the
subquery returns more than a few million rows. That problem won't occur in the
particular case of TPC-DS query 45 (the subquery there returns at most 500k
rows at a 100TB scale factor), but you never know. And of course a Distinct
after the join, as one would need to cover EXISTS, would see potentially
billions of rows. I just figured I'd mention that possibility as an expedient
that doesn't require any additional operators.
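
To illustrate the Distinct expedient for the uncorrelated case, here is a minimal sketch in plain Python, not Spark code. The helper `left_outer_join` and the sample data are hypothetical. It shows why the subquery result has to be de-duplicated before an ordinary left outer join: otherwise an outer row is repeated once per matching inner row.

```python
def left_outer_join(outer, inner_keys):
    """Naive left outer join on one key (hypothetical helper, not a Spark API).

    Emits one output row per matching inner key, or a single row with a
    None flag when the outer row has no match.
    """
    out = []
    for key, payload in outer:
        matches = [k for k in inner_keys if k == key]
        if matches:
            out.extend((key, payload, True) for _ in matches)
        else:
            out.append((key, payload, None))
    return out


outer = [(1, "a"), (2, "b"), (3, "c")]
subquery = [1, 1, 1, 3]  # the subquery result contains duplicates

# Without Distinct, outer row 1 is emitted three times.
without_distinct = left_outer_join(outer, subquery)

# With Distinct applied to the subquery first, each outer row appears once.
with_distinct = left_outer_join(outer, sorted(set(subquery)))
```

The cost concern above corresponds to the `set(subquery)` step: it is cheap for a few hundred thousand rows, but not for a subquery result in the millions, and far worse if the de-duplication has to run on the join output instead.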
I'd be up to adding a "LeftSemiPlus" mode to the various join operators if
you'd prefer for implementation to start with that step. The new behavior is
almost the same as the existing LeftSemi mode: one additional output column in
the schema, plus code to emit rows with a null value when nothing on the inner
matches an outer tuple.
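
As a rough sketch of the proposed semantics, again in plain Python rather than Spark's join operators: the function `left_semi_plus`, the table contents, and the column layout are assumptions made for illustration only. Each outer row is emitted exactly once with one extra column, which is True when some inner row matches and None otherwise. Two such joins can then be chained so that a predicate like EXISTS(x1) OR EXISTS(x2) becomes a filter over the two extra columns.

```python
def left_semi_plus(outer, inner, condition):
    """Hypothetical "LeftSemiPlus" join over lists of tuples.

    Every outer row is emitted once, extended by one column:
    True if any inner row satisfies the condition, None otherwise.
    """
    result = []
    for row in outer:
        matched = any(condition(row, other) for other in inner)
        result.append(row + (True if matched else None,))
    return result


orders = [(1, "east"), (2, "west"), (3, "north")]
x1 = [1, 1]  # keys produced by the first subquery (duplicates are harmless)
x2 = [3]     # keys produced by the second subquery

# One extra column per subquery; the outer row count never changes.
step1 = left_semi_plus(orders, x1, lambda o, k: o[0] == k)
step2 = left_semi_plus(step1, x2, lambda o, k: o[0] == k)

# EXISTS(x1) OR EXISTS(x2) becomes a plain filter on the two flag columns.
selected = [row[:2] for row in step2 if row[2] or row[3]]
```

Because duplicates on the inner side cannot multiply outer rows here, no Distinct is needed before or after the join.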
> Support subquery in nested predicates
> -------------------------------------
>
> Key: SPARK-14781
> URL: https://issues.apache.org/jira/browse/SPARK-14781
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Davies Liu
>
> Right now, we do not support nested IN/EXISTS subqueries, for example
> EXISTS( x1) OR EXISTS( x2)
> In order to do that, we could use an internal-only join type SemiPlus, which
> will output every row from the left, plus an additional column holding the
> result of the join condition. Then we could replace the EXISTS() or IN() by
> the result column.