[ 
https://issues.apache.org/jira/browse/SPARK-32290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169928#comment-17169928
 ] 

Apache Spark commented on SPARK-32290:
--------------------------------------

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29335

> NotInSubquery SingleColumn Optimize
> -----------------------------------
>
>                 Key: SPARK-32290
>                 URL: https://issues.apache.org/jira/browse/SPARK-32290
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Leanken.Lin
>            Assignee: Leanken.Lin
>            Priority: Minor
>             Fix For: 3.1.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Normally,
> A NotInSubquery will plan into BroadcastNestedLoopJoinExec, which is very 
> very time consuming. For example, I've done TPCH benchmark lately, Query 16 
> almost took half of the entire TPCH 22Query execution Time. So i proposed 
> that to do the following optimize.
> Inside BroadcastNestedLoopJoinExec, we can identify not in subquery with only 
> single column in following pattern.
> {code:java}
> case _@Or(
>             _@EqualTo(leftAttr: AttributeReference, rightAttr: 
> AttributeReference),
>             _@IsNull(
>               _@EqualTo(_: AttributeReference, _: AttributeReference)
>             )
>           )
> {code}
> if buildSide rows is small enough, we can change build side data into a 
> HashMap.
> so the M*N calculation can be optimized into M*log(N)
> I've done a benchmark job in 1TB TPCH, before apply the optimize
> Query 16 take around 18 mins to finish, after apply the M*log(N) optimize, it 
> takes only 30s to finish. 
> But this optimize only works on single column not in subquery, so i am here 
> to seek advise whether the community need this update or not. I will do the 
> pull request first, if the community member thought it's hack, it's fine to 
> just ignore this request.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to