[
https://issues.apache.org/jira/browse/SPARK-32290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yuming Wang updated SPARK-32290:
--------------------------------
Target Version/s: (was: 3.0.0)
> NotInSubquery SingleColumn Optimize
> -----------------------------------
>
> Key: SPARK-32290
> URL: https://issues.apache.org/jira/browse/SPARK-32290
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Leanken.Lin
> Assignee: Leanken.Lin
> Priority: Minor
> Fix For: 3.1.0
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Normally,
> A NotInSubquery will plan into BroadcastNestedLoopJoinExec, which is very
> very time consuming. For example, I've done TPCH benchmark lately, Query 16
> almost took half of the entire TPCH 22Query execution Time. So i proposed
> that to do the following optimize.
> Inside BroadcastNestedLoopJoinExec, we can identify not in subquery with only
> single column in following pattern.
> {code:java}
> case _@Or(
> _@EqualTo(leftAttr: AttributeReference, rightAttr:
> AttributeReference),
> _@IsNull(
> _@EqualTo(_: AttributeReference, _: AttributeReference)
> )
> )
> {code}
> if buildSide rows is small enough, we can change build side data into a
> HashMap.
> so the M*N calculation can be optimized into M*log(N)
> I've done a benchmark job in 1TB TPCH, before apply the optimize
> Query 16 take around 18 mins to finish, after apply the M*log(N) optimize, it
> takes only 30s to finish.
> But this optimize only works on single column not in subquery, so i am here
> to seek advise whether the community need this update or not. I will do the
> pull request first, if the community member thought it's hack, it's fine to
> just ignore this request.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]