leanken opened a new pull request #29104:
URL: https://github.com/apache/spark/pull/29104
### What changes were proposed in this pull request?
Normally, a NotInSubquery is planned as a BroadcastNestedLoopJoin, which
is very time consuming, for instance in TPCH Query 16.
```
select
    p_brand,
    p_type,
    p_size,
    count(distinct ps_suppkey) as supplier_cnt
from
    partsupp,
    part
where
    p_partkey = ps_partkey
    and p_brand <> 'Brand#45'
    and p_type not like 'MEDIUM POLISHED%'
    and p_size in (49, 14, 23, 45, 19, 3, 36, 9)
    and ps_suppkey not in (
        select
            s_suppkey
        from
            supplier
        where
            s_comment like '%Customer%Complaints%'
    )
group by
    p_brand,
    p_type,
    p_size
order by
    supplier_cnt desc,
    p_brand,
    p_type,
    p_size
```
The NOT IN subquery above is planned as
LeftAnti
condition Or((ps_suppkey=s_suppkey), IsNull(ps_suppkey=s_suppkey))
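
To make the semantics of that condition concrete, here is a minimal Python sketch of a left anti join evaluated as a nested loop under SQL three-valued logic. It is an illustration only, not Spark's code: `eq3` and `anti_join_nested_loop` are hypothetical names, and `None` stands in for SQL NULL.

```python
def eq3(a, b):
    """SQL-style equality: the result is NULL (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b


def anti_join_nested_loop(streamed, build):
    """Keep a streamed key only if no build key satisfies
    Or(a = b, IsNull(a = b)). Compares every streamed key with build keys,
    so the worst case is len(streamed) * len(build) comparisons."""
    result = []
    for s in streamed:
        matched = False
        for b in build:
            cmp = eq3(s, b)
            # Or(cmp, IsNull(cmp)) is true when cmp is TRUE or NULL.
            if cmp is None or cmp:
                matched = True
                break
        if not matched:
            result.append(s)
    return result
```

A NULL on either side makes the condition true, so a single NULL in the subquery result removes every streamed row, and a NULL streamed key survives only when the subquery result is empty.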
BroadcastNestedLoopJoinExec evaluates this condition with M\*N comparisons.
If the buildSide is small enough, we can always convert it into a HashSet, so
that the streamedSide only needs to look up each key in the HashSet, which
reduces the cost to M\*log(N).
This optimization only targets the single-column NotInSubquery case.
After applying this patch, the run time of TPCH Query 16 drops from 41 minutes
to 30 seconds.
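
The HashSet idea can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the code in this patch: `anti_join_hash_set` is a hypothetical name, keys are single values, and `None` stands in for SQL NULL.

```python
def anti_join_hash_set(streamed, build):
    """Single-column NOT IN evaluated with one set probe per streamed key."""
    build_list = list(build)

    # Empty subquery result: NOT IN is true for every row, NULL keys included.
    if not build_list:
        return list(streamed)

    # Build the set once and remember whether the build side holds a NULL.
    build_set = set()
    build_has_null = False
    for b in build_list:
        if b is None:
            build_has_null = True
        else:
            build_set.add(b)

    # A NULL on the build side makes the anti-join condition true for all rows.
    if build_has_null:
        return []

    # A NULL streamed key compares as NULL against a non-empty build side,
    # so it is dropped; other keys survive only if absent from the set.
    return [s for s in streamed if s is not None and s not in build_set]
```

The three special cases (empty build side, NULL on the build side, NULL streamed key) are what keep the set lookup equivalent to the `Or(a = b, IsNull(a = b))` condition, and they are also why the sketch covers only a single column.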
### Why are the changes needed?
TPCH is a common benchmark for distributed compute engines. The other 21
queries run fine on Spark; Query 16 is the exception. Applying this patch makes
Spark more competitive among these popular engines. In addition, the rule is
restrictive and only applies to the single-column NotInSubquery case, which is
safe enough.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
1. Manually ran org.apache.spark.sql.SQLQueryTestSuite.
2. Compared performance before and after applying this patch against TPCH
Query 16.