leanken opened a new pull request #29104:
URL: https://github.com/apache/spark/pull/29104


   ### What changes were proposed in this pull request?
   Normally, a NotInSubquery is planned as a BroadcastNestedLoopJoin, which 
is very time-consuming, for instance in TPCH Query 16.
   
   ```
   select
       p_brand,
       p_type,
       p_size,
       count(distinct ps_suppkey) as supplier_cnt
   from
       partsupp,
       part
   where
       p_partkey = ps_partkey
       and p_brand <> 'Brand#45'
       and p_type not like 'MEDIUM POLISHED%'
       and p_size in (49, 14, 23, 45, 19, 3, 36, 9)
       and ps_suppkey not in (
           select
               s_suppkey
           from
               supplier
           where
               s_comment like '%Customer%Complaints%'
       )
   group by
       p_brand,
       p_type,
       p_size
   order by
       supplier_cnt desc,
       p_brand,
       p_type,
       p_size
   ```
   
   The NOT IN subquery above is planned into
   
   LeftAnti
       condition Or((ps_suppkey=s_suppkey), IsNull(ps_suppkey=s_suppkey))
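
To make the semantics of this condition concrete, here is a minimal sketch in plain Python (not Spark code) of a LeftAnti join evaluated as a nested loop. It assumes SQL NULL is modelled as `None`; the function names `sql_eq`, `anti_join_condition` and `left_anti_nested_loop` are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Sequence


def sql_eq(a: Optional[int], b: Optional[int]) -> Optional[bool]:
    """SQL '=' under three-valued logic: NULL (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b


def anti_join_condition(a: Optional[int], b: Optional[int]) -> bool:
    """Or(a = b, IsNull(a = b)): true on a match or when the comparison is NULL."""
    eq = sql_eq(a, b)
    return eq is None or eq


def left_anti_nested_loop(
    streamed: Sequence[Optional[int]], build: Sequence[Optional[int]]
) -> List[Optional[int]]:
    """Keep a streamed row only if NO build row satisfies the condition.

    Every streamed row may be compared with every build row,
    which is the M*N cost described in this PR.
    """
    return [
        s for s in streamed
        if not any(anti_join_condition(s, b) for b in build)
    ]
```

For example, `left_anti_nested_loop([1, 2, None, 4], [2, 3])` keeps only `1` and `4`, while a single NULL on the build side removes every streamed row.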
   
   BroadcastNestedLoopJoinExec performs M\*N comparisons. If the buildSide is 
small enough, we can always turn the buildSide into a HashSet, so that the 
streamedSide only needs to look up each row in the HashSet, which reduces the 
computation to M\*log(N).
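
The idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code in this patch: NULL is modelled as `None`, the join key is a single column, and `left_anti_hashed` is a hypothetical name. The special cases for an empty build side and for a NULL on the build side follow from the `Or(..., IsNull(...))` condition shown earlier.

```python
from typing import List, Optional, Sequence


def left_anti_hashed(
    streamed: Sequence[Optional[int]], build: Sequence[Optional[int]]
) -> List[Optional[int]]:
    """Single-column NOT IN evaluated with a hash set instead of a nested loop."""
    build_rows = list(build)

    # Empty build side: no build row can satisfy the condition, so keep all rows.
    if not build_rows:
        return list(streamed)

    # A NULL on the build side makes the condition true for every streamed row.
    if any(b is None for b in build_rows):
        return []

    # Build the set once, then do one lookup per streamed row.
    keys = set(build_rows)

    # A NULL streamed key compares as NULL against a non-empty build side,
    # so it is dropped; other rows are kept only if the key is absent.
    return [s for s in streamed if s is not None and s not in keys]
```

Each streamed row now costs one set lookup rather than a scan over all N build rows, which is where the saving over the nested loop comes from.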
   
   This optimization only targets the single-column NotInSubquery case.
   After applying this patch, the running time of TPCH Query 16 drops from 
41 minutes to 30 seconds.
   
   ### Why are the changes needed?
   TPCH is a common benchmark for distributed compute engines. The other 21 
queries run fine on Spark, but Query 16 does not; applying this patch makes 
Spark more competitive among these popular engines. In addition, this patch has 
restrictive rules and only applies to the single-column NotInSubquery case, 
which is safe enough.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   1. Manually ran org.apache.spark.sql.SQLQueryTestSuite.
   2. Compared performance before and after applying this patch against TPCH 
Query 16.




