Leanken.Lin created SPARK-32290:
-----------------------------------

             Summary: NotInSubquery SingleColumn Optimize
                 Key: SPARK-32290
                 URL: https://issues.apache.org/jira/browse/SPARK-32290
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Leanken.Lin
             Fix For: 3.1.0


Normally,
A NotInSubquery will plan into BroadcastNestedLoopJoinExec, which is very very 
time consuming. For example, I've done TPCH benchmark lately, Query 16 almost 
took half of the entire TPCH 22Query execution Time. So i proposed that to do 
the following optimize.

Inside BroadcastNestedLoopJoinExec, we can identify not in subquery with only 
single column in following pattern.

{code:java}
case _@Or(
            _@EqualTo(leftAttr: AttributeReference, rightAttr: 
AttributeReference),
            _@IsNull(
              _@EqualTo(_: AttributeReference, _: AttributeReference)
            )
          )
{code}

if buildSide rows is small enough, we can change build side data into a HashMap.
so the M*N calculation can be optimized into M*log(N)

I've done a benchmark job in 1TB TPCH, before apply the optimize
Query 16 take around 18 mins to finish, after apply the M*log(N) optimize, it 
takes only 30s to finish. 

But this optimize only works on single column not in subquery, so i am here to 
seek advise whether the community need this update or not. I will do the pull 
request first, if the community member thought it's hack, it's fine to just 
ignore this request.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to