leanken opened a new pull request #29455:
URL: https://github.com/apache/spark/pull/29455


   ### What changes were proposed in this pull request?
   In [SPARK-32290](https://issues.apache.org/jira/browse/SPARK-32290), we 
optimized the NAAJ (null-aware anti join) scenario from BNLJ to BHJ, but 
skipped checking the BuildSide plan against the 
`spark.sql.autoBroadcastJoinThreshold` parameter. In the worst case, a 
BuildSide plan that is too big might cause a Driver OOM, so NAAJ support for 
ShuffledHashJoin (SHJ) is important as well.
   
   Supporting SHJ for NAAJ is difficult in the NullKey scenario. For a normal 
HashedRelation and for EmptyHashedRelation, the code logic should be the same 
for BHJ and SHJ. But if a NullKey exists in the global BuildSide data, only one 
partition is built into EmptyHashedRelationWithAllNullKeys, and that partition 
cannot `fast stop` the entire RDD. After offline discussion with some 
committers, we decided to support NAAJ for SHJ only when AQE is on: with AQE, 
the Shuffle is pre-executed, so we know whether the BuildSide contains a 
NullKey before the actual JOIN is executed.
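
   The NullKey problem can be illustrated outside Spark. The following is a 
minimal plain-Python sketch, not the code in this PR: `null_aware_anti_join`, 
`partition`, and `naive_per_partition` are hypothetical names, and the 
hash-partitioning scheme is an assumption made only for illustration. It shows 
that evaluating NAAJ independently per partition gives a wrong answer when 
only one build partition sees the NullKey.

```python
from typing import List, Optional, Sequence

Key = Optional[int]


def null_aware_anti_join(streamed: Sequence[Key], build: Sequence[Key]) -> List[Key]:
    """Single-node reference for `streamed_key NOT IN (build keys)`."""
    if len(build) == 0:
        # Empty build side: every streamed row passes.
        return list(streamed)
    if any(k is None for k in build):
        # A null key on the build side means NOT IN is never true.
        return []
    key_set = set(build)
    # A null streamed key never passes against a non-empty build side.
    return [k for k in streamed if k is not None and k not in key_set]


def partition(keys: Sequence[Key], n: int) -> List[List[Key]]:
    """Hypothetical hash partitioning; null keys all land in partition 0."""
    parts: List[List[Key]] = [[] for _ in range(n)]
    for k in keys:
        parts[0 if k is None else hash(k) % n].append(k)
    return parts


def naive_per_partition(streamed: Sequence[Key], build: Sequence[Key], n: int) -> List[Key]:
    """Incorrect: each partition decides on its own, with no global null-key flag."""
    out: List[Key] = []
    for s, b in zip(partition(streamed, n), partition(build, n)):
        out.extend(null_aware_anti_join(s, b))
    return out


streamed_keys = [1, 2, 3, 4]
build_keys = [2, None]

print(null_aware_anti_join(streamed_keys, build_keys))    # → []
print(naive_per_partition(streamed_keys, build_keys, 2))  # → [1, 3]
```

   In this sketch the partition holding the NullKey correctly returns nothing, 
but the other partition sees an empty build side and emits its rows, which is 
why a global view of the BuildSide is needed before the join runs.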
   
   Basically, in the NAAJ SHJ implementation, we collect whether the BuildSide 
is empty or contains a NullKey, and keep this information in the 
ShuffleExchangeExec metrics. During AQE, we rewrite these two cases to avoid 
NAAJ altogether: an empty BuildSide becomes the StreamedSidePlan, and a 
BuildSide containing a NullKey becomes an empty LocalTableScan. A normal 
relation is processed in a distributed style.
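
   The decision described above can be sketched roughly as follows. This is 
plain Python rather than Spark code; `BuildSideStats`, `collect_stats`, and 
`choose_naaj_plan` are hypothetical names, and the mapping of each case to a 
plan is an assumption derived from NOT IN semantics (empty build side keeps 
every streamed row, a NullKey on the build side yields no rows), not a copy of 
the actual rule in this PR.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class BuildSideStats:
    """Hypothetical stand-in for the information kept in the shuffle metrics."""
    is_empty: bool
    has_null_key: bool


def collect_stats(partitions: Sequence[Sequence[Optional[int]]]) -> BuildSideStats:
    """Aggregate per-partition observations into one global view of the build side."""
    is_empty = all(len(p) == 0 for p in partitions)
    has_null_key = any(k is None for p in partitions for k in p)
    return BuildSideStats(is_empty=is_empty, has_null_key=has_null_key)


def choose_naaj_plan(stats: BuildSideStats) -> str:
    """Pick the plan shape once the shuffle has materialized (assumed mapping)."""
    if stats.is_empty:
        # NOT IN over an empty set keeps every streamed row.
        return "StreamedSidePlan"
    if stats.has_null_key:
        # NOT IN over a set containing NULL keeps no rows.
        return "LocalTableScan(empty)"
    # Normal relation: run the join in a distributed style.
    return "ShuffledHashJoin"


shuffled_build: List[List[Optional[int]]] = [[2, None], []]
print(choose_naaj_plan(collect_stats(shuffled_build)))  # → LocalTableScan(empty)
```

   Because the statistics are aggregated over all partitions, a single 
partition holding the NullKey is enough to short-circuit the whole join.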
   
   ### Why are the changes needed?
   If the BuildSide plan is too big in the NAAJ scenario, it might cause a 
Driver OOM, so a distributed implementation of NAAJ (ShuffledHashJoin) is 
necessary.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   * Added a case in AdaptiveQueryExecSuite.
   * Updated a case in SubquerySuite.
   * Made sure JoinSuite and SQLQueryTestSuite passed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.



