AngersZhuuuu commented on pull request #29692:
URL: https://github.com/apache/spark/pull/29692#issuecomment-690369334


   > What's the high-level idea? We can handle skew SMJ because there is a 
shuffle and we can split the partition with the granularity of shuffle blocks. 
Broadcast join doesn't have shuffles.
   
   Yeah,  thought a lot that normal SQL can't match this case.
   
   For our production env experience, data skew (such as stream side is group 
by and skewed )before broadcast nested loop join always cause a  long running 
time.
   
   two ways to avoid this case :
    1. avoid nested loop join.
    2. use distribute by to increase parallelism by dispersing data.
   
   method 1 need to handle SQL logic, for method 2, although it will cause one 
more time  shuffle, it 's  narrow dependent. Nowadays, network cost is cheap 
and always not a  bottleneck.
   
   After try with AQE, AQE's mode is not suitable for this case. Since it 
doesn't have a shuffle before BNLJ. 
   In our inner version,  in BroadcastNestedLoopJoinExec, after stream side 
executed, we will get the raw count of each partition and judge if it's skewed 
seriously, if skewed seriously and volume is large, repartition stream side to 
make stream side RDD average.
   
   It's not very elegant but the benefits are very clear. I am not sure if 
community will accept this way.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to