AngersZhuuuu commented on pull request #29692:
URL: https://github.com/apache/spark/pull/29692#issuecomment-690369334
> What's the high-level idea? We can handle skew SMJ because there is a
shuffle and we can split the partition with the granularity of shuffle blocks.
Broadcast join doesn't have shuffles.
Yeah, thought a lot that normal SQL can't match this case.
For our production env experience, data skew (such as stream side is group
by and skewed )before broadcast nested loop join always cause a long running
time.
two ways to avoid this case :
1. avoid nested loop join.
2. use distribute by to increase parallelism by dispersing data.
method 1 need to handle SQL logic, for method 2, although it will cause one
more time shuffle, it 's narrow dependent. Nowadays, network cost is cheap
and always not a bottleneck.
After try with AQE, AQE's mode is not suitable for this case. Since it
doesn't have a shuffle before BNLJ.
In our inner version, in BroadcastNestedLoopJoinExec, after stream side
executed, we will get the raw count of each partition and judge if it's skewed
seriously, if skewed seriously and volume is large, repartition stream side to
make stream side RDD average.
It's not very elegant but the benefits are very clear. I am not sure if
community will accept this way.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]