[GitHub] [spark] LantaoJin opened a new pull request #29021: [SPARK-32201][SQL] More general skew join pattern matching

GitBox Tue, 07 Jul 2020 00:26:32 -0700


LantaoJin opened a new pull request #29021:
URL: https://github.com/apache/spark/pull/29021

### What changes were proposed in this pull request?
Currently, the AQE skew join handling logic is very specific.
It can only handle a pattern like this:
```
SMJ
Sort
Shuffle
Sort
Shuffle
```

We propose a more general skew join pattern matching patch that requires few
code changes.
With this patch, we can handle 3-table joins, joins with aggregation, and so on.
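
To illustrate the difference between the strict pattern and a more general one, here is a minimal sketch on a toy plan tree. It is not the code in this patch and does not use Spark's classes: `Node`, `strict_match`, `find_shuffle`, and `relaxed_match` are hypothetical names, and the relaxed rule (look through any single-child operator until a shuffle is found) is an assumption made for illustration only. It covers the join-with-aggregation shape; a real rule would also have to check which intermediate operators are safe to look through.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Toy physical-plan node; `name` is an operator label, not a Spark class."""
    name: str
    children: List["Node"] = field(default_factory=list)


def strict_match(plan: Node) -> Optional[Tuple[Node, Node]]:
    """Match only SMJ(Sort(Shuffle), Sort(Shuffle)); return the two shuffles."""
    if plan.name != "SMJ" or len(plan.children) != 2:
        return None
    shuffles = []
    for sort in plan.children:
        if sort.name != "Sort" or len(sort.children) != 1:
            return None
        child = sort.children[0]
        if child.name != "Shuffle":
            return None
        shuffles.append(child)
    return shuffles[0], shuffles[1]


def find_shuffle(node: Node) -> Optional[Node]:
    """Descend through single-child operators until a Shuffle is reached."""
    while node.name != "Shuffle":
        if len(node.children) != 1:
            return None  # leaf or multi-child operator: give up on this side
        node = node.children[0]
    return node


def relaxed_match(plan: Node) -> Optional[Tuple[Node, Node]]:
    """Match an SMJ whose two sides each lead to a Shuffle via unary operators."""
    if plan.name != "SMJ" or len(plan.children) != 2:
        return None
    left = find_shuffle(plan.children[0])
    right = find_shuffle(plan.children[1])
    if left is None or right is None:
        return None
    return left, right


# The pattern from the description: both matchers accept it.
simple = Node("SMJ", [
    Node("Sort", [Node("Shuffle")]),
    Node("Sort", [Node("Shuffle")]),
])

# Join with aggregation: an extra operator sits between Sort and Shuffle.
agg_join = Node("SMJ", [
    Node("Sort", [Node("HashAggregate", [Node("Shuffle")])]),
    Node("Sort", [Node("Shuffle")]),
])
```

On these inputs, `strict_match(agg_join)` returns `None` because the left `Sort` is not directly above a `Shuffle`, while `relaxed_match(agg_join)` still returns both shuffle nodes.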

### Why are the changes needed?
In our production use cases, we found many slow jobs caused by data skew
even though AQE skew join was enabled. After investigating their patterns, we
found that the current skew join handling logic is so specific that it matches
few production queries. Production queries are much more complicated than this
pattern:
```
SMJ
Sort
Shuffle
Sort
Shuffle
```
Here is a straightforward case:

![Screen_Shot_2020-07-06_at_2_55_34_PM](https://user-images.githubusercontent.com/1853780/86734769-d6ba0800-c064-11ea-9b94-2276ceec54e5.jpg)

The plan above joins 5 tables. It is not a simple case that can be matched by
the pattern above. But it looks very similar to the pattern once all the
**red** boxes are removed.

The stage graph makes the plan much more straightforward:

![Screen_Shot_2020-07-06_at_2_54_56_PM](https://user-images.githubusercontent.com/1853780/86735373-56e06d80-c065-11ea-9b49-ba47717b9d4b.jpg)
The pattern in the green boxes is what we want to handle, whether or not the
red boxes exist.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a unit test.

