LantaoJin opened a new pull request #29021:
URL: https://github.com/apache/spark/pull/29021


   ### What changes were proposed in this pull request?
   Currently, the AQE skew join handling logic is very specific.
   It can only handle plans that match this pattern:
   ```
     SMJ
        Sort
          Shuffle
        Sort
          Shuffle
   ```
   
   We propose more general skew join pattern matching with few code changes.
   With this patch, AQE can also handle cases such as 3-table joins and joins with aggregation.
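
To illustrate the difference between matching only the exact pattern above and matching a more general shape, here is a minimal toy sketch. It is not the code in this patch and does not use any Spark API: `Node`, `strict_match`, `relaxed_match`, and the `SKIPPABLE` operator set are all hypothetical names chosen for illustration, and the sketch assumes the extra operators are single-child nodes sitting between the join and its shuffles. Whether skipping a given operator is actually safe for skew handling is not addressed here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Toy plan node (hypothetical; not a Spark class)."""
    name: str
    children: List["Node"] = field(default_factory=list)


# Hypothetical set of single-child operators the relaxed matcher may step over.
SKIPPABLE = frozenset({"Sort", "Project", "Filter", "HashAggregate"})


def strict_match(node: Node) -> bool:
    """True only for SMJ whose two children are exactly Sort -> Shuffle."""
    if node.name != "SMJ" or len(node.children) != 2:
        return False
    return all(
        c.name == "Sort"
        and len(c.children) == 1
        and c.children[0].name == "Shuffle"
        for c in node.children
    )


def find_shuffle(node: Node) -> Optional[Node]:
    """Walk down through skippable single-child operators to a Shuffle."""
    while True:
        if node.name == "Shuffle":
            return node
        if node.name in SKIPPABLE and len(node.children) == 1:
            node = node.children[0]
        else:
            return None


def relaxed_match(node: Node) -> bool:
    """True for SMJ if each side reaches a Shuffle via skippable operators."""
    if node.name != "SMJ" or len(node.children) != 2:
        return False
    return all(find_shuffle(c) is not None for c in node.children)


# The exact pattern from the description.
simple = Node("SMJ", [
    Node("Sort", [Node("Shuffle")]),
    Node("Sort", [Node("Shuffle")]),
])

# Same shape with one extra operator on the left side.
extended = Node("SMJ", [
    Node("Sort", [Node("Project", [Node("Shuffle")])]),
    Node("Sort", [Node("Shuffle")]),
])

# One side never reaches a Shuffle, so neither matcher accepts it.
no_shuffle = Node("SMJ", [
    Node("Sort", [Node("Scan")]),
    Node("Sort", [Node("Shuffle")]),
])
```

In this toy model, `simple` satisfies both matchers, `extended` satisfies only `relaxed_match`, and `no_shuffle` satisfies neither.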
   
   
   ### Why are the changes needed?
   In our production use cases, we found many slow jobs caused by data skew even though AQE skew join handling was enabled. After investigating their patterns, we found that the current skew join handling logic is so specific that it covers few production queries. Production queries are much more complicated than this pattern:
   ```
     SMJ
        Sort
          Shuffle
        Sort
          Shuffle
   ```
   Here is a straightforward case:
   
![Screen_Shot_2020-07-06_at_2_55_34_PM](https://user-images.githubusercontent.com/1853780/86734769-d6ba0800-c064-11ea-9b94-2276ceec54e5.jpg)
   
   The plan above joins 5 tables. It is not a simple case that the pattern above can match, but it looks very similar to that pattern once all the **red** boxes are removed.
   
   From the stage graph, the plan is much more straightforward:
   
![Screen_Shot_2020-07-06_at_2_54_56_PM](https://user-images.githubusercontent.com/1853780/86735373-56e06d80-c065-11ea-9b49-ba47717b9d4b.jpg)
   The pattern in the green boxes is what we want to handle, whether or not the red boxes exist.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Added a unit test.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


