cloud-fan commented on PR #39633:
URL: https://github.com/apache/spark/pull/39633#issuecomment-1397986059
Since storage partitioned join is becoming more and more complicated, shall
we revisit the framework? Essentially we need to join the partitions of two
v2 tables at the driver side, and launch a task to process each partition pair.
One idea is to add a physical rule before `EnsureRequirements`. This new rule
matches join of two v2 tables, applies a join strategy, and sets a special flag
in the join node so that `EnsureRequirements` can skip it. This is similar to
`ShuffledJoin.isSkewJoin`:
```scala
trait ShuffledJoin extends JoinCodegenSupport {
  def isSkewJoin: Boolean

  override def requiredChildDistribution: Seq[Distribution] = {
    if (isSkewJoin) {
      // We re-arrange the shuffle partitions to deal with skew join, and the new children
      // partitioning doesn't satisfy `HashClusteredDistribution`.
      UnspecifiedDistribution :: UnspecifiedDistribution :: Nil
    } else {
      ClusteredDistribution(leftKeys) :: ClusteredDistribution(rightKeys) :: Nil
    }
  }
}
```
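To make the idea concrete, here is a minimal, self-contained sketch of the flag-based skip. None of these class or object names exist in Spark (`V2Scan`, `skipEnsureRequirements`, `ApplyStoragePartitionedJoin` are all hypothetical); it only models the shape of the proposal: a rule that runs before `EnsureRequirements`, matches a join of two v2 tables, and sets a flag so that `EnsureRequirements` leaves the node alone, analogous to `isSkewJoin`.

```scala
// Toy plan model -- purely illustrative, not Spark's actual plan classes.
sealed trait Plan
case class V2Scan(table: String) extends Plan
case class Join(
    left: Plan,
    right: Plan,
    // Analogous to `isSkewJoin`: when set, EnsureRequirements skips this node.
    skipEnsureRequirements: Boolean = false) extends Plan

// The proposed new rule: match a join of two v2 tables, apply a (stubbed)
// partition-pairing strategy at the driver, and set the skip flag.
object ApplyStoragePartitionedJoin {
  def apply(plan: Plan): Plan = plan match {
    case Join(l: V2Scan, r: V2Scan, _) =>
      Join(l, r, skipEnsureRequirements = true)
    case other => other
  }
}

// EnsureRequirements analogue: flagged joins keep their child partitioning;
// unflagged joins would normally get shuffles/sorts inserted here.
object EnsureRequirementsSketch {
  def apply(plan: Plan): Plan = plan match {
    case j: Join if j.skipEnsureRequirements => j // leave partitioning as-is
    case j: Join => j // real rule would insert exchanges here
    case other => other
  }
}
```

With this shape, the join strategy decision (which partition pairs to launch tasks for) lives entirely in the new rule, and `EnsureRequirements` only needs to honor the flag, keeping the two concerns separate.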