cloud-fan commented on PR #39633:
URL: https://github.com/apache/spark/pull/39633#issuecomment-1397986059
Since storage partitioned join is becoming more and more complicated, shall
we revisit the framework? Essentially we need to join the partitions of two
v2 tables at the driver side, and launch a task to process each partition pair.
One idea is to add a physical rule before `EnsureRequirements`. This new rule
matches join of two v2 tables, applies a join strategy, and sets a special flag
in the join node so that `EnsureRequirements` can skip it. This is similar to
`ShuffledJoin.isSkewJoin`:
```scala
trait ShuffledJoin extends JoinCodegenSupport {
  def isSkewJoin: Boolean

  override def requiredChildDistribution: Seq[Distribution] = {
    if (isSkewJoin) {
      // We re-arrange the shuffle partitions to deal with skew join, and the new children
      // partitioning doesn't satisfy `HashClusteredDistribution`.
      UnspecifiedDistribution :: UnspecifiedDistribution :: Nil
    } else {
      ClusteredDistribution(leftKeys) :: ClusteredDistribution(rightKeys) :: Nil
    }
  }
}
```
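To make the idea concrete, here is a minimal, self-contained sketch of the flag-based skip. None of these class or object names exist in Spark (`V2Scan`, `skipEnsureRequirements`, `ApplyStoragePartitionedJoin` are all hypothetical); it only models the shape of the proposal: a rule that runs before `EnsureRequirements`, matches a join of two v2 tables, and sets a flag so that `EnsureRequirements` leaves the node alone, analogous to `isSkewJoin`.

```scala
// Toy plan model -- purely illustrative, not Spark's actual plan classes.
sealed trait Plan
case class V2Scan(table: String) extends Plan
case class Join(
    left: Plan,
    right: Plan,
    // Analogous to `isSkewJoin`: when set, EnsureRequirements skips this node.
    skipEnsureRequirements: Boolean = false) extends Plan

// The proposed new rule: match a join of two v2 tables, apply a (stubbed)
// partition-pairing strategy at the driver, and set the skip flag.
object ApplyStoragePartitionedJoin {
  def apply(plan: Plan): Plan = plan match {
    case Join(l: V2Scan, r: V2Scan, _) =>
      Join(l, r, skipEnsureRequirements = true)
    case other => other
  }
}

// EnsureRequirements analogue: flagged joins keep their child partitioning;
// unflagged joins would normally get shuffles/sorts inserted here.
object EnsureRequirementsSketch {
  def apply(plan: Plan): Plan = plan match {
    case j: Join if j.skipEnsureRequirements => j // leave partitioning as-is
    case j: Join => j // real rule would insert exchanges here
    case other => other
  }
}
```

With this shape, the join strategy decision (which partition pairs to launch tasks for) lives entirely in the new rule, and `EnsureRequirements` only needs to honor the flag, keeping the two concerns separate.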