chirag-s-db commented on PR #53859: URL: https://github.com/apache/spark/pull/53859#issuecomment-3818751056
Could we also support the case where the KeyGroupedPartitioning is the output of the plan with the following approach? 1. By default, `disableGrouping` is true in both the scan and in a field in the `KeyGroupedPartitioning`. 2. With `disableGrouping=true`, a `KeyGroupedPartitioning` can only satisfy the requirements of an UnspecifiedDistribution (or any distribution if there is only a single partition). However, with `disableGrouping=true`, we don't allow a `KeyGroupedPartitioning` (w/ > 1 partition) to satisfy the requirements of a Clustered or Ordered distribution. 3. In `EnsureRequirements`, we add a new case [here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L70) that checks if the `KeyGroupedPartitioning` _could_ satisfy the requirement if it were to be grouped (essentially, if the current implementation of `satisfies` for `KeyGroupedPartitioning` that take into consideration whether grouping is enabled or disabled). If so, then we push down `disableGrouping=false` to _both_ the scan and the `KeyGroupedPartitioning`, after which point we know that the partitioning has been used to satisfy some requirements, so we must do grouping. If we can't push down to the scan (for example, if the plan reporting `KeyGroupedPartitioning` is checkpointed), then we just add a shuffle as normal. One advantage of this approach is that it allows us to avoid grouping for the (presumably not uncommon) case of a simple scan from a partitioned table, and it should still be safe for checkpointed scans (as the checkpointed scans would have a `KeyGroupedPartitioning` w/ `disableGrouping=true`, which would not satisfy most required distributions). This approach should also decrease the complexity of the EnsureRequirements changes (since we wouldn't have to catch all the cases in which a KeyGroupedPartitioning scan doesn't contribute to the output partitioning of the plan). FYI @szehon-ho -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
