Re: [PR] [SPARK-55092][SQL] Disable partition grouping in `KeyGroupedPartitioning` when not needed [spark]

via GitHub Thu, 29 Jan 2026 08:22:17 -0800


chirag-s-db commented on PR #53859:
URL: https://github.com/apache/spark/pull/53859#issuecomment-3818751056

Could we also support the case where the KeyGroupedPartitioning is the
output of the plan with the following approach?

1. By default, `disableGrouping` is true in both the scan and in a field in
the `KeyGroupedPartitioning`.
2. With `disableGrouping=true`, a `KeyGroupedPartitioning` can only satisfy
the requirements of an UnspecifiedDistribution (or any distribution if there is
only a single partition). However, with `disableGrouping=true`, we don't allow
a `KeyGroupedPartitioning` (w/ > 1 partition) to satisfy the requirements of a
Clustered or Ordered distribution.
3. In `EnsureRequirements`, we add a new case
[here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L70)
that checks if the `KeyGroupedPartitioning` _could_ satisfy the requirement if
it were to be grouped (essentially, if the current implementation of
`satisfies` for `KeyGroupedPartitioning` that take into consideration whether
grouping is enabled or disabled). If so, then we push down
`disableGrouping=false` to _both_ the scan and the `KeyGroupedPartitioning`,
after which point we know that the partitioning has been used to satisfy some
requirements, so we must do grouping. If we can't push down to the scan (for
example, if the plan reporting `KeyGroupedPartitioning` is checkpointed), then
we just add a shuffle as normal.

One advantage of this approach is that it allows us to avoid grouping for
the (presumably not uncommon) case of a simple scan from a partitioned table,
and it should still be safe for checkpointed scans (as the checkpointed scans
would have a `KeyGroupedPartitioning` w/ `disableGrouping=true`, which would
not satisfy most required distributions). This approach should also decrease
the complexity of the EnsureRequirements changes (since we wouldn't have to
catch all the cases in which a KeyGroupedPartitioning scan doesn't contribute
to the output partitioning of the plan).

FYI @szehon-ho

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-55092][SQL] Disable partition grouping in `KeyGroupedPartitioning` when not needed [spark]

Reply via email to