chirag-s-db commented on PR #53859:
URL: https://github.com/apache/spark/pull/53859#issuecomment-3818751056

   Could we also support the case where the KeyGroupedPartitioning is the 
output of the plan with the following approach?
   
   1. By default, `disableGrouping` is true in both the scan and in a field in 
the `KeyGroupedPartitioning`.
   2. With `disableGrouping=true`, a `KeyGroupedPartitioning` can only satisfy 
the requirements of an UnspecifiedDistribution (or any distribution if there is 
only a single partition). However, with `disableGrouping=true`, we don't allow 
a `KeyGroupedPartitioning` (w/ > 1 partition) to satisfy the requirements of a 
Clustered or Ordered distribution.
   3. In `EnsureRequirements`, we add a new case 
[here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L70)
 that checks if the `KeyGroupedPartitioning` _could_ satisfy the requirement if 
it were to be grouped (essentially, if the current implementation of 
`satisfies` for `KeyGroupedPartitioning` that take into consideration whether 
grouping is enabled or disabled). If so, then we push down 
`disableGrouping=false` to _both_ the scan and the `KeyGroupedPartitioning`, 
after which point we know that the partitioning has been used to satisfy some 
requirements, so we must do grouping. If we can't push down to the scan (for 
example, if the plan reporting `KeyGroupedPartitioning` is checkpointed), then 
we just add a shuffle as normal.
   
   One advantage of this approach is that it allows us to avoid grouping for 
the (presumably not uncommon) case of a simple scan from a partitioned table, 
and it should still be safe for checkpointed scans (as the checkpointed scans 
would have a `KeyGroupedPartitioning` w/ `disableGrouping=true`, which would 
not satisfy most required distributions). This approach should also decrease 
the complexity of the EnsureRequirements changes (since we wouldn't have to 
catch all the cases in which a KeyGroupedPartitioning scan doesn't contribute 
to the output partitioning of the plan).
   
   FYI @szehon-ho 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to