peter-toth commented on PR #54330: URL: https://github.com/apache/spark/pull/54330#issuecomment-3992828703
> 1. `isGrouped` recomputes on every `.copy()` Good idea! Fixed in https://github.com/apache/spark/pull/54330/commits/853e6b72fcdb7e528f2a394c7d658091facbf9f1 and https://github.com/apache/spark/pull/54330/commits/94054cb7c9029c0aa5120ffe5b7283aeebb725b2. > 2. `toGrouped` may not preserve sorted order Ok, done in https://github.com/apache/spark/pull/54330/commits/f5baf7670b6cba639b9b032586d161d7462f8efc. > 3. `GroupPartitionsExec.outputPartitioning` applies `transform` without verifying invariant > 4. `GroupPartitionsExec` has no `doCanonicalize` override > 6 . `equals`/`hashCode` on `KeyedPartitioning` allocates wrappers per key on every call I think these 3 comes down to how we store partitions keys in `KeyedPartitioning`. In https://github.com/apache/spark/pull/54330/commits/174fe90dbf27cbff2523d804ba8f2087bfd936e4 I changed it to `partitionKeys: Seq[InternalRowComparableWrapper]`, which made the code much cleaner at many places, simplified `equals`/`hashCode`, made `doCanonicalize` unnecessary. Interestingly, it also helped to identify a few fields that should be transient because `InternalRowComparableWrapper` is not serializable, but IMO we shoudn't send the keys (maybe not even any `Partitioning`s) to executors. > 5. `GroupPartitionsExec` doesn't handle columnar execution Good catch! I totally forgot about that. Added in https://github.com/apache/spark/pull/54330/commits/1b6bb2988a4ae9fca5f3050bf999be43a7ffd0ca. > 7. Dead variable in `ShuffleExchangeExec` Fixed in https://github.com/apache/spark/pull/54330/commits/fa432914e4981ea15fbe9d3463c2adcc16d05593. > 8. `applyGroupPartitions` doesn't handle `GroupPartitionsExec` nested under unary nodes Added documentation in https://github.com/apache/spark/pull/54330/commits/b43017d1736b20f20c9d27135ec2259d1fedc2ab. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
