prashantwason commented on PR #18092:
URL: https://github.com/apache/hudi/pull/18092#issuecomment-3947823227

   @danny0405 Thanks for the review!
   
   Regarding your questions:
   
   1. **Why sort in `filterPartitionPaths()` vs `generateClusteringPlan()`**: 
      The sorting is intentionally placed in `filterPartitionPaths()` rather 
than near the limit in `generateClusteringPlan()` because:
      - `filterPartitionPaths()` is the source of partition paths for 
clustering - sorting here ensures consistency regardless of how partitions are 
used downstream
      - The `clusteringMaxNumGroups` limit in `generateClusteringPlan()` 
operates on clustering groups (not partitions directly), so sorting partitions 
there would be too late
      - Sorting at the filter level is more defensive and ensures deterministic 
behavior in all code paths that consume the filtered partitions
   
   2. **Similar issues in compaction**: 
      Looking at the compaction code, the partition-aware strategies already 
handle sorting:
      - `BoundedPartitionAwareCompactionStrategy.filterPartitionPaths()` sorts 
with `Comparator.reverseOrder()` (line 73-74)
      - `UnBoundedPartitionAwareCompactionStrategy.filterPartitionPaths()` also 
sorts in reverse order
      
      So compaction doesn't have this issue - the bounded strategies already 
include sorting as part of their logic.
   
   Let me know if you have any other concerns!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to