[I] [to be discussed] Configure spark PartitionAwareClusteringPlanStrategy to build all clustering groups in spark driver [hudi]

via GitHub Thu, 15 Jan 2026 15:08:39 -0800


kbuci opened a new issue, #17902:
URL: https://github.com/apache/hudi/issues/17902


   ### Task Description
   
   **What needs to be done:**
   Currently, when 
`org.apache.hudi.table.action.cluster.strategy.PartitionAwareClusteringPlanStrategy#generateClusteringPlan`
 creates clustering groups for all targeted partitions, it uses the engine 
context to launch one task per partition to generate the groups. We want to add 
a config that, when enabled, bypasses the engine context and builds the 
clustering groups directly in the Spark driver.
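
   A minimal sketch of the proposed switch, for discussion only. It does not use 
Hudi classes: `ClusteringPlanSketch`, `buildGroupsForPartition`, `viaEngine`, 
`inDriver`, and the `buildGroupsInDriver` flag are all hypothetical names, and a 
parallel stream merely stands in for the engine context fanning out one task per 
partition. The actual config key and wiring would still need to be decided.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

class ClusteringPlanSketch {

    // Hypothetical stand-in for the per-partition work: turn one partition
    // path into its clustering groups (represented here as plain strings).
    static List<String> buildGroupsForPartition(String partitionPath) {
        List<String> groups = new ArrayList<>();
        groups.add(partitionPath + "/group-0");
        return groups;
    }

    // Imitates the current behaviour: one unit of work per partition handed
    // to an execution engine. A parallel stream is only an illustration here.
    static List<String> viaEngine(List<String> partitions,
                                  Function<String, List<String>> fn) {
        return partitions.parallelStream()
                .flatMap(p -> fn.apply(p).stream())
                .collect(Collectors.toList());
    }

    // Proposed alternative: a plain loop on the calling thread (the driver).
    static List<String> inDriver(List<String> partitions,
                                 Function<String, List<String>> fn) {
        List<String> all = new ArrayList<>();
        for (String p : partitions) {
            all.addAll(fn.apply(p));
        }
        return all;
    }

    // The flag (an assumed name) selects which path builds the groups.
    static List<String> generateGroups(List<String> partitions,
                                       boolean buildGroupsInDriver) {
        Function<String, List<String>> fn =
                ClusteringPlanSketch::buildGroupsForPartition;
        return buildGroupsInDriver
                ? inDriver(partitions, fn)
                : viaEngine(partitions, fn);
    }

    public static void main(String[] args) {
        List<String> partitions = List.of("2026/01/14", "2026/01/15");
        System.out.println(generateGroups(partitions, true));
        System.out.println(generateGroups(partitions, false));
    }
}
```

   Both paths should yield the same groups; the only intended difference is where 
the per-partition work runs, and therefore whose memory it consumes.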
   
   **Why this task is needed:**
   We have encountered Spark stage failures when attempting to perform 
clustering (to stitch small files together) on a single partition with many 
(700,000+) files in its DFS directory. To avoid having to continually increase 
executor memory, we implemented the above functionality in our internal Hudi 
0.x build. This works for us because our workloads typically assign the driver 
twice the memory of the executors, and we mostly target only 1-2 partitions per 
clustering plan.
   
   ### Task Type
   
   Code improvement/refactoring
   
   ### Related Issues
   
   **Parent feature issue:** (if applicable)
   **Related issues:**
   NOTE: Use the `Relationships` button to add parent/blocking issues after the 
issue is created.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
