kbuci opened a new issue, #17879:
URL: https://github.com/apache/hudi/issues/17879

   ### Task Description
   
   **What needs to be done:**
A new `HoodieWriteConfig` key should be added to indicate that any incomplete clustering plans with an expired heartbeat should be completely rolled back by `rollbackFailedWrites`, with the plan deleted. HUDI will assume that clustering plans are only attempted once (by the same writer client that scheduled the plan).
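
   For illustration, a user would opt into this behavior via a writer config flag along these lines (the key name below is hypothetical; the final name would be settled in the PR):

   ```properties
   # Hypothetical config key: roll back (and delete) incomplete clustering
   # plans with expired heartbeats during rollbackFailedWrites.
   hoodie.clustering.rollback.expired.heartbeat.enable=true
   ```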
   
   Specifically, HUDI should do the following if and only if this config key is enabled:
   - `scheduleClustering` should start a heartbeat for the determined instant time before publishing the plan to the timeline. The heartbeat should not be started when executing the clustering plan in `cluster`
   - `rollbackFailedWrites` should schedule and execute a rollback plan for any incomplete clustering instants with expired heartbeats
   - provide a new utility API `rollbackClusteringWithMatchingPartitions` that takes a list of partitions and attempts a rollback of any incomplete clustering plans targeting the same partitions (provided they have expired heartbeats)
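
   To make the intended semantics concrete, here is a minimal, Hudi-free sketch of the selection logic behind the last two bullets, assuming a heartbeat is considered expired once no update has been seen within a timeout. All class and method names here are hypothetical models of the proposal; the real implementation would operate on the active timeline and the heartbeat files, not on in-memory records.

   ```java
   import java.util.*;
   import java.util.stream.*;

   // Simplified model of the proposed rollback selection: an inflight
   // clustering plan is eligible for rollback only when its heartbeat has
   // expired; the partition-matching variant additionally filters by
   // partition overlap. Names are illustrative, not actual Hudi APIs.
   class ClusteringRollbackSelector {
       // Inflight clustering plan: instant time, target partitions,
       // and the epoch millis of its last heartbeat update.
       record InflightPlan(String instantTime, Set<String> partitions, long lastHeartbeatMs) {}

       private final long heartbeatTimeoutMs;

       ClusteringRollbackSelector(long heartbeatTimeoutMs) {
           this.heartbeatTimeoutMs = heartbeatTimeoutMs;
       }

       private boolean heartbeatExpired(InflightPlan plan, long nowMs) {
           return nowMs - plan.lastHeartbeatMs() > heartbeatTimeoutMs;
       }

       // Proposed rollbackFailedWrites behavior: every incomplete clustering
       // instant with an expired heartbeat is a rollback candidate.
       List<String> selectForRollback(List<InflightPlan> inflight, long nowMs) {
           return inflight.stream()
               .filter(p -> heartbeatExpired(p, nowMs))
               .map(InflightPlan::instantTime)
               .collect(Collectors.toList());
       }

       // Proposed rollbackClusteringWithMatchingPartitions behavior: only
       // plans that target one of the given partitions AND have an expired
       // heartbeat are rolled back.
       List<String> selectMatchingPartitions(List<InflightPlan> inflight,
                                             Collection<String> partitions,
                                             long nowMs) {
           return inflight.stream()
               .filter(p -> heartbeatExpired(p, nowMs))
               .filter(p -> p.partitions().stream().anyMatch(partitions::contains))
               .map(InflightPlan::instantTime)
               .collect(Collectors.toList());
       }
   }
   ```

   Note that a plan whose heartbeat is still live is never selected, even when its partitions match, which is what makes the utility safe to call from a concurrent clustering job.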
   
   Note that we are making the following assumptions:
   - The value of this config key is the same across all writers to a dataset 
(given a multi-writer setup)
   - If two writers attempt to roll back the same instant, no data corruption will occur and no exception will be thrown.
   
   **Why this task is needed** 
   For our datasets, we do not perform clustering in the same writer job as ingestion writes. Rather, we orchestrate a separate service that attempts clustering. In addition, we can have concurrent jobs attempt to cluster the same datasets with different types of clustering plans (for example, we may stitch files in newer and older partitions with different strategies). Given this, we had to implement the above requirements in our internal HUDI 0.x build in order to ensure:
   - If we cannot re-execute an existing clustering plan (due to needing more Spark resources than are currently available), then these inflight plans are rolled back without blocking the archival/clean/metadata table services, since for us clustering typically has a lower priority than other operations.
   - When HUDI schedules a clustering plan, it will ignore files already targeted by other incomplete clustering plans. This is desired behavior, but it means that our clustering service job must be able to roll back any existing "leftover" clustering plans targeting the same partition before proceeding to create a new clustering plan. The current job may not have sufficient Spark resources to re-attempt the plan, and executing this "leftover" plan may not be a high priority anyway.
   
   Once we achieve consensus on these requirements, we can start upstreaming our implementations.
   
   ### Task Type
   
   Code improvement/refactoring
   
   ### Related Issues
   
   **Parent feature issue:** (if applicable)
   **Related issues:**
   NOTE: Use `Relationships` button to add parent/blocking issues after issue 
is created.
   

