kbuci commented on issue #17879:
URL: https://github.com/apache/hudi/issues/17879#issuecomment-3776026228

   Sorry, let me clarify. Our goals are:
   - A way for clean to automatically roll back failed clustering writes 
(and delete their plans). 
   - A new utility API that takes a list of partitions and attempts to roll 
back any failed clustering writes targeting those partitions (and delete 
their plans). 
   Normally this would be unsafe (without implementing the RFC) due to the 
multi-writer cases you mentioned. But we are able to implement this cleanup 
logic internally because, for us:
   - Clustering is only attempted by a dedicated table service writer job that 
directly calls clustering APIs. In addition, we ensure that scheduling and 
execution will happen in the same job. 
   - We don't enable clustering "inline" within write commit / deltastreamer
   
    Because of this, in our internal build we can simply start a heartbeat in 
the schedule clustering call. Then, as with ingestion writes, once the 
heartbeat has expired HUDI can assume that the plan is safe to roll back and 
delete (either by clean or by the aforementioned utility API).
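
   A minimal sketch of the check described above, to make the idea concrete. 
It is not Hudi code: `ClusteringCleanupSketch`, `PendingClustering`, 
`isHeartbeatExpired`, and `findRollbackCandidates` are hypothetical names, and 
the heartbeat is modeled as a plain timestamp. It also assumes that an empty 
partition list means "consider every partition", which is how the clean path 
would call it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these types are Hudi classes.
public class ClusteringCleanupSketch {

    /** A pending (scheduled, not completed) clustering plan, reduced to what the check needs. */
    public static final class PendingClustering {
        final String instantTime;        // instant of the plan on the timeline
        final Set<String> partitions;    // partitions the plan targets
        final long lastHeartbeatMillis;  // last heartbeat written by the scheduling job

        public PendingClustering(String instantTime, Set<String> partitions, long lastHeartbeatMillis) {
            this.instantTime = instantTime;
            this.partitions = partitions;
            this.lastHeartbeatMillis = lastHeartbeatMillis;
        }
    }

    /** A plan is treated as abandoned once its heartbeat is older than the timeout. */
    public static boolean isHeartbeatExpired(PendingClustering plan, long nowMillis, long timeoutMillis) {
        return nowMillis - plan.lastHeartbeatMillis > timeoutMillis;
    }

    /**
     * Returns the instants of pending clustering plans that have an expired heartbeat
     * and touch at least one target partition. An empty target set is assumed to mean
     * "all partitions".
     */
    public static List<String> findRollbackCandidates(
            List<PendingClustering> pending, Set<String> targetPartitions,
            long nowMillis, long timeoutMillis) {
        List<String> candidates = new ArrayList<>();
        for (PendingClustering plan : pending) {
            boolean touchesTarget = targetPartitions.isEmpty()
                    || !Collections.disjoint(plan.partitions, targetPartitions);
            if (touchesTarget && isHeartbeatExpired(plan, nowMillis, timeoutMillis)) {
                candidates.add(plan.instantTime);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<PendingClustering> pending = Arrays.asList(
                new PendingClustering("001", new HashSet<>(Arrays.asList("p1")), 0L),
                new PendingClustering("002", new HashSet<>(Arrays.asList("p2")), 9_000L));
        // With now = 10s and a 5s timeout, only the plan with the stale heartbeat qualifies.
        System.out.println(findRollbackCandidates(
                pending, new HashSet<>(Arrays.asList("p1", "p2")), 10_000L, 5_000L));
    }
}
```

   This is only safe under the constraints listed above (a single dedicated 
table service job that both schedules and executes clustering, and no inline 
clustering); otherwise an expired heartbeat would not prove that no other 
writer is still executing the plan.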
   
   > Are you asking for the proposed RFC to be implemented. 
   
   No, we don't have to implement the RFC for this, although that would be the 
ideal long-term solution for keeping our clustering flow aligned with the 
existing OSS flow. 
   
   > You are looking for support to rollback and nuke an existing clustering 
plan(already scheduled).
   This should not be a big ask, assuming you can enable the config for just 1 
of the dedicated table service writer and whenever it detects a pending 
clustering plan in the timeline, it could rollback and nuke the plan. 
   
   Yes, though both the clustering and clean jobs should have this config 
enabled. Alternatively, this could be made a table-level config.
   
   > But chances that another concurrent ingestion writer could result in file 
not found issue which needs to be tackled.
   
   Yes, that's right; we have an internal fix for this that we can upstream.