kbuci commented on issue #17879:
URL: https://github.com/apache/hudi/issues/17879#issuecomment-3776026228
Sorry, let me clarify. Our goal is to:
- have a way for clean to automatically roll back (and delete the plans of)
clustering writes that have failed.
- add a new utility API that takes a list of partitions and attempts to roll
back (and delete the plans of) any failed clustering writes targeting those
partitions.
Normally this would be unsafe (without implementing the RFC) because of the
multi-writer cases you mentioned. But we are able to implement this cleanup
logic internally because, for us:
- Clustering is only attempted by a dedicated table service writer job that
directly calls clustering APIs. In addition, we ensure that scheduling and
execution will happen in the same job.
- We don't enable clustering "inline" within write commit / deltastreamer.
Because of this, in our internal build we can just start a heartbeat in the
schedule clustering call. Then, as with ingestion writes, once the heartbeat
has expired Hudi can assume that the plan is safe to roll back and delete
(either by clean or by the aforementioned utility API).
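
To make the eligibility rule concrete, here is a minimal sketch of the check that clean or the utility API could apply. It is illustrative only and not Hudi code: `PendingClusteringPlan`, `is_heartbeat_expired`, `plans_safe_to_roll_back`, and the millisecond timestamps are all hypothetical names and assumptions. It also assumes the heartbeat was started when the plan was scheduled.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class PendingClusteringPlan:
    """Hypothetical stand-in for a pending clustering instant on the timeline."""
    instant_time: str
    partitions: FrozenSet[str]   # partitions the plan targets
    last_heartbeat_ms: int       # last heartbeat written by the table service job


def is_heartbeat_expired(plan: PendingClusteringPlan, now_ms: int, timeout_ms: int) -> bool:
    # The owning job is presumed dead once no heartbeat arrived within the timeout.
    return now_ms - plan.last_heartbeat_ms > timeout_ms


def plans_safe_to_roll_back(
    plans: Iterable[PendingClusteringPlan],
    now_ms: int,
    timeout_ms: int,
    partitions: Optional[Iterable[str]] = None,
) -> List[PendingClusteringPlan]:
    """Return pending plans whose heartbeat expired.

    With partitions=None this models the clean path (consider every plan).
    With a partition list it models the utility API (only plans that target
    at least one of the given partitions).
    """
    wanted = None if partitions is None else set(partitions)
    eligible = []
    for plan in plans:
        if not is_heartbeat_expired(plan, now_ms, timeout_ms):
            continue  # the writer may still be alive, so leave the plan alone
        if wanted is not None and not (plan.partitions & wanted):
            continue  # plan does not touch any requested partition
        eligible.append(plan)
    return eligible
```

A caller would then roll back and delete the plan for each returned entry. This is only safe under the constraints listed above (a single dedicated job that both schedules and executes clustering, and no inline clustering).
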
> Are you asking for the proposed RFC to be implemented.
No, we don't have to implement the RFC for this, although that would be an
ideal long-term solution for keeping our clustering flow aligned with the
existing OSS flow.
> You are looking for support to rollback and nuke an existing clustering
> plan(already scheduled).
> This should not be a big ask, assuming you can enable the config for just 1
> of the dedicated table service writer and whenever it detects a pending
> clustering plan in the timeline, it could rollback and nuke the plan.
Yes, though both the clustering and clean jobs should have this config
enabled. Alternatively, this could be made a table-level config.
> But chances that another concurrent ingestion writer could result in file
> not found issue which needs to be tackled.
Yes, that's right. We have an internal fix for this that we can upstream.