[
https://issues.apache.org/jira/browse/HUDI-8886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
sivabalan narayanan updated HUDI-8886:
--------------------------------------
Description:
When a table service fails and is re-attempted, we trigger a rollback and then
re-attempt the table service. With clustering, we have a way to nuke the entire
clustering plan.
[https://github.com/apache/hudi/blob/baf141abbd6da022c66fa518588e34452a6902b4/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java#L142]
However, there is a chance we end up in the state below:
t6.rc.req
t6.rc.inflight
// trigger rollback.
t7.rb.req
t7.rb.inflight
delete all data files from t6.rc
write to metadata table
delete t6.rc* files from timeline
and we crash.
So, we might have a lingering rollback plan t7.rb.req and t7.rb.inflight in the
timeline forever.
This might only be an issue when we try to nuke the entire plan. Otherwise,
t6.rc.req is never cleaned up, and the next attempt will retry it.
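To make the failure state concrete, here is a minimal sketch of how such a
lingering rollback could be detected. It is not Hudi code: the timeline is
modeled as a plain set of file names like the ones listed above, and the
function name find_lingering_rollbacks and the rollback_targets mapping
(rollback instant -> instant being rolled back) are hypothetical names
introduced only for this illustration.

```python
from typing import Dict, List, Set


def find_lingering_rollbacks(timeline: Set[str],
                             rollback_targets: Dict[str, str]) -> List[str]:
    """Return pending rollback instants whose target instant has no files
    left on the (modeled) timeline.

    Both arguments are assumptions of this sketch, not Hudi structures:
    `timeline` is a set of file names such as "t7.rb.req", and
    `rollback_targets` maps a rollback instant to the instant it rolls back.
    """
    lingering = []
    for rb_instant, target in sorted(rollback_targets.items()):
        pending = (f"{rb_instant}.rb.req" in timeline
                   or f"{rb_instant}.rb.inflight" in timeline)
        completed = f"{rb_instant}.rb" in timeline
        # The target still exists if any of its files remain on the timeline.
        target_present = any(name.startswith(f"{target}.") for name in timeline)
        if pending and not completed and not target_present:
            lingering.append(rb_instant)
    return lingering


# State after the crash described above: t6.rc* is gone, t7.rb.* remains.
crashed = {"t7.rb.req", "t7.rb.inflight"}
print(find_lingering_rollbacks(crashed, {"t7": "t6"}))  # prints ['t7']
```

If t6.rc.req were still on the timeline (the case where the plan is not
nuked), the same check would report nothing, which matches the note that the
next attempt simply retries it.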
was:
when a table service fails, and is re-attempted, we trigger a rollback followed
by re-attempting the table service. w/ clustering, we have a way to nuke the
entire clustering plan.
[https://github.com/apache/hudi/blob/baf141abbd6da022c66fa518588e34452a6902b4/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java#L142]
but there are chances we end up in below state
t6.rc.req
t6.rc.inflight
// trigger rollback.
t7.rb.req
t7.rb.inflight
delete all data files from t6.rc
write to metadata table
delete t6.rc* files from timeline
and we crash.
So, we might have a lingering rollback plan t7.rb.req and t7.rb.inflight in the
timeline forever.
> Fix leftover/lingering rollbacks for table services
> ---------------------------------------------------
>
> Key: HUDI-8886
> URL: https://issues.apache.org/jira/browse/HUDI-8886
> Project: Apache Hudi
> Issue Type: Bug
> Components: table-service
> Reporter: sivabalan narayanan
> Priority: Major
>
> When a table service fails and is re-attempted, we trigger a rollback and
> then re-attempt the table service. With clustering, we have a way to nuke
> the entire clustering plan.
> [https://github.com/apache/hudi/blob/baf141abbd6da022c66fa518588e34452a6902b4/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java#L142]
>
>
> However, there is a chance we end up in the state below:
>
> t6.rc.req
> t6.rc.inflight
> // trigger rollback.
> t7.rb.req
> t7.rb.inflight
> delete all data files from t6.rc
> write to metadata table
> delete t6.rc* files from timeline
> and we crash.
>
> So, we might have a lingering rollback plan t7.rb.req and t7.rb.inflight in
> the timeline forever.
>
> This might only be an issue when we try to nuke the entire plan. Otherwise,
> t6.rc.req is never cleaned up, and the next attempt will retry it.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)