[jira] [Updated] (HUDI-8886) Fix leftover/lingering rollbacks for table services

sivabalan narayanan (Jira) Fri, 17 Jan 2025 10:30:22 -0800


     [ 
https://issues.apache.org/jira/browse/HUDI-8886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


sivabalan narayanan updated HUDI-8886:
--------------------------------------
    Description: 
When a table service fails and is re-attempted, we trigger a rollback and then 
re-attempt the table service. With clustering, we have a way to nuke the 
entire clustering plan: 
[https://github.com/apache/hudi/blob/baf141abbd6da022c66fa518588e34452a6902b4/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java#L142]
 

 

But there is a chance we end up in the state below:

t6.rc.req
t6.rc.inflight
// trigger rollback
t7.rb.req
t7.rb.inflight
delete all data files from t6.rc
write to metadata table
delete t6.rc* files from timeline
and we crash.

 

So, we might have a lingering rollback plan t7.rb.req and t7.rb.inflight in the 
timeline forever. 

 

This might only be an issue when we try to nuke the entire plan. Otherwise, 
t6.rc.req is never cleaned up and the next attempt will retry it. 
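
The sequence above can be replayed with a small toy model to show why the rollback ends up orphaned. This is only a sketch, not Hudi code: the Timeline class, the find_lingering_rollbacks helper, the ".completed" suffix, and the rollback-to-target mapping are all hypothetical names made up for illustration, and the data-file deletion and metadata table write are not modeled.

```python
# Toy model of the timeline from the description. NOT Hudi's API: every class,
# function, and file suffix below is a hypothetical stand-in.


class Timeline:
    def __init__(self):
        self.files = []  # instant files in creation order, e.g. "t6.rc.req"

    def create(self, name):
        self.files.append(name)

    def delete_instant(self, prefix):
        # Remove every file that belongs to one instant, e.g. all "t6.rc.*".
        self.files = [f for f in self.files if not f.startswith(prefix + ".")]

    def has_instant(self, prefix):
        return any(f.startswith(prefix + ".") for f in self.files)


def run_until_crash(timeline):
    """Replay the steps from the description and stop where the crash happens."""
    timeline.create("t6.rc.req")
    timeline.create("t6.rc.inflight")
    # The clustering attempt failed, so a rollback of t6 is triggered.
    timeline.create("t7.rb.req")
    timeline.create("t7.rb.inflight")
    # (deleting t6 data files and writing to the metadata table are not modeled)
    timeline.delete_instant("t6.rc")
    # Crash here: the rollback is never marked complete.


def find_lingering_rollbacks(timeline, rollback_targets):
    """Return pending rollbacks whose target instant is gone from the timeline.

    rollback_targets maps a rollback instant to the instant it rolls back. In
    this sketch it stands in for whatever the rollback plan records.
    """
    lingering = []
    for rb, target in rollback_targets.items():
        completed = f"{rb}.completed" in timeline.files  # hypothetical suffix
        pending = timeline.has_instant(rb) and not completed
        if pending and not timeline.has_instant(target):
            lingering.append(rb)
    return lingering


timeline = Timeline()
run_until_crash(timeline)
lingering = find_lingering_rollbacks(timeline, {"t7.rb": "t6.rc"})
print(timeline.files)  # ['t7.rb.req', 't7.rb.inflight']
print(lingering)       # ['t7.rb']
```

In this model, once the t6.rc* files are gone nothing in the timeline refers to t6 any more, so no later attempt has a reason to pick up t7.rb, which matches the lingering state described above.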

 

 

 

 

  was:
When a table service fails and is re-attempted, we trigger a rollback and then 
re-attempt the table service. With clustering, we have a way to nuke the 
entire clustering plan: 
[https://github.com/apache/hudi/blob/baf141abbd6da022c66fa518588e34452a6902b4/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java#L142]
 

 

But there is a chance we end up in the state below:

t6.rc.req
t6.rc.inflight
// trigger rollback
t7.rb.req
t7.rb.inflight
delete all data files from t6.rc
write to metadata table
delete t6.rc* files from timeline
and we crash.

 

So, we might have a lingering rollback plan t7.rb.req and t7.rb.inflight in the 
timeline forever. 

 

 

 


> Fix leftover/lingering rollbacks for table services
> ---------------------------------------------------
>
>                 Key: HUDI-8886
>                 URL: https://issues.apache.org/jira/browse/HUDI-8886
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: table-service
>            Reporter: sivabalan narayanan
>            Priority: Major
>
> When a table service fails and is re-attempted, we trigger a rollback 
> and then re-attempt the table service. With clustering, we have a way to 
> nuke the entire clustering plan: 
> [https://github.com/apache/hudi/blob/baf141abbd6da022c66fa518588e34452a6902b4/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java#L142]
>  
>  
> But there is a chance we end up in the state below:
>  
> t6.rc.req
> t6.rc.inflight 
> // trigger rollback. 
> t7.rb.req
> t7.rb.inflight
> delete all data files from t6.rc
> write to metadata table 
> delete t6.rc* files from timeline 
> and we crash. 
>  
> So, we might have a lingering rollback plan t7.rb.req and t7.rb.inflight in 
> the timeline forever. 
>  
> This might only be an issue when we try to nuke the entire plan. Otherwise, 
> t6.rc.req is never cleaned up and the next attempt will retry it. 
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
