[jira] [Updated] (HDDS-11452) OmSnapshotPurgeRequest is not atomic and can lead to SnapshotChain Corruption

Swaminathan Balachandran (Jira) Wed, 11 Sep 2024 08:11:27 -0700


     [ 
https://issues.apache.org/jira/browse/HDDS-11452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Swaminathan Balachandran updated HDDS-11452:
--------------------------------------------
    Description: 
OmSnapshotPurgeRequest updates the snapshot chain and also updates the cache, 
but if anything fails partway these changes are not rolled back. When a checked 
exception is thrown (anything from a proto exception to an arbitrary 
IOException), the request swallows the exception and returns an error 
response. The problem is that the snapshot info table cache is left partially 
updated and no longer coherent with the snapshot chain, and none of these 
changes are flushed to disk. On restart this could lead to snapshot chain and 
snapshot info corruption. 

The proposal here is to make the entire request atomic:

1) Update the snapshot chain and keep the updated snapshot infos in local, 
uncommitted space.

2) In case of an exception, roll back all removed snapshots by putting them 
back into the snapshot chain (this must be done in the reverse order of 
removal) and return an error response.

3) If no exception is thrown, update the snapshot info table cache.

4) Send it to the double buffer.
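
The steps above can be sketched roughly as follows. This is only an 
illustration of the stage-then-commit idea with reverse-order rollback, not 
the actual Ozone code: Chain, Removal, InfoUpdater and purge are hypothetical 
names, the chain is modelled as a plain ordered list of snapshot IDs, the 
snapshot info table cache as a Map, and step 4 (the double buffer) is left out.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class AtomicPurgeSketch {

    /** Hypothetical stand-in for the snapshot chain: an ordered list of IDs. */
    static final class Chain {
        final List<String> ids = new ArrayList<>();

        /** Removes the snapshot and returns the position it occupied. */
        int remove(String id) throws IOException {
            int index = ids.indexOf(id);
            if (index < 0) {
                throw new IOException("snapshot not in chain: " + id);
            }
            ids.remove(index);
            return index;
        }

        /** Puts a previously removed snapshot back at its old position. */
        void restore(int index, String id) {
            ids.add(index, id);
        }
    }

    /** Undo record for a single removal (hypothetical helper). */
    static final class Removal {
        final int index;
        final String id;

        Removal(int index, String id) {
            this.index = index;
            this.id = id;
        }
    }

    /** Hypothetical hook that stages snapshot info changes and may fail. */
    @FunctionalInterface
    interface InfoUpdater {
        void stage(String purgedId, Map<String, String> staged) throws IOException;
    }

    /** Returns true if the purge was committed, false if it was rolled back. */
    static boolean purge(Chain chain, Map<String, String> infoCache,
                         List<String> toPurge, InfoUpdater updater) {
        Deque<Removal> undo = new ArrayDeque<>();      // most recent removal on top
        Map<String, String> staged = new HashMap<>();  // step 1: uncommitted space
        try {
            for (String id : toPurge) {
                int index = chain.remove(id);
                undo.push(new Removal(index, id));
                updater.stage(id, staged);
            }
        } catch (IOException e) {
            // Step 2: restore in reverse order of removal so that every
            // recorded index is valid again at the moment it is used.
            while (!undo.isEmpty()) {
                Removal r = undo.pop();
                chain.restore(r.index, r.id);
            }
            return false; // caller would build the error response here
        }
        infoCache.putAll(staged); // step 3: touch the cache only on success
        return true;
    }

    public static void main(String[] args) {
        Chain chain = new Chain();
        chain.ids.addAll(List.of("s1", "s2", "s3", "s4"));
        Map<String, String> cache = new HashMap<>();
        boolean ok = purge(chain, cache, List.of("s2", "s3"), (id, staged) -> {
            if (id.equals("s3")) {
                throw new IOException("simulated failure");
            }
            staged.put(id, "purged");
        });
        System.out.println(ok + " " + chain.ids + " " + cache);
    }
}
```

In this sketch the failure on s3 leaves the chain as s1, s2, s3, s4 and the 
cache empty. Restoring in forward order instead would reinsert s2 and s3 at 
stale positions and produce s1, s3, s2, s4, which is why the order matters.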

cc: [~hemantk] [~ppogde] 

  was:
OmSnapshotPurgeRequest updates the snapshot chain and also updates the cache & 
in case of any failure. In case of checked exception thrown(This could be any 
exception ranging from proto exception or any random IOException), the request 
gobbles up the exception and returns an error response. The problem with this 
is, we have partially updated snapshot info table cache which is not coherrent 
with the snapshot chain and all these changes won't be flushed to disk. On 
restart this could lead to all sorts of snapshot chain & snapshot info 
corruption. 

The proposal here is to make the entire request atomic:

1) Update the snapshot chain & maintain the updated snapshot infos in local 
uncommitted space.

2) In case of an exception, roll back all deleted snapshots by putting it back 
to the snapshot chain(P.S. this needs to be done in the reverse order of 
removal) & return an error response.

3) If no exception is thrown, update the snapshot info table cache.

4) Send it to double buffer

cc: [~hemantk] [~ppogde] 


> OmSnapshotPurgeRequest is not atomic and can lead to SnapshotChain Corruption
> -----------------------------------------------------------------------------
>
>                 Key: HDDS-11452
>                 URL: https://issues.apache.org/jira/browse/HDDS-11452
>             Project: Apache Ozone
>          Issue Type: Sub-task
>            Reporter: Swaminathan Balachandran
>            Assignee: Swaminathan Balachandran
>            Priority: Major
>              Labels: pull-request-available
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

