[
https://issues.apache.org/jira/browse/HDDS-9198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17832955#comment-17832955
]
Hemant Kumar commented on HDDS-9198:
------------------------------------
We found two race condition issues: HDDS-10524 and HDDS-10590.
There is still an issue with the existing way batch snapshot purge is processed.
As part of the snapshot purge, the deep clean flag of the next active snapshot,
and global and path previous of the next snapshots get updated. For this,
updatedSnapInfos and updatedPathPreviousAndGlobalSnapshots maps are maintained
in
[OMSnapshotPurgeRequest|https://github.com/apache/ozone/blob/master/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/snapshot/OMSnapshotPurgeRequest.java#L101-L104],
and then these maps are flushed sequentially in
[OMSnapshotPurgeResponse|https://github.com/apache/ozone/blob/master/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/response/snapshot/OMSnapshotPurgeResponse.java].
Flushing the two maps one after the other can corrupt the snapshot chain.
Consider an example:
Assume that, as part of the deep clean info update, snapshots are updated as
\{E -> E', F -> F', B' -> B'', G -> G'} and kept in updatedSnapInfos: [E', F',
B'', G'], and that previous snapshots are updated as \{A -> A', B -> B',
C -> C', D -> D'} and kept in updatedPathPreviousAndGlobalSnapshots: [A', B',
C', D'].
After the purge, the final snapshot list should be [A', B'', C', D', E', F',
G']. But because these maps are added to the batch sequentially, the result is
either [A', B', C', D', E', F', G'] or [A', B'', C', D', E', F', G'], depending
on which one is added to the batch first
[code|https://github.com/apache/ozone/blob/master/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/response/snapshot/OMSnapshotPurgeResponse.java#L83-L84].
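The overwrite in this example can be reproduced with a small simulation. This is only a sketch, not the Ozone code: the class name, the string version labels, and the in-memory TreeMap standing in for the snapshot table are all assumptions made for illustration. Only the two map names come from the comment above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class PurgeFlushOrderDemo {

  // Hypothetical stand-in for updatedSnapInfos: snapshot key -> version label.
  static Map<String, String> updatedSnapInfos() {
    Map<String, String> m = new LinkedHashMap<>();
    m.put("E", "E'");
    m.put("F", "F'");
    m.put("B", "B''"); // built on top of B'
    m.put("G", "G'");
    return m;
  }

  // Hypothetical stand-in for updatedPathPreviousAndGlobalSnapshots.
  static Map<String, String> updatedPathPreviousAndGlobalSnapshots() {
    Map<String, String> m = new LinkedHashMap<>();
    m.put("A", "A'");
    m.put("B", "B'");
    m.put("C", "C'");
    m.put("D", "D'");
    return m;
  }

  // Simulates a batch write: a later put for the same key overwrites the earlier one.
  @SafeVarargs
  static Map<String, String> flush(Map<String, String>... mapsInFlushOrder) {
    Map<String, String> table = new TreeMap<>();
    for (Map<String, String> m : mapsInFlushOrder) {
      table.putAll(m);
    }
    return table;
  }

  static Map<String, String> flushSnapInfosFirst() {
    return flush(updatedSnapInfos(), updatedPathPreviousAndGlobalSnapshots());
  }

  static Map<String, String> flushPreviousFirst() {
    return flush(updatedPathPreviousAndGlobalSnapshots(), updatedSnapInfos());
  }

  public static void main(String[] args) {
    // B'' is overwritten by the stale B': [A', B', C', D', E', F', G']
    System.out.println(flushSnapInfosFirst().values());
    // B'' survives: [A', B'', C', D', E', F', G']
    System.out.println(flushPreviousFirst().values());
  }
}
```

The entry for B appears in both maps, so whichever map is flushed last decides the stored version.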
The problem can persist even if the flush order of the two maps is fixed.
Ideally, the updates should be flushed in the same order in which the purge
batch is processed.
One simple fix is to change the batch request to a single snapshot purge at a
time. If we believe batch purge is more efficient in terms of Ratis
transactions, then we need to introduce a new object to maintain the order.
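One way such an order-preserving object could look is a single map that records every update in the order the batch is processed, so a later update to the same snapshot always replaces the earlier one. This is a hypothetical sketch: the class OrderedPurgeUpdates, its methods, and the interleaving of updates in demo() are assumptions, not part of Ozone. The only property taken from the example is that B' is produced before B''.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class OrderedPurgeUpdates {

  // Single structure for both kinds of update, filled in processing order.
  private final Map<String, String> updates = new LinkedHashMap<>();

  // Records the newest version of a snapshot; replaces any earlier version.
  void record(String snapshotKey, String version) {
    updates.put(snapshotKey, version);
  }

  // Writes the recorded versions into the (simulated) snapshot table.
  Map<String, String> flushInto(Map<String, String> table) {
    table.putAll(updates);
    return table;
  }

  // Replays the example with an assumed processing order.
  static Map<String, String> demo() {
    OrderedPurgeUpdates u = new OrderedPurgeUpdates();
    u.record("A", "A'");
    u.record("B", "B'");  // previous-snapshot pointer update
    u.record("C", "C'");
    u.record("D", "D'");
    u.record("E", "E'");
    u.record("F", "F'");
    u.record("B", "B''"); // deep clean flag update on top of B'
    u.record("G", "G'");
    return u.flushInto(new TreeMap<>());
  }

  public static void main(String[] args) {
    System.out.println(demo().values()); // [A', B'', C', D', E', F', G']
  }
}
```

Because there is only one entry per snapshot, the result no longer depends on which kind of update is flushed first.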
> Snapshot purge should be a atomic operation
> -------------------------------------------
>
> Key: HDDS-9198
> URL: https://issues.apache.org/jira/browse/HDDS-9198
> Project: Apache Ozone
> Issue Type: Sub-task
> Reporter: Hemant Kumar
> Assignee: Aswin Shakil
> Priority: Major
>
> After [HDDS-8665|https://issues.apache.org/jira/browse/HDDS-8665], there is a
> possibility that [snapshot
> cache|https://github.com/apache/ozone/pull/5201/files#diff-a424f5d3db1b2b8c0ffede0b757478c8ab646ea7f7990fd13f36f0346e6a73e0R102]
> gets updated but [snapshot chain
> update|https://github.com/apache/ozone/pull/5201/files#diff-a424f5d3db1b2b8c0ffede0b757478c8ab646ea7f7990fd13f36f0346e6a73e0R104]
> fails, leaving the chain in a state where a snapshot's previous snapshot
> points to something that doesn't exist, or the order is wrong.
> We need to revisit this and see if there is any race condition issue in
> snapshot purge.
> One possible solution is to make snapshot purge a single-snapshot request
> instead of a batch request.
> Another is that we may need locking in the snapshot purge request handler.