[
https://issues.apache.org/jira/browse/HDDS-8390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Siyao Meng updated HDDS-8390:
-----------------------------
Description:
We need proper synchronization between Snapshot delete/GC and other
Snapshot jobs, e.g. reads from Snapshots and SnapDiff. SnapDiff is a
particularly important case since it can be a long-running job, and Snapshot
delete/GC can kick in in the middle of the job.
We should also have uniform behavior in the cluster in case of a failover and
concurrent SnapDiff/Deletes. It should not happen that a leader OM node
returns a certain result to a client but, after a failover, the new OM leader
returns a different result.
---
Thus, to prevent a client from getting a partial SnapDiff result without
realizing it, and to avoid explicitly holding a lock, we want to use an
approach similar to optimistic locking: check whether the snapshot is still
ACTIVE towards the end of the request lifetime, when the SnapDiff service has
already collected all the batch entries in a buffer. See the attachment for a
timeline of the potential race condition:
[^35fdc3bd-cd0c-40f3-8fd7-2d8a8dc4643d.pdf]
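To illustrate the idea, here is a minimal sketch of the "validate at the end" pattern described above. It is not the Ozone implementation: the class, enum, and method names (SnapDiffOptimisticCheck, SnapshotRegistry, diff, requireActive) are hypothetical, and the in-memory map only stands in for the real snapshot metadata. Note that a small window still remains between the final check and the response reaching the client.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names are taken from the Ozone code base.
final class SnapDiffOptimisticCheck {

  enum SnapshotStatus { ACTIVE, DELETED }

  // Stand-in for snapshot metadata. A delete/GC job may flip a status at any time.
  static final class SnapshotRegistry {
    private final Map<String, SnapshotStatus> statuses = new ConcurrentHashMap<>();

    void create(String name) { statuses.put(name, SnapshotStatus.ACTIVE); }

    void markDeleted(String name) { statuses.put(name, SnapshotStatus.DELETED); }

    boolean isActive(String name) { return statuses.get(name) == SnapshotStatus.ACTIVE; }
  }

  static final class SnapshotNotActiveException extends RuntimeException {
    SnapshotNotActiveException(String message) { super(message); }
  }

  // Runs the diff without holding a lock, then validates before returning.
  static List<String> diff(SnapshotRegistry registry, String from, String to,
                           Supplier<List<String>> collectEntries) {
    // Cheap early check so an already-deleted snapshot fails fast.
    requireActive(registry, from, to);

    // Long-running phase: collect all diff entries into a buffer, lock-free.
    List<String> buffer = new ArrayList<>(collectEntries.get());

    // Optimistic validation: if either snapshot was deleted in the meantime,
    // the buffer may be partial, so fail instead of returning it.
    requireActive(registry, from, to);
    return buffer;
  }

  private static void requireActive(SnapshotRegistry registry, String... names) {
    for (String name : names) {
      if (!registry.isActive(name)) {
        throw new SnapshotNotActiveException("snapshot " + name + " is no longer ACTIVE");
      }
    }
  }

  public static void main(String[] args) {
    SnapshotRegistry registry = new SnapshotRegistry();
    registry.create("snap1");
    registry.create("snap2");

    // No concurrent delete: the buffered result is returned.
    System.out.println(diff(registry, "snap1", "snap2",
        () -> Arrays.asList("+ key1", "- key2")));

    // Simulate delete/GC kicking in while entries are being collected.
    try {
      diff(registry, "snap1", "snap2", () -> {
        registry.markDeleted("snap1");
        return Arrays.asList("+ key1");
      });
    } catch (SnapshotNotActiveException e) {
      System.out.println("diff rejected: " + e.getMessage());
    }
  }
}
```

With this shape, a client either gets a complete result or an explicit error it can retry on, rather than a silently truncated diff.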
was:
We need proper synchronization between Snapshot delete/GC and other
Snapshot jobs, e.g. reads from Snapshots and SnapDiff. SnapDiff is a
particularly important case since it can be a long-running job, and Snapshot
delete/GC can kick in in the middle of the job.
We should also have uniform behavior in the cluster in case of a failover and
concurrent SnapDiff/Deletes. It should not happen that a leader OM node
returns a certain result to a client but, after a failover, the new OM leader
returns a different result.
> Synchronization between Snapshot Deletes/GC and other Snapshot jobs
> (read/diff)
> -------------------------------------------------------------------------------
>
> Key: HDDS-8390
> URL: https://issues.apache.org/jira/browse/HDDS-8390
> Project: Apache Ozone
> Issue Type: Sub-task
> Components: Snapshot
> Reporter: Prashant Pogde
> Assignee: Siyao Meng
> Priority: Major
> Attachments: 35fdc3bd-cd0c-40f3-8fd7-2d8a8dc4643d.pdf
>