[ 
https://issues.apache.org/jira/browse/HDDS-8390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siyao Meng updated HDDS-8390:
-----------------------------
    Description: 
We need proper synchronization between Snapshot delete/GC and other 
Snapshot jobs, e.g. reads from Snapshots and SnapDiff. SnapDiff is a 
particularly important case, since it can be a long-running job and Snapshot 
delete/GC can kick in in the middle of it. 

We should also have uniform behavior across the cluster in case of a failover 
with concurrent SnapDiff/deletes. A leader OM node must not return one result 
to a client while, after a failover, the new OM leader returns a different 
result.

---

Thus, to prevent the client from getting a partial SnapDiff result without 
even realizing it, and to avoid explicitly holding a lock, we would want to 
use an approach similar to optimistic locking: check whether the snapshot is 
still ACTIVE towards the end of the request lifetime, when the SnapDiff 
service has already collected all the batch entries in a buffer. See the 
attachment for a timeline of the potential race condition:  
[^35fdc3bd-cd0c-40f3-8fd7-2d8a8dc4643d.pdf] 
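
The validate-at-the-end idea can be sketched roughly as follows. This is a 
minimal illustration, not the actual Ozone implementation: the class 
OptimisticSnapDiffSketch, the SnapshotStatus enum (only ACTIVE is named in 
this description; DELETED is assumed), the exception type and the 
supplier-based signature are all hypothetical. It also assumes that a SnapDiff 
involves two snapshots and that both need to be re-checked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Illustrative sketch only; none of these names come from the Ozone code base. */
final class OptimisticSnapDiffSketch {

  /** Hypothetical states; the description only names ACTIVE. */
  enum SnapshotStatus { ACTIVE, DELETED }

  /** Hypothetical error telling the caller that the buffered result was discarded. */
  static final class SnapshotNoLongerActiveException extends RuntimeException {
    SnapshotNoLongerActiveException(String message) {
      super(message);
    }
  }

  /**
   * Collects the diff without holding a lock, then validates that both
   * snapshots are still ACTIVE before handing the buffer to the caller.
   */
  static List<String> diffWithValidation(
      Supplier<SnapshotStatus> fromSnapshotStatus,
      Supplier<SnapshotStatus> toSnapshotStatus,
      Supplier<List<String>> collectDiffEntries) {
    // Step 1: do the (possibly long-running) work lock-free and buffer it.
    List<String> buffer = new ArrayList<>(collectDiffEntries.get());

    // Step 2: optimistic validation at the end of the request lifetime.
    // If a delete/GC raced with step 1, the buffer may be partial, so fail
    // loudly instead of returning it.
    if (fromSnapshotStatus.get() != SnapshotStatus.ACTIVE
        || toSnapshotStatus.get() != SnapshotStatus.ACTIVE) {
      throw new SnapshotNoLongerActiveException(
          "Snapshot was deleted while SnapDiff was running; result discarded");
    }
    return buffer;
  }
}
```

With this shape, a delete that lands while entries are being collected causes 
the request to fail visibly rather than return a partial result. The sketch 
relies on the assumption that snapshot data is only reclaimed after the status 
has left ACTIVE; otherwise the final check would not be sufficient.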

  was:
We need to have proper synchronization between Snapshot delete/GC and other 
Snapshot jobs e.g. reads from Snapshots and Snapdiff.  Snapdiff is particularly 
important case since it could be a long running job and in the middle of the 
job, Snapshot delete/GC can kick in. 

We should also have a uniform behavior in the cluster in case of a failover and 
concurrent Snap-diff/Deletes. It should not happen that a leader OM node 
returns certain result to a client but after a failover the new OM leader 
returns different result.


> Synchronization between Snapshot Deletes/GC and other Snapshot jobs 
> (read/diff)
> -------------------------------------------------------------------------------
>
>                 Key: HDDS-8390
>                 URL: https://issues.apache.org/jira/browse/HDDS-8390
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: Snapshot
>            Reporter: Prashant Pogde
>            Assignee: Siyao Meng
>            Priority: Major
>         Attachments: 35fdc3bd-cd0c-40f3-8fd7-2d8a8dc4643d.pdf
>
>
> We need proper synchronization between Snapshot delete/GC and other 
> Snapshot jobs, e.g. reads from Snapshots and SnapDiff. SnapDiff is a 
> particularly important case, since it can be a long-running job and Snapshot 
> delete/GC can kick in in the middle of it. 
> We should also have uniform behavior across the cluster in case of a failover 
> with concurrent SnapDiff/deletes. A leader OM node must not return one result 
> to a client while, after a failover, the new OM leader returns a different 
> result.
> ---
> Thus, to prevent the client from getting a partial SnapDiff result without 
> even realizing it, and to avoid explicitly holding a lock, we would want to 
> use an approach similar to optimistic locking: check whether the snapshot is 
> still ACTIVE towards the end of the request lifetime, when the SnapDiff 
> service has already collected all the batch entries in a buffer. See the 
> attachment for a timeline of the potential race condition:  
> [^35fdc3bd-cd0c-40f3-8fd7-2d8a8dc4643d.pdf] 



