[ 
https://issues.apache.org/jira/browse/HDDS-5267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17468923#comment-17468923
 ] 

Ritesh H Shukla commented on HDDS-5267:
---------------------------------------

The revised approach is as follows (2 commits):
 # On the DN, if a HB contains both an ICR and an FCR, merge the ICR changes 
into the FCR before sending the HB. Also add serialization fixes around the 
generation of ICRs and FCRs.
 # On the SCM side, the FCR and ICR from the same DN need to be serialized 
correctly. The simplest way to do this without introducing a sequence number 
is to have a single thread dequeue the events for a given DN.

With these 2 changes, the DN should publish container reports to SCM 
correctly, and SCM should process them in the order they were sent.
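
To illustrate the second change, here is a minimal sketch of per-DN 
serialization. It is not the Ozone code: the class name 
PerDatanodeReportDispatcher, its methods, and the use of a String datanode ID 
are all hypothetical. It only shows the idea that hashing the DN ID onto a 
fixed set of single-threaded executors gives FIFO processing per DN without a 
sequence number.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not the Ozone implementation: every report from one
// datanode is routed to the same single-threaded executor, so reports from
// that datanode are handled in the order they were submitted.
class PerDatanodeReportDispatcher {
  private final ExecutorService[] workers;

  PerDatanodeReportDispatcher(int poolSize) {
    workers = new ExecutorService[poolSize];
    for (int i = 0; i < poolSize; i++) {
      workers[i] = Executors.newSingleThreadExecutor();
    }
  }

  // The same datanode ID always maps to the same worker thread.
  void submit(String datanodeId, Runnable reportHandler) {
    int index = Math.floorMod(datanodeId.hashCode(), workers.length);
    workers[index].execute(reportHandler);
  }

  // Stops accepting new reports and waits for the queued ones to finish.
  // Returns false on timeout or interruption.
  boolean shutdownAndWait(long timeoutSeconds) {
    for (ExecutorService worker : workers) {
      worker.shutdown();
    }
    try {
      boolean done = true;
      for (ExecutorService worker : workers) {
        done &= worker.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
      }
      return done;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

Reports from different DNs can still be processed in parallel on different 
workers. Ordering is only guaranteed relative to submission order, so the 
thread that receives heartbeats must submit the reports in arrival order.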

> Full Container Report can remove replicas added by an Incremental Report
> ------------------------------------------------------------------------
>
>                 Key: HDDS-5267
>                 URL: https://issues.apache.org/jira/browse/HDDS-5267
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Datanode, SCM
>    Affects Versions: 1.1.0
>            Reporter: Stephen O'Donnell
>            Assignee: Ritesh H Shukla
>            Priority: Major
>
> In HDDS-5249, I highlighted an issue between Incremental and Full container 
> reports. This follow-up Jira is to track the second problem mentioned in 
> that Jira.
> After HDDS-5249, the report processing for a given DN on SCM is synchronised 
> so only 1 report can be processed at a time for a given DN.
> We can still have the following scenario:
> 1. FCR generated on DN, including containers up to ID 1000.
> 2. At the same time ICR generated on DN for container 1001.
> 3. The ICR is processed first on SCM, adding 1001.
> 4. The FCR is processed, and this will cause the reference to 1001 to be 
> removed as it is not in the FCR.
> 5. About 60 - 90 seconds later another FCR will be generated which will 
> correct the issue.
> As things stand, there is no locking on the DN to ensure that an FCR and an 
> ICR cannot be generated at the same time.
> There is also no way to know that a given ICR is contained in a given FCR or 
> not.
> One way to fix this problem would be:
> 1. Introduce some locking in the DN to ensure that FCR, ICR and new container 
> creation are serialized.
> 2. Introduce an increasing sequence number which is assigned to each FCR and 
> ICR. If a report has a greater sequence number than another one, then it 
> supersedes the smaller one.
> Eg:
>   ICR #seq=100, container=1001, FCR #seq=99. In this case, the FCR will not 
> have container 1001.
>   ICR #seq=99, container=1001, FCR #seq=100. In this case, the FCR is 
> guaranteed to have container 1001
> Then we need to figure out a way on the DNs to use this information. One way 
> would be to attach the report sequence number to each replica, and only 
> remove a replica if its sequence number is less than the current report 
> sequence number. However, that would add some memory overhead to SCM, so it 
> is worth looking into alternatives.
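
The sequence-number alternative from the description can be sketched as 
follows. This is only an illustration under assumptions: the class 
SequencedReplicaTracker and its methods are hypothetical names, and it tracks 
the replicas of a single DN. Each replica remembers the sequence number of the 
last report that mentioned it, and an FCR removes a replica only if the 
replica is missing from the FCR and was last reported with a lower sequence 
number.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the sequence-number idea for one datanode.
class SequencedReplicaTracker {
  // containerId -> sequence number of the last report that mentioned it.
  // This extra long per replica is the memory overhead noted above.
  private final Map<Long, Long> replicaSeq = new HashMap<>();

  // An ICR adds (or refreshes) a single container replica.
  void applyIncremental(long seq, long containerId) {
    replicaSeq.merge(containerId, seq, Long::max);
  }

  // An FCR refreshes every container it lists, then removes replicas that
  // are absent from it AND were last reported before this FCR was generated.
  void applyFull(long seq, Set<Long> containerIds) {
    for (long id : containerIds) {
      replicaSeq.merge(id, seq, Long::max);
    }
    replicaSeq.entrySet().removeIf(
        e -> !containerIds.contains(e.getKey()) && e.getValue() < seq);
  }

  boolean hasReplica(long containerId) {
    return replicaSeq.containsKey(containerId);
  }
}
```

With ICR #seq=100 for container 1001 applied before FCR #seq=99, the replica 
for 1001 is kept because its sequence number is greater than that of the FCR. 
With ICR #seq=99 and FCR #seq=100, an FCR that omits 1001 removes it.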


