[
https://issues.apache.org/jira/browse/HDDS-5267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17468923#comment-17468923
]
Ritesh H Shukla commented on HDDS-5267:
---------------------------------------
The revised approach is as follows (2 commits):
# On the DN, if a HB carries both an ICR and an FCR, merge the ICR changes into
the FCR and send the HB. Add serialization fixes around the generation of ICRs
and FCRs.
# On the SCM side, the FCR and ICR from the same DN need to be serialized
correctly. The simplest way to do this without introducing a sequence number is
to have a single thread dequeue the events for a given DN.
With these 2 changes, the DN should publish container reports to SCM correctly,
and SCM should process them in order.
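
A minimal Java sketch of the SCM-side idea in the second item, assuming reports arrive as plain Runnable handlers keyed by a datanode ID string. The class name PerDatanodeReportDispatcher and its methods are hypothetical and are not taken from the Ozone code base. Every report from a given DN is routed to the same single-threaded executor, so its queue is drained in submission order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not Ozone's actual event queue implementation.
final class PerDatanodeReportDispatcher {
    // Each worker owns one thread and one FIFO queue.
    private final List<ExecutorService> workers = new ArrayList<>();

    PerDatanodeReportDispatcher(int workerCount) {
        for (int i = 0; i < workerCount; i++) {
            workers.add(Executors.newSingleThreadExecutor());
        }
    }

    // The same datanodeId always maps to the same worker, so an ICR and an
    // FCR from one DN are handled in the order they were submitted.
    void submit(String datanodeId, Runnable reportHandler) {
        int index = Math.floorMod(datanodeId.hashCode(), workers.size());
        workers.get(index).execute(reportHandler);
    }

    // Stops accepting new reports and waits up to timeoutMillis per worker
    // for the queued ones to finish. Returns false on timeout or interrupt.
    boolean shutdownAndWait(long timeoutMillis) {
        for (ExecutorService worker : workers) {
            worker.shutdown();
        }
        boolean allDone = true;
        try {
            for (ExecutorService worker : workers) {
                allDone &= worker.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return allDone;
    }
}
```

Reports from different DNs may still share a worker, which only reduces parallelism; ordering within a single DN is preserved either way.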
> Full Container Report can remove replicas added by an Incremental Report
> ------------------------------------------------------------------------
>
> Key: HDDS-5267
> URL: https://issues.apache.org/jira/browse/HDDS-5267
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Datanode, SCM
> Affects Versions: 1.1.0
> Reporter: Stephen O'Donnell
> Assignee: Ritesh H Shukla
> Priority: Major
>
> In HDDS-5249, I highlighted an issue between Incremental and Full container
> reports. This follow-up Jira is to track the second problem mentioned in
> that Jira.
> After HDDS-5249, the report processing for a given DN on SCM is synchronised,
> so only 1 report can be processed at a time for a given DN.
> We can still have the following scenario:
> 1. FCR generated on DN, including containers up to ID 1000.
> 2. At the same time ICR generated on DN for container 1001.
> 3. The ICR is processed first on SCM, adding 1001.
> 4. The FCR is processed, and this will cause the reference to 1001 to be
> removed as it is not in the FCR.
> 5. About 60-90 seconds later, another FCR will be generated, which will
> correct the issue.
> As things stand, there is no locking on the DN to ensure that an FCR and an
> ICR cannot be generated at the same time.
> There is also no way to know whether a given ICR is contained in a given FCR.
> One way to fix this problem would be:
> 1. Introduce some locking in the DN to ensure that FCR, ICR and new container
> creation are serialized.
> 2. Introduce an increasing sequence number which is assigned to each FCR and
> ICR. If a report has a greater sequence number than another one, then it
> supersedes the smaller one.
> E.g.:
> ICR #seq=100, container=1001, FCR #seq=99. In this case, the FCR will not
> have container 1001.
> ICR #seq=99, container=1001, FCR #seq=100. In this case, the FCR is
> guaranteed to have container 1001.
> Then we need to figure out a way on the DNs to use this information. One way
> would be to attach the report sequence number to each replica, and only
> remove a replica if its sequence number is less than that of the current
> report. However, that would add some memory overhead to SCM, so it is worth
> looking into alternatives.
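
The sequence-number alternative from the issue description can be sketched roughly as follows. This is an illustration under stated assumptions, not SCM's actual data model: the class ReplicaTracker and its methods are hypothetical, it tracks the replicas of a single DN, and it stores one sequence number per replica (the memory overhead the description mentions). An FCR removes a replica only if the replica is absent from the FCR and was last reported with a smaller sequence number.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of per-replica sequence tracking for one datanode.
final class ReplicaTracker {
    // containerId -> highest sequence number of a report that listed it
    private final Map<Long, Long> lastSeenSeq = new HashMap<>();

    // An ICR adds (or refreshes) a single container replica.
    synchronized void applyIncremental(long seq, long containerId) {
        lastSeenSeq.merge(containerId, seq, Long::max);
    }

    // An FCR refreshes every listed replica, then removes only replicas
    // that are missing from it AND were last seen before this FCR.
    synchronized void applyFull(long seq, Set<Long> containerIds) {
        for (long id : containerIds) {
            lastSeenSeq.merge(id, seq, Long::max);
        }
        lastSeenSeq.entrySet().removeIf(
            e -> e.getValue() < seq && !containerIds.contains(e.getKey()));
    }

    synchronized boolean hasReplica(long containerId) {
        return lastSeenSeq.containsKey(containerId);
    }
}
```

With this rule, an ICR with seq=100 for container 1001 survives a later-processed FCR with seq=99, while a replica last seen at seq=99 is dropped by an FCR with seq=100 that does not list it.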