[jira] [Created] (HDDS-5249) Race Condition between Full and Incremental Container Reports

Stephen O'Donnell (Jira) Wed, 19 May 2021 05:59:17 -0700

Stephen O'Donnell created HDDS-5249:
---------------------------------------


             Summary: Race Condition between Full and Incremental Container Reports
                 Key: HDDS-5249
                 URL: https://issues.apache.org/jira/browse/HDDS-5249
             Project: Apache Ozone
          Issue Type: Bug
          Components: SCM
    Affects Versions: 1.1.0
            Reporter: Stephen O'Donnell
            Assignee: Stephen O'Donnell


During testing we came across an issue with the handling of Incremental 
Container Reports (ICR) and Full Container Reports (FCR).

The following log shows the issue:

{code}
2021-05-18 13:14:15,394 DEBUG 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Processing replica 
of container #1 from datanode 945aa180-5cff-4298-a8ad-8197542e4562{ip: 
172.27.108.136, host: quasar-nqdywv-7.quasar-nqdywv.root.hwx.site, ports: 
[REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, 
STANDALONE=9859], networkLocation: /default, certSerialId: null, 
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}


2021-05-18 13:14:15,394 DEBUG 
org.apache.hadoop.hdds.scm.container.IncrementalContainerReportHandler: 
Processing replica of container #1001 from datanode 
945aa180-5cff-4298-a8ad-8197542e4562{ip: 172.27.108.136, host: 
quasar-nqdywv-7.quasar-nqdywv.root.hwx.site, ports: [REPLICATION=9886, 
RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], 
networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, 
persistedOpStateExpiryEpochSec: 0}


2021-05-18 13:14:15,394 DEBUG 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Processing replica 
of container #2 from datanode 945aa180-5cff-4298-a8ad-8197542e4562{ip: 
172.27.108.136, host: quasar-nqdywv-7.quasar-nqdywv.root.hwx.site, ports: 
[REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, 
STANDALONE=9859], networkLocation: /default, certSerialId: null, 
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
2021-05-18 13:14:15,394 DEBUG 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Processing replica 
of container #3 from datanode 945aa180-5cff-4298-a8ad-8197542e4562{ip: 
172.27.108.136, host: quasar-nqdywv-7.quasar-nqdywv.root.hwx.site, ports: 
[REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, 
STANDALONE=9859], networkLocation: /default, certSerialId: null, 
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
2021-05-18 13:14:15,394 DEBUG 
org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Processing replica 
of container #4 from datanode 945aa180-5cff-4298-a8ad-8197542e4562{ip: 
172.27.108.136, host: quasar-nqdywv-7.quasar-nqdywv.root.hwx.site, ports: 
[REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, 
STANDALONE=9859], networkLocation: /default, certSerialId: null, 
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
...
{code}

In the above log, SCM is processing both an ICR and an FCR for the same 
Datanode at the same time. The FCR does not contain container #1001.

The FCR starts first, and it takes a snapshot of the containers on the node via 
NodeManager.

Then it starts processing the containers one by one.

The ICR then starts while the FCR is still being processed, and it adds #1001 
to the ContainerManager and to the NodeManager.

When the FCR completes, it replaces the list of containers in NodeManager with 
those in the FCR.

At this point, container #1001 is in the ContainerManager, but it is not listed 
against the node in NodeManager.
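
The interleaving described above can be replayed deterministically in a small 
toy model. This is only a sketch: the two maps stand in for the NodeManager 
and ContainerManager views, and every name in it (ReportRaceSketch, 
nodeToContainers, containerToReplicas, replayRace) is hypothetical and not 
taken from the Ozone code base.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: these maps stand in for SCM state and are NOT Ozone APIs.
class ReportRaceSketch {
    // NodeManager-like view: datanode -> container IDs listed against it.
    static final Map<String, Set<Long>> nodeToContainers = new HashMap<>();
    // ContainerManager-like view: container ID -> datanodes holding a replica.
    static final Map<Long, Set<String>> containerToReplicas = new HashMap<>();

    static void addReplica(String dn, long id) {
        containerToReplicas.computeIfAbsent(id, k -> new HashSet<>()).add(dn);
    }

    // Replays the steps in order and returns the containers that the
    // ContainerManager-like view holds for dn but the NodeManager-like view
    // no longer lists.
    static Set<Long> replayRace(String dn) {
        nodeToContainers.clear();
        containerToReplicas.clear();
        nodeToContainers.put(dn, new HashSet<>());
        Set<Long> fullReport = new HashSet<>(Arrays.asList(1L, 2L, 3L, 4L));

        // 1. The FCR starts (its snapshot of the node's list is omitted here).
        // 2. The FCR processes its replicas one by one.
        for (long id : fullReport) {
            addReplica(dn, id);
        }

        // 3. The ICR for #1001 is handled while the FCR is still in flight.
        addReplica(dn, 1001L);
        nodeToContainers.get(dn).add(1001L);

        // 4. The FCR completes and overwrites the node's list with its own.
        nodeToContainers.put(dn, new HashSet<>(fullReport));

        Set<Long> orphaned = new TreeSet<>();
        for (Map.Entry<Long, Set<String>> e : containerToReplicas.entrySet()) {
            boolean listed = nodeToContainers.get(dn).contains(e.getKey());
            if (e.getValue().contains(dn) && !listed) {
                orphaned.add(e.getKey());
            }
        }
        return orphaned;
    }

    public static void main(String[] args) {
        System.out.println("Orphaned: " + replayRace("dn-1"));
    }
}
```

In this model the only orphaned container is #1001, which matches the state 
described above: known to the ContainerManager, but not listed against the 
node in NodeManager.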

Normally the next FCR would correct this, but suppose the node goes dead 
before that FCR arrives. The dead node handler runs and uses the list of 
containers in NodeManager to remove all container replicas for the node. As 
#1001 is not listed, its replica is not removed by the dead node handler. This 
means the container will never be seen as under-replicated, as 3 copies will 
exist forever in the ContainerManager.
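
The effect of the dead node cleanup on the inconsistent state can be sketched 
in the same toy style. Again, this is an illustration under assumptions: the 
class and method names (DeadNodeCleanupSketch, handleDeadNode, 
replicaCountsAfterDeadNode), the datanode names and the three-replica layout 
are hypothetical, and the cleanup logic is a simplification of what is 
described above, not the actual SCM code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: names and structure are hypothetical, not Ozone APIs.
class DeadNodeCleanupSketch {
    // Removes dn as a replica holder, but only for the containers that the
    // NodeManager-like map lists against dn.
    static void handleDeadNode(String dn,
                               Map<String, Set<Long>> nodeToContainers,
                               Map<Long, Set<String>> containerToReplicas) {
        Set<Long> listed = nodeToContainers.remove(dn);
        if (listed == null) {
            return;
        }
        for (Long id : listed) {
            Set<String> holders = containerToReplicas.get(id);
            if (holders != null) {
                holders.remove(dn);
            }
        }
    }

    // Builds the post-race state, kills dn-1, and returns replica counts.
    static Map<Long, Integer> replicaCountsAfterDeadNode() {
        Map<String, Set<Long>> nodeToContainers = new HashMap<>();
        Map<Long, Set<String>> containerToReplicas = new HashMap<>();

        // Both containers have three replicas in the ContainerManager-like view.
        for (long id : new long[] {1L, 1001L}) {
            containerToReplicas.put(id,
                new HashSet<>(Arrays.asList("dn-1", "dn-2", "dn-3")));
        }
        // After the race, dn-1's list is missing #1001.
        nodeToContainers.put("dn-1", new HashSet<>(Arrays.asList(1L)));
        nodeToContainers.put("dn-2", new HashSet<>(Arrays.asList(1L, 1001L)));
        nodeToContainers.put("dn-3", new HashSet<>(Arrays.asList(1L, 1001L)));

        handleDeadNode("dn-1", nodeToContainers, containerToReplicas);

        Map<Long, Integer> counts = new HashMap<>();
        for (Map.Entry<Long, Set<String>> e : containerToReplicas.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(replicaCountsAfterDeadNode());
    }
}
```

In this model, container #1 drops to two replicas and would be picked up as 
under-replicated, while #1001 still shows three replicas even though one of 
them lived on the dead node.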




