sodonnel opened a new pull request #2268:
URL: https://github.com/apache/ozone/pull/2268
## What changes were proposed in this pull request?
During testing we came across an issue with the handling of incremental
container reports (ICRs) and full container reports (FCRs).
The following log shows the issue:
```
2021-05-18 13:14:15,394 DEBUG
org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Processing replica
of container #1 from datanode 945aa180-5cff-4298-a8ad-8197542e4562{ip:
172.27.108.136, host: quasar-nqdywv-7.quasar-nqdywv.root.hwx.site, ports:
[REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856,
STANDALONE=9859], networkLocation: /default, certSerialId: null,
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
2021-05-18 13:14:15,394 DEBUG
org.apache.hadoop.hdds.scm.container.IncrementalContainerReportHandler:
Processing replica of container #1001 from datanode
945aa180-5cff-4298-a8ad-8197542e4562{ip: 172.27.108.136, host:
quasar-nqdywv-7.quasar-nqdywv.root.hwx.site, ports: [REPLICATION=9886,
RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859],
networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE,
persistedOpStateExpiryEpochSec: 0}
2021-05-18 13:14:15,394 DEBUG
org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Processing replica
of container #2 from datanode 945aa180-5cff-4298-a8ad-8197542e4562{ip:
172.27.108.136, host: quasar-nqdywv-7.quasar-nqdywv.root.hwx.site, ports:
[REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856,
STANDALONE=9859], networkLocation: /default, certSerialId: null,
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
2021-05-18 13:14:15,394 DEBUG
org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Processing replica
of container #3 from datanode 945aa180-5cff-4298-a8ad-8197542e4562{ip:
172.27.108.136, host: quasar-nqdywv-7.quasar-nqdywv.root.hwx.site, ports:
[REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856,
STANDALONE=9859], networkLocation: /default, certSerialId: null,
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
2021-05-18 13:14:15,394 DEBUG
org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Processing replica
of container #4 from datanode 945aa180-5cff-4298-a8ad-8197542e4562{ip:
172.27.108.136, host: quasar-nqdywv-7.quasar-nqdywv.root.hwx.site, ports:
[REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856,
STANDALONE=9859], networkLocation: /default, certSerialId: null,
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
...
```
In the above log, SCM is processing both an ICR and an FCR for the same
Datanode at the same time. The FCR does not contain container #1001.
The FCR starts first, and it takes a snapshot of the containers on the node
via NodeManager.
Then it starts processing the containers one by one.
The ICR then starts, and it adds #1001 to the ContainerManager and to the
NodeManager.
When the FCR completes, it replaces the list of containers in NodeManager
with those in the FCR.
At this point, container #1001 is in the ContainerManager, but it is not
listed against the node in NodeManager.
The next FCR would normally correct this, but in this case the node goes dead
first. The dead node handler runs and uses the list of containers in
NodeManager to remove all containers for the node. As #1001 is not listed, the
dead node handler does not remove it. This means the container will never be
seen as under-replicated, as 3 copies will exist forever in the
ContainerManager.
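To make the sequence above concrete, here is a minimal single-threaded sketch that replays the interleaving step by step. It is not SCM code: the two maps are hypothetical stand-ins for the ContainerManager and NodeManager state, and the class name, method names and the datanode ID `dn-1` are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ReportRaceSketch {
  // Hypothetical stand-in for ContainerManager: container ID -> datanodes holding a replica.
  private final Map<Long, Set<String>> replicasByContainer = new HashMap<>();
  // Hypothetical stand-in for NodeManager: datanode -> container IDs listed against it.
  private final Map<String, Set<Long>> containersByNode = new HashMap<>();

  private void addReplica(long containerId, String datanode) {
    replicasByContainer.computeIfAbsent(containerId, k -> new HashSet<>()).add(datanode);
  }

  // Mimics dead node handling: only containers listed against the node are cleaned up.
  private void handleDeadNode(String datanode) {
    Set<Long> listed = containersByNode.remove(datanode);
    if (listed == null) {
      return;
    }
    for (long containerId : listed) {
      Set<String> holders = replicasByContainer.get(containerId);
      if (holders != null) {
        holders.remove(datanode);
      }
    }
  }

  public Set<String> replicasOf(long containerId) {
    return replicasByContainer.getOrDefault(containerId, Set.of());
  }

  // Replays the problematic ordering of FCR, ICR and dead node handling.
  public static ReportRaceSketch replayInterleaving() {
    ReportRaceSketch scm = new ReportRaceSketch();
    String dn = "dn-1";
    Set<Long> fullReport = Set.of(1L, 2L, 3L, 4L);

    // 1. The FCR starts and processes its containers one by one.
    for (long containerId : fullReport) {
      scm.addReplica(containerId, dn);
    }
    // 2. The ICR interleaves and records #1001 in both structures.
    scm.addReplica(1001L, dn);
    scm.containersByNode.computeIfAbsent(dn, k -> new HashSet<>()).add(1001L);
    // 3. The FCR completes and replaces the node's list with its own contents.
    scm.containersByNode.put(dn, new HashSet<>(fullReport));
    // 4. The node goes dead before the next FCR arrives.
    scm.handleDeadNode(dn);
    return scm;
  }

  public static void main(String[] args) {
    ReportRaceSketch scm = replayInterleaving();
    System.out.println("replicas of #1: " + scm.replicasOf(1L));       // prints "replicas of #1: []"
    System.out.println("replicas of #1001: " + scm.replicasOf(1001L)); // prints "replicas of #1001: [dn-1]"
  }
}
```

In this toy model the replica of #1001 on the dead node is never removed, which mirrors the stale replica described above.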
This issue is quite tricky to fully fix. There are two issues:
1. Parallel processing of ICR and FCR can lead to data inconsistency between
the ContainerManager and NodeManager. This is what caused the bug above.
2. An FCR can wipe out a reference to a container that was recently sent in an
ICR but is not included in the FCR.
The second issue is less serious because the next FCR will fix the problem,
and FCRs are produced approximately every 60 seconds by default.
We can fix problem 1 quite easily by synchronising on the datanode when
processing FCRs and ICRs. That ensures the data inconsistency cannot
happen.
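As a rough illustration of synchronising on the datanode, the sketch below serialises the two report types with one lock object per datanode. It is a sketch under stated assumptions, not the code in this patch: the class, the `lockFor` helper, the `afterSnapshot` hook and the string datanode ID are hypothetical, and the real handlers may obtain their lock differently.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

public class PerDatanodeLockSketch {
  // Hypothetical stand-ins for ContainerManager and NodeManager state.
  private final Map<Long, Set<String>> replicasByContainer = new ConcurrentHashMap<>();
  private final Map<String, Set<Long>> containersByNode = new ConcurrentHashMap<>();
  // One lock object per datanode (assumption: the real code may lock on another object).
  private final Map<String, Object> locks = new ConcurrentHashMap<>();

  private Object lockFor(String datanode) {
    return locks.computeIfAbsent(datanode, k -> new Object());
  }

  private void addReplica(long containerId, String datanode) {
    replicasByContainer.computeIfAbsent(containerId, k -> ConcurrentHashMap.newKeySet())
        .add(datanode);
  }

  // FCR: snapshot, process, drop missing replicas and replace the list, all under the lock.
  public void processFullReport(String datanode, Set<Long> reported, Runnable afterSnapshot) {
    synchronized (lockFor(datanode)) {
      Set<Long> missing = new HashSet<>(containersByNode.getOrDefault(datanode, Set.of()));
      afterSnapshot.run(); // demo hook: lets another thread try to interleave here
      for (long containerId : reported) {
        addReplica(containerId, datanode);
      }
      missing.removeAll(reported);
      for (long containerId : missing) {
        Set<String> holders = replicasByContainer.get(containerId);
        if (holders != null) {
          holders.remove(datanode);
        }
      }
      Set<Long> replacement = ConcurrentHashMap.newKeySet();
      replacement.addAll(reported);
      containersByNode.put(datanode, replacement);
    }
  }

  // ICR: update both structures under the same per-datanode lock.
  public void processIncrementalReport(String datanode, long containerId) {
    synchronized (lockFor(datanode)) {
      addReplica(containerId, datanode);
      containersByNode.computeIfAbsent(datanode, k -> ConcurrentHashMap.newKeySet())
          .add(containerId);
    }
  }

  public Set<Long> containersOn(String datanode) {
    return containersByNode.getOrDefault(datanode, Set.of());
  }

  public Set<String> replicasOf(long containerId) {
    return replicasByContainer.getOrDefault(containerId, Set.of());
  }

  // Starts an FCR, then an ICR that arrives while the FCR still holds the lock.
  public static PerDatanodeLockSketch runDemo() {
    PerDatanodeLockSketch scm = new PerDatanodeLockSketch();
    CountDownLatch fcrHoldsLock = new CountDownLatch(1);
    Thread fcr = new Thread(() ->
        scm.processFullReport("dn-1", Set.of(1L, 2L, 3L, 4L), fcrHoldsLock::countDown));
    Thread icr = new Thread(() -> {
      try {
        fcrHoldsLock.await(); // wait until the FCR is inside its critical section
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      scm.processIncrementalReport("dn-1", 1001L); // blocks until the FCR releases the lock
    });
    fcr.start();
    icr.start();
    try {
      fcr.join();
      icr.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return scm;
  }

  public static void main(String[] args) {
    PerDatanodeLockSketch scm = runDemo();
    System.out.println("listed=" + scm.containersOn("dn-1").contains(1001L)
        + ", replica=" + scm.replicasOf(1001L).contains("dn-1")); // prints "listed=true, replica=true"
  }
}
```

Because the ICR cannot enter its critical section until the FCR has replaced the node's list, the two structures agree on #1001 afterwards. The lock does not address issue 2: if the ICR runs to completion before an FCR that omits the container, the FCR still drops the reference, although both structures stay consistent with each other.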
This PR is for issue 1, and we should probably create a follow-up issue for 2.
## What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-5249
## How was this patch tested?
Added a new test to reproduce the race condition and verified it passes
after the code change.