Attila Doroszlai created HDDS-7989:
--------------------------------------
Summary: UnhealthyReplicationProcessor retries failure without delay
Key: HDDS-7989
URL: https://issues.apache.org/jira/browse/HDDS-7989
Project: Apache Ozone
Issue Type: Sub-task
Components: SCM
Affects Versions: 1.4.0
Reporter: Attila Doroszlai
{{UnhealthyReplicationProcessor#processAll}} requeues any task that fails, and the requeued task is attempted again within the same {{processAll}} call, with no delay before the retry. This can flood the SCM log until the underlying cause of the failure is resolved.
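In simplified form, the pattern looks like the sketch below. This is an illustration only, not the actual {{UnhealthyReplicationProcessor}} code; the queue element type and the handler are placeholders.

{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

public class ImmediateRetrySketch {

  // Simplified illustration: a failed item is pushed straight back onto the
  // queue that this same pass is still draining, so it is retried right away
  // and the same error is logged again with no delay in between.
  static void processAll(Queue<Long> containerQueue) {
    while (!containerQueue.isEmpty()) {
      Long containerId = containerQueue.poll();
      try {
        processContainer(containerId);      // may throw, e.g. an SCMException
      } catch (Exception e) {
        System.err.println("Error processing container #" + containerId
            + ": " + e.getMessage());
        containerQueue.add(containerId);    // re-attempted in this same call
      }
    }
  }

  // Placeholder for the real handler; always fails here, mimicking the
  // "No enough datanodes to choose" error from the log below.
  static void processContainer(Long containerId) throws Exception {
    throw new Exception("No enough datanodes to choose");
  }

  public static void main(String[] args) {
    Queue<Long> queue = new ArrayDeque<>();
    queue.add(5L);
    processAll(queue);  // never returns: the failure is retried in a tight loop
  }
}
{code}

Because the failing item goes straight back onto the queue being drained, the catch block fires again immediately, which is what produces the unthrottled stream of identical log messages.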
Steps to reproduce:
# Start a cluster with 5 datanodes
# Create an EC(3,2) key
# Stop two of the datanodes
# Wait until SCM starts emitting the same error for the same container
{code}
scm_1 | 2023-02-17 18:08:51,091 [Under Replicated Processor] WARN replication.ECUnderReplicationHandler: Exception while processing for creating the EC reconstruction container commands for #5.
scm_1 | org.apache.hadoop.hdds.scm.exceptions.SCMException: No enough datanodes to choose. TotalNode = 3 AvailableNode = 0 RequiredNode = 2 ExcludedNode = 3
scm_1 | at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter.chooseDatanodesInternal(SCMContainerPlacementRackScatter.java:238)
scm_1 | at org.apache.hadoop.hdds.scm.SCMCommonPlacementPolicy.chooseDatanodes(SCMCommonPlacementPolicy.java:185)
scm_1 | at org.apache.hadoop.hdds.scm.SCMCommonPlacementPolicy.chooseDatanodes(SCMCommonPlacementPolicy.java:127)
scm_1 | at org.apache.hadoop.hdds.scm.container.replication.ECUnderReplicationHandler.getTargetDatanodes(ECUnderReplicationHandler.java:266)
scm_1 | at org.apache.hadoop.hdds.scm.container.replication.ECUnderReplicationHandler.processMissingIndexes(ECUnderReplicationHandler.java:295)
scm_1 | at org.apache.hadoop.hdds.scm.container.replication.ECUnderReplicationHandler.processAndCreateCommands(ECUnderReplicationHandler.java:174)
scm_1 | at org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.processUnderReplicatedContainer(ReplicationManager.java:608)
scm_1 | at org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.getDatanodeCommands(UnderReplicatedProcessor.java:58)
scm_1 | at org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.getDatanodeCommands(UnderReplicatedProcessor.java:32)
scm_1 | at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processContainer(UnhealthyReplicationProcessor.java:119)
scm_1 | at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processAll(UnhealthyReplicationProcessor.java:93)
scm_1 | at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.run(UnhealthyReplicationProcessor.java:132)
scm_1 | at java.base/java.lang.Thread.run(Thread.java:829)
scm_1 | 2023-02-17 18:08:51,091 [Under Replicated Processor] ERROR replication.UnhealthyReplicationProcessor: Error processing Health result of class: class org.apache.hadoop.hdds.scm.container.replication.ContainerHealthResult$UnderReplicatedHealthResult for container ContainerInfo{id=#5, state=CLOSED, pipelineID=PipelineID=0ccdaf17-dc73-4974-a660-c2bb51a3402e, stateEnterTime=2023-02-17T17:59:05.707Z, owner=om1}
scm_1 | org.apache.hadoop.hdds.scm.exceptions.SCMException: No enough datanodes to choose. TotalNode = 3 AvailableNode = 0 RequiredNode = 2 ExcludedNode = 3
scm_1 | at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter.chooseDatanodesInternal(SCMContainerPlacementRackScatter.java:238)
scm_1 | at org.apache.hadoop.hdds.scm.SCMCommonPlacementPolicy.chooseDatanodes(SCMCommonPlacementPolicy.java:185)
scm_1 | at org.apache.hadoop.hdds.scm.SCMCommonPlacementPolicy.chooseDatanodes(SCMCommonPlacementPolicy.java:127)
scm_1 | at org.apache.hadoop.hdds.scm.container.replication.ECUnderReplicationHandler.getTargetDatanodes(ECUnderReplicationHandler.java:266)
scm_1 | at org.apache.hadoop.hdds.scm.container.replication.ECUnderReplicationHandler.processMissingIndexes(ECUnderReplicationHandler.java:295)
scm_1 | at org.apache.hadoop.hdds.scm.container.replication.ECUnderReplicationHandler.processAndCreateCommands(ECUnderReplicationHandler.java:174)
scm_1 | at org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.processUnderReplicatedContainer(ReplicationManager.java:608)
scm_1 | at org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.getDatanodeCommands(UnderReplicatedProcessor.java:58)
scm_1 | at org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.getDatanodeCommands(UnderReplicatedProcessor.java:32)
scm_1 | at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processContainer(UnhealthyReplicationProcessor.java:119)
scm_1 | at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processAll(UnhealthyReplicationProcessor.java:93)
scm_1 | at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.run(UnhealthyReplicationProcessor.java:132)
scm_1 | at java.base/java.lang.Thread.run(Thread.java:829)
...
No space left on device
{code}
The same messages are repeated without any delay between attempts.
I think failed tasks should be collected during the loop and only requeued after the processing loop has finished, so that they are retried on the next {{processAll}} run instead of immediately.
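A minimal sketch of that approach, under the same simplifying assumptions as above (illustrative names and types, not the actual SCM API):

{code:java}
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class DeferredRequeueSketch {

  // Failures are collected while the queue is drained and only requeued after
  // the loop, so a permanently failing container is attempted at most once per
  // processAll() call and retried on the next scheduled run, not immediately.
  static void processAll(Queue<Long> containerQueue) {
    List<Long> failed = new ArrayList<>();
    while (!containerQueue.isEmpty()) {
      Long containerId = containerQueue.poll();
      try {
        processContainer(containerId);   // may throw, e.g. an SCMException
      } catch (Exception e) {
        System.err.println("Error processing container #" + containerId
            + ": " + e.getMessage());
        failed.add(containerId);         // remember, but do not retry yet
      }
    }
    // Requeue outside the loop: the next pass, after the processor's normal
    // interval, picks these up again.
    containerQueue.addAll(failed);
  }

  // Same placeholder handler as in the earlier sketch.
  static void processContainer(Long containerId) throws Exception {
    throw new Exception("No enough datanodes to choose");
  }

  public static void main(String[] args) throws InterruptedException {
    Queue<Long> queue = new ArrayDeque<>();
    queue.add(5L);
    // Two passes with a pause in between: only two error lines are printed,
    // instead of an unbounded stream.
    processAll(queue);
    Thread.sleep(1000);
    processAll(queue);
  }
}
{code}

With the requeue deferred until after the loop, a container that keeps failing is logged at most once per {{processAll}} invocation, so the rate of the repeated messages is bounded by the processor's run interval rather than a tight retry loop.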