[ 
https://issues.apache.org/jira/browse/HDDS-7989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Doroszlai updated HDDS-7989:
-----------------------------------
    Status: Patch Available  (was: In Progress)

> UnhealthyReplicationProcessor retries failure without delay
> -----------------------------------------------------------
>
>                 Key: HDDS-7989
>                 URL: https://issues.apache.org/jira/browse/HDDS-7989
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: SCM
>    Affects Versions: 1.4.0
>            Reporter: Attila Doroszlai
>            Assignee: Attila Doroszlai
>            Priority: Major
>              Labels: pull-request-available
>
> {{UnhealthyReplicationProcessor#processAll}} requeues any failed task.  Such 
> tasks are attempted again in the same {{processAll}} call, so a task that 
> keeps failing is retried in a tight loop.  This can flood SCM logs until the 
> cause of the error is resolved.
> Example steps:
> # Start cluster with 5 datanodes
> # Create EC(3,2) key
> # Stop two datanodes
> # Wait until SCM starts repeatedly emitting errors for the same container
> {code}
> scm_1       | 2023-02-17 18:08:51,091 [Under Replicated Processor] WARN 
> replication.ECUnderReplicationHandler: Exception while processing for 
> creating the EC reconstruction container commands for #5.
> scm_1       | org.apache.hadoop.hdds.scm.exceptions.SCMException: No enough 
> datanodes to choose. TotalNode = 3 AvailableNode = 0 RequiredNode = 2 
> ExcludedNode = 3
> scm_1       |         at 
> org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter.chooseDatanodesInternal(SCMContainerPlacementRackScatter.java:238)
> scm_1       |         at 
> org.apache.hadoop.hdds.scm.SCMCommonPlacementPolicy.chooseDatanodes(SCMCommonPlacementPolicy.java:185)
> scm_1       |         at 
> org.apache.hadoop.hdds.scm.SCMCommonPlacementPolicy.chooseDatanodes(SCMCommonPlacementPolicy.java:127)
> scm_1       |         at 
> org.apache.hadoop.hdds.scm.container.replication.ECUnderReplicationHandler.getTargetDatanodes(ECUnderReplicationHandler.java:266)
> scm_1       |         at 
> org.apache.hadoop.hdds.scm.container.replication.ECUnderReplicationHandler.processMissingIndexes(ECUnderReplicationHandler.java:295)
> scm_1       |         at 
> org.apache.hadoop.hdds.scm.container.replication.ECUnderReplicationHandler.processAndCreateCommands(ECUnderReplicationHandler.java:174)
> scm_1       |         at 
> org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.processUnderReplicatedContainer(ReplicationManager.java:608)
> scm_1       |         at 
> org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.getDatanodeCommands(UnderReplicatedProcessor.java:58)
> scm_1       |         at 
> org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.getDatanodeCommands(UnderReplicatedProcessor.java:32)
> scm_1       |         at 
> org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processContainer(UnhealthyReplicationProcessor.java:119)
> scm_1       |         at 
> org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processAll(UnhealthyReplicationProcessor.java:93)
> scm_1       |         at 
> org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.run(UnhealthyReplicationProcessor.java:132)
> scm_1       |         at java.base/java.lang.Thread.run(Thread.java:829)
> scm_1       | 2023-02-17 18:08:51,091 [Under Replicated Processor] ERROR 
> replication.UnhealthyReplicationProcessor: Error processing Health result of 
> class: class 
> org.apache.hadoop.hdds.scm.container.replication.ContainerHealthResult$UnderReplicatedHealthResult
>  for container ContainerInfo{id=#5, state=CLOSED, 
> pipelineID=PipelineID=0ccdaf17-dc73-4974-a660-c2bb51a3402e, 
> stateEnterTime=2023-02-17T17:59:05.707Z, owner=om1}
> scm_1       | org.apache.hadoop.hdds.scm.exceptions.SCMException: No enough 
> datanodes to choose. TotalNode = 3 AvailableNode = 0 RequiredNode = 2 
> ExcludedNode = 3
> scm_1       |         at 
> org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter.chooseDatanodesInternal(SCMContainerPlacementRackScatter.java:238)
> scm_1       |         at 
> org.apache.hadoop.hdds.scm.SCMCommonPlacementPolicy.chooseDatanodes(SCMCommonPlacementPolicy.java:185)
> scm_1       |         at 
> org.apache.hadoop.hdds.scm.SCMCommonPlacementPolicy.chooseDatanodes(SCMCommonPlacementPolicy.java:127)
> scm_1       |         at 
> org.apache.hadoop.hdds.scm.container.replication.ECUnderReplicationHandler.getTargetDatanodes(ECUnderReplicationHandler.java:266)
> scm_1       |         at 
> org.apache.hadoop.hdds.scm.container.replication.ECUnderReplicationHandler.processMissingIndexes(ECUnderReplicationHandler.java:295)
> scm_1       |         at 
> org.apache.hadoop.hdds.scm.container.replication.ECUnderReplicationHandler.processAndCreateCommands(ECUnderReplicationHandler.java:174)
> scm_1       |         at 
> org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.processUnderReplicatedContainer(ReplicationManager.java:608)
> scm_1       |         at 
> org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.getDatanodeCommands(UnderReplicatedProcessor.java:58)
> scm_1       |         at 
> org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.getDatanodeCommands(UnderReplicatedProcessor.java:32)
> scm_1       |         at 
> org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processContainer(UnhealthyReplicationProcessor.java:119)
> scm_1       |         at 
> org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processAll(UnhealthyReplicationProcessor.java:93)
> scm_1       |         at 
> org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.run(UnhealthyReplicationProcessor.java:132)
> scm_1       |         at java.base/java.lang.Thread.run(Thread.java:829)
> ...
> No space left on device
> {code}
> The same messages are repeated without any delay.
> I think tasks should be collected and requeued outside of the processing loop.
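
The proposal above (collect failed tasks, requeue them after the loop) can be sketched roughly as follows. This is a minimal illustration, not the Ozone implementation or the actual patch: the class {{RequeueSketch}}, the generic queue, and the {{Predicate}} standing in for container processing are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Predicate;

// Hypothetical stand-in for UnhealthyReplicationProcessor; not the Ozone code.
class RequeueSketch {

  /**
   * Drains the queue once. Failed tasks are collected in a separate list and
   * only put back after the loop ends, so each task is attempted at most once
   * per call. Returns the number of attempts made.
   */
  static <T> int processAll(Queue<T> queue, Predicate<T> process) {
    List<T> failed = new ArrayList<>();
    int attempts = 0;
    T task;
    while ((task = queue.poll()) != null) {
      attempts++;
      boolean ok;
      try {
        ok = process.test(task);
      } catch (RuntimeException e) {
        // Treat an exception like any other failure.
        ok = false;
      }
      if (!ok) {
        // Do NOT requeue here: the loop would pick the task up again at once.
        failed.add(task);
      }
    }
    // Requeue outside the loop; the next processAll call retries these.
    queue.addAll(failed);
    return attempts;
  }

  public static void main(String[] args) {
    Queue<Integer> queue = new ArrayDeque<>(List.of(1, 5, 7));
    // Task 5 always fails, mimicking the container in the log above.
    int attempts = processAll(queue, id -> id != 5);
    System.out.println("attempts=" + attempts + " requeued=" + queue);
  }
}
```

With this shape, the delay between retries is whatever interval separates successive {{processAll}} calls, rather than zero.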



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
