[
https://issues.apache.org/jira/browse/HDDS-9134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Siddhant Sangwan updated HDDS-9134:
-----------------------------------
Summary: gRPC-based replication can get stuck forever if the receiver is
not available (was: Decommissioning does not complete even after 40 mins)
> gRPC-based replication can get stuck forever if the receiver is not available
> -----------------------------------------------------------------------------
>
> Key: HDDS-9134
> URL: https://issues.apache.org/jira/browse/HDDS-9134
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Datanode
> Reporter: Sumit Agrawal
> Assignee: Sumit Agrawal
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.4.0
>
>
> Decommissioning of a DN does not complete even after 40 minutes, since 7
> containers are still in an under-replicated state.
> We are seeing this issue across multiple runs for some of the decommissioning
> test cases.
>
> *SCM logs:*
> {noformat}
> 2023-08-04 00:51:13,994 INFO [IPC Server handler 3 on
> 9860]-org.apache.hadoop.hdds.scm.node.NodeDecommissionManager: Starting
> Decommission for node
> b4d59f24-753c-40a0-935a-94495b700694(quasar-xvihtz-6.quasar-xvihtz.root.hwx.site/172.27.173.136)
> 2023-08-04 00:51:13,994 INFO
> [EventQueue-HealthyReadonlyToHealthyNodeForReadOnlyHealthyToHealthyNodeHandler]-org.apache.hadoop.hdds.scm.node.ReadOnlyHealthyToHealthyNodeHandler:
> Datanode
> b4d59f24-753c-40a0-935a-94495b700694(quasar-xvihtz-6.quasar-xvihtz.root.hwx.site/172.27.173.136)
> moved to HEALTHY state.
> 2023-08-04 00:51:13,994 INFO
> [EventQueue-HealthyReadonlyToHealthyNodeForReadOnlyHealthyToHealthyNodeHandler]-org.apache.hadoop.hdds.scm.pipeline.BackgroundPipelineCreator:
> trigger a one-shot run on RatisPipelineUtilsThread.
> 2023-08-04 00:51:13,996 WARN [RatisPipelineUtilsThread -
> 0]-org.apache.hadoop.hdds.scm.pipeline.PipelinePlacementPolicy: Pipeline
> creation failed due to no sufficient healthy datanodes. Required 3. Found 2.
> Excluded 6.
> 2023-08-04 00:51:18,083 INFO [IPC Server handler 51 on
> 9861]-org.apache.hadoop.hdds.scm.node.SCMNodeManager: Scheduling a command to
> update the operationalState persisted on
> b4d59f24-753c-40a0-935a-94495b700694(quasar-xvihtz-6.quasar-xvihtz.root.hwx.site/172.27.173.136)
> as the reported value (IN_SERVICE, 0) does not match the value stored in SCM
> (DECOMMISSIONING, 0)
> 2023-08-04 00:53:00,505 INFO
> [DatanodeAdminManager-0]-org.apache.hadoop.hdds.scm.node.DatanodeAdminMonitorImpl:
>
> b4d59f24-753c-40a0-935a-94495b700694(quasar-xvihtz-6.quasar-xvihtz.root.hwx.site/172.27.173.136)
> has 136 sufficientlyReplicated, 128 underReplicated and 4 unhealthy
> containers
> 2023-08-04 00:53:00,505 INFO
> [DatanodeAdminManager-0]-org.apache.hadoop.hdds.scm.node.DatanodeAdminMonitorImpl:
> There are 1 nodes tracked for decommission and maintenance. 0 pending nodes.
> 2023-08-04 01:34:00,502 INFO
> [DatanodeAdminManager-0]-org.apache.hadoop.hdds.scm.node.DatanodeAdminMonitorImpl:
> Under Replicated Container #26169 Container State: CLOSED, Replicas: (Count:
> 5, Healthy: 4, Decommission: 1, PendingAdd: 1), ReplicationConfig:
> EC{rs-3-2-1024k}, RemainingMaintenanceRedundancy: 1;
> Replicas{ContainerReplica{containerID=#26169, state=CLOSED,
> datanodeDetails=1a2bc52e-8694-4e07-ba14-78c5f91e1e32(quasar-xvihtz-3.quasar-xvihtz.root.hwx.site/172.27.25.10),
> placeOfBirth=1a2bc52e-8694-4e07-ba14-78c5f91e1e32, sequenceId=0, keyCount=4,
> bytesUsed=252,replicaIndex=4,
> isEmpty=false},ContainerReplica{containerID=#26169, state=CLOSED,
> datanodeDetails=7f516e1e-980b-421d-9eb7-43889e33346b(quasar-xvihtz-1.quasar-xvihtz.root.hwx.site/172.27.114.66),
> placeOfBirth=7f516e1e-980b-421d-9eb7-43889e33346b, sequenceId=0, keyCount=4,
> bytesUsed=252,replicaIndex=1,
> isEmpty=false},ContainerReplica{containerID=#26169, state=CLOSED,
> datanodeDetails=b4d59f24-753c-40a0-935a-94495b700694(quasar-xvihtz-6.quasar-xvihtz.root.hwx.site/172.27.173.136),
> placeOfBirth=ee6926fa-1969-4e9a-bd43-e328c3b21a2f, sequenceId=0, keyCount=4,
> bytesUsed=252,replicaIndex=5,
> isEmpty=false},ContainerReplica{containerID=#26169, state=CLOSED,
> datanodeDetails=240fe27d-3c00-4413-a775-6b8894980d8c(quasar-xvihtz-4.quasar-xvihtz.root.hwx.site/172.27.186.70),
> placeOfBirth=240fe27d-3c00-4413-a775-6b8894980d8c, sequenceId=0, keyCount=4,
> bytesUsed=0,replicaIndex=2,
> isEmpty=false},ContainerReplica{containerID=#26169, state=CLOSED,
> datanodeDetails=53b8df38-70c4-459f-a3c9-83fce7d947e5(quasar-xvihtz-5.quasar-xvihtz.root.hwx.site/172.27.103.128),
> placeOfBirth=53b8df38-70c4-459f-a3c9-83fce7d947e5, sequenceId=0, keyCount=4,
> bytesUsed=0,replicaIndex=3, isEmpty=false}}
> 2023-08-04 01:34:00,502 INFO
> [DatanodeAdminManager-0]-org.apache.hadoop.hdds.scm.node.DatanodeAdminMonitorImpl:
>
> b4d59f24-753c-40a0-935a-94495b700694(quasar-xvihtz-6.quasar-xvihtz.root.hwx.site/172.27.173.136)
> has 256 sufficientlyReplicated, 7 underReplicated and 0 unhealthy containers
> 2023-08-04 01:34:00,502 INFO
> [DatanodeAdminManager-0]-org.apache.hadoop.hdds.scm.node.DatanodeAdminMonitorImpl:
> There are 1 nodes tracked for decommission and maintenance. 0 pending nodes.
> 2023-08-04 01:34:25,934 INFO [IPC Server handler 44 on
> 9860]-org.apache.hadoop.hdds.scm.node.NodeDecommissionManager: Queued node
> b4d59f24-753c-40a0-935a-94495b700694(quasar-xvihtz-6.quasar-xvihtz.root.hwx.site/172.27.173.136)
> for recommission
> 2023-08-04 01:34:30,501 INFO
> [DatanodeAdminManager-0]-org.apache.hadoop.hdds.scm.node.DatanodeAdminMonitorImpl:
> Recommissioned node
> b4d59f24-753c-40a0-935a-94495b700694(quasar-xvihtz-6.quasar-xvihtz.root.hwx.site/172.27.173.136)
> 2023-08-04 01:34:30,501 INFO
> [EventQueue-HealthyReadonlyToHealthyNodeForReadOnlyHealthyToHealthyNodeHandler]-org.apache.hadoop.hdds.scm.node.ReadOnlyHealthyToHealthyNodeHandler:
> Datanode
> b4d59f24-753c-40a0-935a-94495b700694(quasar-xvihtz-6.quasar-xvihtz.root.hwx.site/172.27.173.136)
> moved to HEALTHY state.{noformat}
> Decommissioning was submitted at 00:51:13, and the DatanodeAdminMonitorImpl
> initially identified 128 under-replicated and 4 unhealthy containers. At
> 01:34:00, more than 40 minutes later, there were still 7 under-replicated
> containers left.
> The test case then aborted the decommissioning command and recommissioned the
> DN.
>
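The retitled summary points at the underlying mechanism: a replication sender that waits on the receiver with no deadline can block indefinitely once the receiver becomes unavailable, leaving the container under-replicated forever. A minimal sketch of that pattern (plain Python with threading.Event as a hypothetical stand-in for the receiver's acknowledgement, not Ozone's actual Java/gRPC replication code):

```python
import threading

# Hypothetical stand-in for a replication sender waiting on the receiver's
# acknowledgement. If the receiver is gone, the event is never set.
ack = threading.Event()

# Without a deadline, the wait blocks forever:
#   ack.wait()  # stuck indefinitely when the receiver never responds
#
# With a deadline, the caller regains control and can fail the transfer,
# letting the replication manager retry with another target:
got_ack = ack.wait(timeout=0.1)  # returns False after 0.1 s with no ack
status = "ack received" if got_ack else "timed out; failing replication task"
print(status)
```

In gRPC terms, the analogous fix is to attach a deadline to the replication call so an unreachable receiver surfaces as a DEADLINE_EXCEEDED error instead of an indefinite hang.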
--
This message was sent by Atlassian Jira
(v8.20.10#820010)