Sammi Chen created HDDS-5794:
--------------------------------
Summary: The misleading "No available thread in pool for past
*sencond" message in DN StateContext
Key: HDDS-5794
URL: https://issues.apache.org/jira/browse/HDDS-5794
Project: Apache Ozone
Issue Type: Bug
Reporter: Sammi Chen
Assignee: Sammi Chen
There are a lot of following messages in DN log file, which makes me think
there is something wrong with the heartbeat channel between the DN and SCM.
2021-09-28 10:59:36,871 [Datanode State Machine Thread - 0] WARN
org.apache.hadoop.ozone.container.common.statemachine.StateContext: No
available thread in pool for past 30 seconds.
2021-09-28 10:59:37,345 [Datanode State Machine Thread - 0] WARN
org.apache.hadoop.ozone.container.common.statemachine.StateContext: No
available thread in pool for past 30 seconds.
2021-09-28 10:59:37,379 [Datanode State Machine Thread - 0] WARN
org.apache.hadoop.ozone.container.common.statemachine.StateContext: No
available thread in pool for past 30 seconds.
2021-09-28 10:59:37,584 [Datanode State Machine Thread - 0] WARN
org.apache.hadoop.ozone.container.common.statemachine.StateContext: No
available thread in pool for past 30 seconds.
2021-09-28 10:59:37,598 [Datanode State Machine Thread - 0] WARN
org.apache.hadoop.ozone.container.common.statemachine.StateContext: No
available thread in pool for past 30 seconds.
2021-09-28 10:59:37,631 [Datanode State Machine Thread - 0] WARN
org.apache.hadoop.ozone.container.common.statemachine.StateContext: No
available thread in pool for past 30 seconds.
2021-09-28 10:59:37,650 [Datanode State Machine Thread - 0] WARN
org.apache.hadoop.ozone.container.common.statemachine.StateContext: No
available thread in pool for past 30 seconds.
2021-09-28 10:59:37,652 [Datanode State Machine Thread - 0] WARN
org.apache.hadoop.ozone.container.common.statemachine.StateContext: No
available thread in pool for past 30 seconds.
2021-09-28 10:59:37,668 [Datanode State Machine Thread - 0] WARN
org.apache.hadoop.ozone.container.common.statemachine.StateContext: No
available thread in pool for past 30 seconds.
2021-09-28 10:59:37,741 [Datanode State Machine Thread - 0] WARN
org.apache.hadoop.ozone.container.common.statemachine.StateContext: No
available thread in pool for past 30 seconds.
2021-09-28 10:59:37,918 [Datanode State Machine Thread - 0] WARN
org.apache.hadoop.ozone.container.common.statemachine.StateContext: No
available thread in pool for past 30 seconds.
2021-09-28 10:59:37,925 [Datanode State Machine Thread - 0] WARN
org.apache.hadoop.ozone.container.common.statemachine.StateContext: No
available thread in pool for past 30 seconds.
2021-09-28 10:59:37,930 [Datanode State Machine Thread - 0] WARN
org.apache.hadoop.ozone.container.common.statemachine.StateContext: No
available thread in pool for past 30 seconds.
2021-09-28 10:59:37,954 [Datanode State Machine Thread - 0] WARN
org.apache.hadoop.ozone.container.common.statemachine.StateContext: No
available thread in pool for past 30 seconds.
2021-09-28 10:59:37,955 [Datanode State Machine Thread - 0] WARN
org.apache.hadoop.ozone.container.common.statemachine.StateContext: No
available thread in pool for past 30 seconds.
2021-09-28 10:59:37,995 [Datanode State Machine Thread - 0] WARN
org.apache.hadoop.ozone.container.common.statemachine.StateContext: No
available thread in pool for past 30 seconds.
2021-09-28 10:59:38,005 [Datanode State Machine Thread - 0] WARN
org.apache.hadoop.ozone.container.common.statemachine.StateContext: No
available thread in pool for past 30 seconds.
2021-09-28 10:59:38,097 [Datanode State Machine Thread - 0] WARN
org.apache.hadoop.ozone.container.common.statemachine.StateContext: No
available thread in pool for past 30 seconds.
2021-09-28 10:59:38,114 [Datanode State Machine Thread - 0] WARN
org.apache.hadoop.ozone.container.common.statemachine.StateContext: No
available thread in pool for past 30 seconds.
2021-09-28 10:59:38,174 [Datanode State Machine Thread - 0] WARN
org.apache.hadoop.ozone.container.common.statemachine.StateContext: No
available thread in pool for past 30 seconds.
After add some extra debug info, I found the thing is not as the LOG shows,
the heartbeat channel is very healthy and this "No available thread in pool for
past 30 seconds" is quite misleading. The corrent part of the LOG is "No
available thread in pool", the wrong part is "for the past 30 seconds".
2021-09-29 11:28:01,257 [DatanodeStateMachineDaemonThread] INFO
org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine:
heartbeat interval : 30000, next HB 6719157366
2021-09-29 11:28:01,257 [DatanodeStateMachineDaemonThread] INFO
org.apache.hadoop.ozone.container.common.statemachine.StateContext: Starting to
execute the task 24
2021-09-29 11:28:01,257 [EndpointStateMachine task thread for
/9.186.21.245:9861 - 0 ] INFO
org.apache.hadoop.ozone.container.common.states.endpoint.HeartbeatEndpointTask:
Sending heartbeat message
2021-09-29 11:28:01,261 [EndpointStateMachine task thread for
/9.186.21.245:9861 - 0 ] INFO
org.apache.hadoop.ozone.container.common.states.endpoint.HeartbeatEndpointTask:
Received heartbeat response
2021-09-29 11:28:01,261 [DatanodeStateMachineDaemonThread] INFO
org.apache.hadoop.ozone.container.common.statemachine.StateContext: Finished
executing the task 24
2021-09-29 11:28:01,261 [DatanodeStateMachineDaemonThread] INFO
org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine:
Sleep 29996 ms
2021-09-29 11:28:05,622 [ChunkWriter-125-0] INFO
org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine:
triggerHeartbeat: interrupt stateMachineThread for ICR report
2021-09-29 11:28:05,622 [ChunkWriter-71-0] INFO
org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine:
triggerHeartbeat: interrupt stateMachineThread for ICR report
2021-09-29 11:28:05,622 [DatanodeStateMachineDaemonThread] INFO
org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine:
heartbeat interval : 30000, next HB 6719161731
2021-09-29 11:28:05,622 [ChunkWriter-165-0] INFO
org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine:
triggerHeartbeat: interrupt stateMachineThread for ICR report
2021-09-29 11:28:05,622 [ChunkWriter-53-0] INFO
org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine:
triggerHeartbeat: interrupt stateMachineThread for ICR report
2021-09-29 11:28:05,622 [DatanodeStateMachineDaemonThread] INFO
org.apache.hadoop.ozone.container.common.statemachine.StateContext: Starting to
execute the task 25
2021-09-29 11:28:05,622 [ChunkWriter-31-0] INFO
org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine:
triggerHeartbeat: interrupt stateMachineThread for ICR report
2021-09-29 11:28:05,622 [ChunkWriter-41-0] INFO
org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine:
triggerHeartbeat: interrupt stateMachineThread for ICR report
2021-09-29 11:28:05,622 [ChunkWriter-25-0] INFO
org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine:
triggerHeartbeat: interrupt stateMachineThread for ICR report
2021-09-29 11:28:05,622 [DatanodeStateMachineDaemonThread] ERROR
org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine:
execute task.wait is interrupted.
2021-09-29 11:28:05,623 [DatanodeStateMachineDaemonThread] INFO
org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine:
heartbeat interval : 30000, next HB 6719161732
2021-09-29 11:28:05,623 [DatanodeStateMachineDaemonThread] INFO
org.apache.hadoop.ozone.container.common.statemachine.StateContext: Starting to
execute the task 26
2021-09-29 11:28:05,623 [EndpointStateMachine task thread for
/9.186.21.245:9861 - 0 ] INFO
org.apache.hadoop.ozone.container.common.states.endpoint.HeartbeatEndpointTask:
Sending heartbeat message
2021-09-29 11:28:05,641 [EndpointStateMachine task thread for
/9.186.21.245:9861 - 0 ] INFO
org.apache.hadoop.ozone.container.common.states.endpoint.HeartbeatEndpointTask:
Received heartbeat response
2021-09-29 11:28:05,642 [EndpointStateMachine task thread for
/9.186.21.245:9861 - 0 ] INFO
org.apache.hadoop.ozone.container.common.states.endpoint.HeartbeatEndpointTask:
Sending heartbeat message
2021-09-29 11:28:05,652 [EndpointStateMachine task thread for
/9.186.21.245:9861 - 0 ] INFO
org.apache.hadoop.ozone.container.common.states.endpoint.HeartbeatEndpointTask:
Received heartbeat response
2021-09-29 11:28:05,652 [DatanodeStateMachineDaemonThread] INFO
org.apache.hadoop.ozone.container.common.statemachine.StateContext: Finished
executing the task 26
2021-09-29 11:28:05,652 [DatanodeStateMachineDaemonThread] INFO
org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine:
Sleep 29971 ms
2021-09-29 11:28:05,709 [ChunkWriter-73-0] INFO
org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine:
triggerHeartbeat: interrupt stateMachineThread for ICR report
2021-09-29 11:28:05,709 [DatanodeStateMachineDaemonThread] INFO
org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine:
heartbeat interval : 30000, next HB 6719161818
2021-09-29 11:28:05,709 [DatanodeStateMachineDaemonThread] INFO
org.apache.hadoop.ozone.container.common.statemachine.StateContext: Starting to
execute the task 27
2021-09-29 11:28:05,709 [DatanodeStateMachineDaemonThread] ERROR
org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine:
execute task.wait is interrupted.
2021-09-29 11:28:05,709 [DatanodeStateMachineDaemonThread] INFO
org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine:
heartbeat interval : 30000, next HB 6719161818
2021-09-29 11:28:05,709 [DatanodeStateMachineDaemonThread] WARN
org.apache.hadoop.ozone.container.common.statemachine.StateContext: No
available thread in pool for past 30 seconds.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]