guihecheng commented on pull request #2471:
URL: https://github.com/apache/ozone/pull/2471#issuecomment-889574064


   > Thanks for taking a look at this issue @guihecheng. I am not sure the 
interrupt reported in the Jira came from 
`DatanodeStateMachine#triggerHeartbeat`. Below is my current understanding, let 
me know what you think:
   > 
   > * `DatanodeStateMachine#triggerHeartbeat` is only called when the 
container services have started.
   
   If I read the Ratis code correctly, this is not quite true. A 
`triggerHeartbeat` call can potentially be made from 
`ContainerStateMachine#initialize`, which is called during RaftServer 
construction, before `OzoneContainer#start`.
   This is why `triggerHeartbeat` has the `stateMachineThread != null` check.
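
   A minimal sketch of the kind of null guard described above. This is not the Ozone source: the class `TriggerHeartbeatGuardSketch`, the setter, and the `demo` helper are hypothetical names used only for illustration, and the current thread stands in for the state machine thread.

```java
/**
 * Simplified illustration of a null-guarded heartbeat trigger.
 * All names here are hypothetical; this is not the Ozone implementation.
 */
final class TriggerHeartbeatGuardSketch {
  // Stays null until the state machine thread has been assigned.
  private volatile Thread stateMachineThread;

  /** Interrupts the state machine thread if one has been assigned. */
  boolean triggerHeartbeat() {
    Thread t = stateMachineThread;
    if (t != null) {
      t.interrupt();
      return true;
    }
    return false; // called too early: nothing to interrupt
  }

  void setStateMachineThread(Thread t) {
    stateMachineThread = t;
  }

  /** Returns {triggeredBeforeAssign, triggeredAfterAssign, interruptFlagWasSet}. */
  static boolean[] demo() {
    TriggerHeartbeatGuardSketch dsm = new TriggerHeartbeatGuardSketch();
    boolean before = dsm.triggerHeartbeat();

    // Assumption for the demo: the current thread plays the state machine thread.
    dsm.setStateMachineThread(Thread.currentThread());
    boolean after = dsm.triggerHeartbeat();

    // Thread.interrupted() reads and clears the current thread's interrupt flag.
    boolean flagged = Thread.interrupted();
    return new boolean[] {before, after, flagged};
  }
}
```

   In this sketch, a call made before the thread is assigned is a no-op. Once the thread exists, the same call sets its interrupt flag regardless of what that thread is doing at the time.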
   
   > * Container services are not started until `OzoneContainer#start` is 
called by `VersionEndpointTask#call`
   > * `VersionEndpointTask#call` will not occur until the datanode endpoint 
tasks are registered by `context#execute` inside the main loop of 
`DatanodeStateMachine#start`.
   > * At this point, pre-finalize upgrade actions must have finished.
   > 
   > **If this is correct** then the interrupt could not have come from 
`DatanodeStateMachine#triggerHeartbeat`, and we need a new fix to determine 
where the interrupt came from.
   > 
   > **Else if this is not correct** then we need to figure out why 
`DatanodeStateMachine#triggerHeartbeat`, and possibly all container services, 
are being called/started while the pre-finalize upgrade actions are running. If 
this is really happening it is an error and the fix will be to have all 
pre-finalize actions run before container services are started.
   > 
   > Also, those ratis log messages shared in the jira that occur before the 
pre-finalize actions appear to come from normal raft server construction. I do 
not think they indicate that the raft server was actually started when they 
were printed, since that should not happen until `OzoneContainer#start` is 
called I believe.
   
   Indeed, the raft server was not started, but as noted above there is a 
potential `triggerHeartbeat` call in the construction code.
   There is also one thing worth checking in the logs:
   ```
   2021-07-27 20:11:59,507 [pool-92215-thread-1] INFO org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine: group-9034B3A2567B: The snapshot info is null. Setting the last applied indexto:(t:0, i:~)
   ```
   This message comes from `ContainerStateMachine#loadSnapshot`, which is 
only called in `ContainerStateMachine#initialize`, where the `triggerHeartbeat` 
call can potentially happen.
   
   Since the log entry above was printed in the same second that the exception 
was thrown, I suspect that the interrupt() comes from the code path addressed 
by the patch.
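
   As background for this suspicion, the sketch below shows a general property of Java interrupts: an interrupt sent while a thread is doing non-blocking work stays pending and surfaces as an `InterruptedException` at the thread's next blocking call, which may be in unrelated code. The class name, the spin loop, and the `Thread.sleep` stand-in for a later step are all assumptions made for illustration, not taken from Ozone or Ratis.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical demo: a pending interrupt surfaces at a later blocking call. */
final class PendingInterruptSketch {

  static String run() {
    AtomicBoolean interruptSent = new AtomicBoolean(false);
    AtomicReference<String> outcome = new AtomicReference<>("none");

    Thread worker = new Thread(() -> {
      // Non-blocking phase: the interrupt arrives here, but nothing observes it.
      while (!interruptSent.get()) {
        Thread.yield();
      }
      try {
        // Stand-in for a later, unrelated blocking step.
        Thread.sleep(10);
        outcome.set("completed");
      } catch (InterruptedException e) {
        // The interrupt flag was already set, so sleep() throws immediately.
        outcome.set("interrupted in later step");
      }
    }, "worker-sketch");

    worker.start();
    worker.interrupt();      // plays the role of another thread's interrupt
    interruptSent.set(true); // let the worker move on to its blocking step

    try {
      worker.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the caller's interrupt status
    }
    return outcome.get();
  }
}
```

   Calling `PendingInterruptSketch.run()` returns `"interrupted in later step"`. The demo forces the ordering with a flag. In a real deployment the two threads are not coordinated this way, so whether the interrupt lands on a thread that is still doing other work depends on timing.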
   
   > Do you have any way to reproduce this issue or help verify where the 
interrupt came from?
   
   Thanks @errose28 for the detailed check. We only encountered this problem 
once, during a non-rolling upgrade, where about 4 of 40 nodes reported it. I 
can't reproduce it in my test deployment, since it is hard for a thread to 
catch the interrupt() sent by another thread at exactly the right moment.
   I have also added some inline replies above.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


