guihecheng commented on pull request #2471: URL: https://github.com/apache/ozone/pull/2471#issuecomment-889574064
> Thanks for taking a look at this issue @guihecheng. I am not sure the interrupt reported in the Jira came from `DatanodeStateMachine#triggerHeartbeat`. Below is my current understanding, let me know what you think:
>
> * `DatanodeStateMachine#triggerHeartbeat` Is only called when the container services have started.

If I read the Ratis code correctly, this is not quite true. A `triggerHeartbeat` call can potentially be made in `ContainerStateMachine#initialize`, which is called during RaftServer construction, before `OzoneContainer#start`. This is why we have the `stateMachineThread != null` check in `triggerHeartbeat`.

> * Container services are not started until `OzoneContainer#start` is called by `VersionEndpointTask#call`
> * `VersionEndpointTask#call` will not occur until the datanode endpoint tasks are registered by `context#execute` inside the main loop of `DatanodeStateMachine#start`.
> * At this point, pre-finalize upgrade actions must have finished.
>
> **If this is correct** then the interrupt could not have come from `DatanodeStateMachine#triggerHeartbeat`, and we need a new fix to determine where the interrupt came from.
>
> **Else if this is not correct** then we need to figure out why `DatanodeStateMachine#triggerHeartbeat`, and possibly all container services, are being called/started while the pre-finalize upgrade actions are running. If this is really happening it is an error and the fix will be to have all pre-finalize actions run before container services are started.
>
> Also, those ratis log messages shared in the jira that occur before the pre-finalize actions appear to come from normal raft server construction. I do not think they indicate that the raft server was actually started when they were printed, since that should not happen until `OzoneContainer#start` is called I believe.

Indeed, the raft server was not started, but there is a potential `triggerHeartbeat` call in the construction code, as noted above.
There is also a point to check in the logs:

```
2021-07-27 20:11:59,507 [pool-92215-thread-1] INFO org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine: group-9034B3A2567B: The snapshot info is null. Setting the last applied indexto:(t:0, i:~)
```

This message comes from `ContainerStateMachine#loadSnapshot`, which is only called in `ContainerStateMachine#initialize`, where the `triggerHeartbeat` call can potentially happen. Since this log entry was printed in the same second that the exception was thrown, I suspect that the `interrupt()` comes from the place touched by the patch.

> Do you have any way to reproduce this issue or help verify where the interrupt came from?

Thanks @errose28 for the detailed check. We only encountered this problem once, during a non-rolling upgrade, where about 4 of 40 nodes reported it. I can't reproduce it in my test deployment, since it is hard to get a thread to catch the `interrupt()` sent by another thread at exactly the right moment. Some inline replies are above.
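
For illustration, here is a minimal, self-contained Java sketch of the suspected mechanism. It is not the Ozone or Ratis code: `StateMachineSketch`, `workerSawInterrupt`, and the latch-based "blocking action" are hypothetical stand-ins. It assumes that `triggerHeartbeat` interrupts `stateMachineThread` whenever that field is non-null, which is what the null check suggests, and it shows that the registered thread receives the interrupt even when it is blocked in unrelated work.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

class HeartbeatInterruptSketch {

    // Hypothetical stand-in for the datanode state machine (not the real class).
    static class StateMachineSketch {
        volatile Thread stateMachineThread; // stays null until a thread is registered

        // Assumed behaviour: interrupt the registered thread, or do nothing if none.
        boolean triggerHeartbeat() {
            Thread t = stateMachineThread;
            if (t != null) {
                t.interrupt();
                return true;
            }
            return false;
        }
    }

    // Runs a worker that blocks in "unrelated" work, calls triggerHeartbeat from
    // the calling thread, and reports whether the worker observed an interrupt.
    public static boolean workerSawInterrupt(boolean threadRegistered) {
        StateMachineSketch sm = new StateMachineSketch();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        AtomicBoolean sawInterrupt = new AtomicBoolean(false);

        Thread worker = new Thread(() -> {
            started.countDown();
            try {
                release.await(); // stands in for a blocking pre-finalize action
                // The interrupt may land just before the latch opens, in which
                // case await() returns normally; check the flag as well.
                sawInterrupt.set(Thread.interrupted());
            } catch (InterruptedException e) {
                sawInterrupt.set(true);
            }
        });

        if (threadRegistered) {
            sm.stateMachineThread = worker;
        }
        worker.start();
        try {
            started.await();
            sm.triggerHeartbeat(); // e.g. called from another component's init path
            release.countDown();
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("driver thread interrupted", e);
        }
        return sawInterrupt.get();
    }

    public static void main(String[] args) {
        System.out.println("registered=true  -> interrupted=" + workerSawInterrupt(true));
        System.out.println("registered=false -> interrupted=" + workerSawInterrupt(false));
    }
}
```

In this sketch the null check only protects against the thread not existing yet. Once a thread is registered, any early `triggerHeartbeat` call interrupts it regardless of what it is currently doing, which matches the suspicion described above.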
