lfxy opened a new pull request, #7876:
URL: https://github.com/apache/hadoop/pull/7876

   In our cluster with observer NNs, when the standby NN (SNN) performs a 
checkpoint and sends the fsimage to the other NNs, if the transfer to any one 
NN fails due to network anomalies, an NN restart, or another exception, the 
standby considers the whole checkpoint failed, does not update its 
lastCheckpointTime, and retries the checkpoint. 
   However, the active or observer NNs that successfully received the fsimage 
have already updated their lastCheckpointTime, while the NN that failed to 
receive it has not, so lastCheckpointTime becomes inconsistent across the NNs. 
As a result, subsequent checkpoints repeatedly fail to send the fsimage to 
some or all of the active and observer NNs, because they do not satisfy the 
DFS_NAMENODE_CHECKPOINT_PERIOD_KEY condition. 
   The SNN then fails every checkpoint and keeps retrying. I think the SNN 
should consider the checkpoint successful and update its lastCheckpointTime if 
the fsimage transfer succeeds on at least half of the NNs.
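
   A minimal sketch of the proposed rule, not the actual patch. The class and 
method names (CheckpointQuorumSketch, uploadQuorumReached, 
nextLastCheckpointTime) are hypothetical and exist only for illustration; the 
sketch assumes each fsimage upload result is already available as a boolean.

```java
import java.util.List;

// Hypothetical illustration only; these names do not exist in Hadoop.
final class CheckpointQuorumSketch {

    // Returns true when the fsimage upload succeeded on at least half of the
    // target NNs. An empty list (no remote NN to upload to) also returns true.
    static boolean uploadQuorumReached(List<Boolean> uploadResults) {
        long succeeded = uploadResults.stream()
                .filter(Boolean::booleanValue)
                .count();
        // "At least half": succeeded / total >= 1/2, written without division.
        return succeeded * 2 >= uploadResults.size();
    }

    // Returns the value the SNN would store as lastCheckpointTime: the current
    // time if the quorum rule is met, otherwise the previous value unchanged.
    static long nextLastCheckpointTime(long lastCheckpointTime,
                                       long now,
                                       List<Boolean> uploadResults) {
        return uploadQuorumReached(uploadResults) ? now : lastCheckpointTime;
    }
}
```

   With three target NNs and one failed upload, this rule still advances 
lastCheckpointTime, so the SNN stops retrying a checkpoint that most NNs have 
already accepted.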


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

