lfxy opened a new pull request, #7845: URL: https://github.com/apache/hadoop/pull/7845
Our HDFS federation cluster holds more than 500 PB, and one NS contains over 600 million files. A single checkpoint takes nearly two hours. We found that checkpoints frequently fail because the standby NameNode cannot put the fsimage to the active NameNode, which leads to repeated checkpoints. We have configured dfs.recent.image.check.enabled=true.

After debugging, we found the cause: the standby NN updates lastCheckpointTime with the start time of the checkpoint rather than the end time. In our cluster, the standby NN's lastCheckpointTime is approximately 80 minutes earlier than the active NN's lastCheckpointTime.

The standby NN starts the next checkpoint once the interval since its own lastCheckpointTime exceeds dfs.namenode.checkpoint.period. Because the active NN's lastCheckpointTime is later than the standby NN's, the interval measured on the active NN is still less than dfs.namenode.checkpoint.period at that point. The active NN therefore rejects the fsimage upload, so the checkpoint fails and is retried.
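The mismatch described above can be illustrated with a minimal, time-only sketch. It is not the Hadoop implementation: the class and method names are hypothetical, the 6-hour period is an assumed value chosen for illustration, and the model ignores transaction counts as well as the time spent saving and transferring the image. It only shows how two nodes comparing the same period against different lastCheckpointTime values can disagree.

```java
import java.time.Duration;

/** Simplified, time-only model of the two interval checks (hypothetical names). */
class CheckpointSkewSketch {

    /** Standby side: a checkpoint is due once the period has elapsed since its own last checkpoint time. */
    static boolean standbyCheckpointDue(long nowSec, long standbyLastSec, long periodSec) {
        return nowSec - standbyLastSec >= periodSec;
    }

    /** Active side: an uploaded image is accepted only if the period has elapsed since the active's last checkpoint time. */
    static boolean activeAcceptsImage(long nowSec, long activeLastSec, long periodSec) {
        return nowSec - activeLastSec >= periodSec;
    }

    public static void main(String[] args) {
        long period = Duration.ofHours(6).getSeconds();   // assumed checkpoint period, for illustration only
        long skew = Duration.ofMinutes(80).getSeconds();  // skew reported in the description above

        long standbyLast = 0;                  // standby recorded the start of the previous checkpoint
        long activeLast = standbyLast + skew;  // active recorded a later time for the same checkpoint
        long now = standbyLast + period;       // the moment the standby considers a checkpoint due

        System.out.println("standby due:    " + standbyCheckpointDue(now, standbyLast, period));
        System.out.println("active accepts: " + activeAcceptsImage(now, activeLast, period));
    }
}
```

In this model the standby considers the checkpoint due while the active still sees an interval shorter than the period. The two sides agree only once the period has also elapsed relative to the active's later timestamp.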