[PR] HDFS-17815. Fix upload fsimage failure when checkpoint takes a long time [hadoop]

via GitHub Thu, 31 Jul 2025 10:20:50 -0700


lfxy opened a new pull request, #7845:
URL: https://github.com/apache/hadoop/pull/7845


   Our HDFS federation cluster has a capacity of more than 500 PB, and one 
namespace (NS) contains over 600 million files. A single checkpoint takes 
nearly two hours.
   
   We found that checkpoints frequently fail because the standby NameNode 
cannot upload the fsimage to the active NameNode, which leads to repeated 
checkpoints. We have dfs.recent.image.check.enabled=true configured. After 
debugging, we found the cause: the standby NN sets lastCheckpointTime to the 
start time of the checkpoint rather than the end time. In our cluster, the 
standby NN's lastCheckpointTime is approximately 80 minutes earlier than the 
active NN's lastCheckpointTime.
   
   When the time elapsed since the standby NN's lastCheckpointTime exceeds 
dfs.namenode.checkpoint.period, the standby NN starts the next checkpoint. 
Because the active NN's lastCheckpointTime is later than the standby NN's, the 
interval seen by the active NN is still less than 
dfs.namenode.checkpoint.period, so the fsimage upload is rejected, causing the 
checkpoint to fail and be retried.
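
   The mismatch described above can be illustrated with a small, self-contained 
sketch. This is not the NameNode code: the class and method names are 
hypothetical, the 120-minute period is an assumed value, and only the time 
comparison is modeled (the real check may involve further conditions). Times 
are in minutes; the 80-minute offset is the one reported above.

```java
// Hypothetical model of the two time checks; not taken from the Hadoop source.
class CheckpointSkewSketch {

    // Standby side (simplified): a checkpoint is due once the period has elapsed
    // since the standby's own lastCheckpointTime.
    static boolean standbyWantsCheckpoint(long nowMin, long standbyLastMin, long periodMin) {
        return nowMin - standbyLastMin >= periodMin;
    }

    // Active side (simplified): an upload is rejected if less than one period
    // has elapsed since the active's own lastCheckpointTime.
    static boolean activeRejectsUpload(long nowMin, long activeLastMin, long periodMin) {
        return nowMin - activeLastMin < periodMin;
    }

    // Length of the window in which the standby considers a checkpoint due
    // while the active would still reject the upload.
    static long conflictWindowMin(long standbyLastMin, long activeLastMin) {
        return Math.max(0L, activeLastMin - standbyLastMin);
    }

    public static void main(String[] args) {
        long periodMin = 120;    // assumed value standing in for dfs.namenode.checkpoint.period
        long standbyLastMin = 0; // standby recorded the START of the previous checkpoint
        long activeLastMin = 80; // active recorded a time about 80 minutes later
        long nowMin = 130;       // an arbitrary instant inside the conflict window

        System.out.println("standby wants checkpoint: "
                + standbyWantsCheckpoint(nowMin, standbyLastMin, periodMin));
        System.out.println("active rejects upload:    "
                + activeRejectsUpload(nowMin, activeLastMin, periodMin));
        System.out.println("conflict window (min):    "
                + conflictWindowMin(standbyLastMin, activeLastMin));

        // If the standby recorded the END time instead, both sides would share
        // the same reference point and the window would be empty.
        System.out.println("window with end time:     "
                + conflictWindowMin(activeLastMin, activeLastMin));
    }
}
```

   In this model the conflict window is exactly as long as the offset between 
the two recorded times, so recording the end time on the standby removes it.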


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


