caozhiqiang created HDFS-17815:
----------------------------------

             Summary: Fix upload fsimage failure when checkpoint takes a long 
time
                 Key: HDFS-17815
                 URL: https://issues.apache.org/jira/browse/HDFS-17815
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 3.5.0
            Reporter: caozhiqiang
            Assignee: caozhiqiang


The capacity of our HDFS federation cluster is more than 500 PB, with one NS 
containing over 600 million files. A single checkpoint takes nearly two hours.

We found that checkpoints frequently fail because the standby NN cannot put 
the fsimage to the active NameNode, which leads to repeated checkpoints. We 
have configured dfs.recent.image.check.enabled=true. After debugging, we found 
that the cause is that the standby NN sets lastCheckpointTime to the start 
time of the checkpoint rather than its end time. In our cluster, the 
lastCheckpointTime of the standby NN is approximately 80 minutes earlier than 
the lastCheckpointTime of the active NN.

When the time since the last checkpoint on the standby NN exceeds 
dfs.namenode.checkpoint.period, the next checkpoint is performed. Because the 
active NN's lastCheckpointTime is later than the standby NN's, the interval 
measured on the active NN is still less than dfs.namenode.checkpoint.period, 
so the fsimage upload is rejected, causing the checkpoint to fail and be 
retried.
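
The mismatch described above can be illustrated with a simplified model. This is only a sketch, not the actual StandbyCheckpointer or ImageServlet code: the class name, method names and the timeline values (prevStart, prevAcceptedByActive, buildTime) are hypothetical, and the "unless too long since last upload" relaxation mentioned in the ANN log is ignored. It assumes the active NN rejects an upload only when both the time delta and the txn delta are below their thresholds, as the log message suggests.

```java
/** Simplified model of the two time checks; NOT the actual Hadoop code. */
class CheckpointSkewSketch {

    /** Standby side: trigger once the period has elapsed since its own lastCheckpointTime. */
    static boolean standbyTriggers(long nowSec, long standbyLastSec, long periodSec) {
        return nowSec - standbyLastSec >= periodSec;
    }

    /** Active side (assumed rule): reject when neither the time nor the txn threshold is reached. */
    static boolean activeRejects(long nowSec, long activeLastSec, long periodSec,
                                 long txnDelta, long txnThreshold) {
        return nowSec - activeLastSec < periodSec && txnDelta < txnThreshold;
    }

    /** Earliest time at which the standby starts its next checkpoint. */
    static long nextStart(long standbyLastSec, long periodSec) {
        return standbyLastSec + periodSec;
    }

    public static void main(String[] args) {
        long period = 43_200;              // dfs.namenode.checkpoint.period seen in the logs
        long txnThreshold = 300_000_000L;  // "expecting at least" value in the ANN log
        long txnDelta = 126_487_459L;      // "New txnid cnt" value in the ANN log

        // Hypothetical timeline, in seconds from the previous checkpoint start.
        long prevStart = 0;                // standby records this today (start time)
        long prevAcceptedByActive = 4_800; // active records this, about 80 minutes later
        long buildTime = 3_600;            // hypothetical time from trigger to upload

        // Current behaviour: the standby measures from the start time.
        long buggyUpload = nextStart(prevStart, period) + buildTime;
        System.out.println("start-time bookkeeping rejected: "
            + activeRejects(buggyUpload, prevAcceptedByActive, period, txnDelta, txnThreshold));

        // Assumed fix: the standby measures from (roughly) the end time.
        long fixedUpload = nextStart(prevAcceptedByActive, period) + buildTime;
        System.out.println("end-time bookkeeping rejected:   "
            + activeRejects(fixedUpload, prevAcceptedByActive, period, txnDelta, txnThreshold));
    }
}
```

With start-time bookkeeping the upload arrives 42000 seconds after the active NN's lastCheckpointTime and is rejected; with end-time bookkeeping it arrives 46800 seconds after and is accepted.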

ANN's log:
{code:java}
2025-07-31 07:14:29,845 INFO [qtp231311211-8404] 
org.apache.hadoop.hdfs.server.namenode.ImageServlet: New txnid cnt is 
126487459, expecting at least 300000000. now is 1753917269845, 
lastCheckpointTime is 1753875142580, timeDelta is 42127, expecting period at 
least 43200 unless too long since last upload.. {code}
SNN's log:
{code:java}
Previous checkpoint start time:
2025-07-30 18:13:08,729 INFO [Standby State Checkpointer] 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Triggering 
checkpoint because it has been 48047 seconds since the last checkpoint, which 
exceeds the configured interval 43200

Previous checkpoint end time:
2025-07-30 20:11:51,330 INFO [Standby State Checkpointer] 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Checkpoint 
finished successfully. 

Current checkpoint start time:
2025-07-31 06:13:51,681 INFO [Standby State Checkpointer] 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Triggering 
checkpoint because it has been 43242 seconds since the last checkpoint, which 
exceeds the configured interval 43200{code}
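
The numbers in the two logs can be cross-checked with a little arithmetic. The sketch below is illustrative only; the class and method names are hypothetical. It assumes both NameNodes log in the same time zone with synchronized clocks, and that the timestamp of the ANN log line equals the "now" value it prints (both end in 845 ms).

```java
import java.time.Duration;
import java.time.LocalDateTime;

/** Re-derives the intervals from the log excerpts above; names are hypothetical. */
class CheckpointLogArithmetic {
    // Timestamps copied from the SNN and ANN logs.
    static final LocalDateTime PREV_START = LocalDateTime.parse("2025-07-30T18:13:08.729");
    static final LocalDateTime PREV_END   = LocalDateTime.parse("2025-07-30T20:11:51.330");
    static final LocalDateTime CUR_START  = LocalDateTime.parse("2025-07-31T06:13:51.681");
    static final LocalDateTime REJECT_LOG = LocalDateTime.parse("2025-07-31T07:14:29.845");
    // Epoch milliseconds copied from the ANN log.
    static final long NOW_MS = 1753917269845L;
    static final long ACTIVE_LAST_MS = 1753875142580L;

    /** Interval as seen by the standby NN, measured from the previous start time. */
    static long standbyIntervalSec() {
        return Duration.between(PREV_START, CUR_START).getSeconds();
    }

    /** Interval the standby NN would see if it measured from the previous end time. */
    static long intervalFromEndSec() {
        return Duration.between(PREV_END, CUR_START).getSeconds();
    }

    /** Interval as seen by the active NN (timeDelta in the ANN log). */
    static long activeTimeDeltaSec() {
        return (NOW_MS - ACTIVE_LAST_MS) / 1000;
    }

    /** Active NN's lastCheckpointTime expressed as a log-style local timestamp. */
    static LocalDateTime activeLastCheckpoint() {
        return REJECT_LOG.minus(Duration.ofMillis(NOW_MS - ACTIVE_LAST_MS));
    }

    /** How far the standby NN's lastCheckpointTime lags behind the active NN's. */
    static long skewMinutes() {
        return Duration.between(PREV_START, activeLastCheckpoint()).toMinutes();
    }

    public static void main(String[] args) {
        System.out.println("standby interval (s): " + standbyIntervalSec());
        System.out.println("interval from end (s): " + intervalFromEndSec());
        System.out.println("active timeDelta (s): " + activeTimeDeltaSec());
        System.out.println("active lastCheckpointTime: " + activeLastCheckpoint());
        System.out.println("skew (min): " + skewMinutes());
    }
}
```

Under these assumptions the standby interval comes out to 43242 seconds, matching the SNN log and confirming that it is measured from the previous start time. The active NN's timeDelta is 42127 seconds, and its lastCheckpointTime works out to 2025-07-30 19:32:22, about 79 minutes after the standby's.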


