[ https://issues.apache.org/jira/browse/HDFS-17815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18013034#comment-18013034 ]
ASF GitHub Bot commented on HDFS-17815:
---------------------------------------

ayushtkn commented on code in PR #7845:
URL: https://github.com/apache/hadoop/pull/7845#discussion_r2265163645


##########
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyCheckpoints.java:
##########
@@ -649,6 +649,50 @@ public void testCheckpointSucceedsWithLegacyOIVException() throws Exception {
     HATestUtil.waitForCheckpoint(cluster, 0, ImmutableList.of(12));
   }
 
+  /**
+   * Test that lastCheckpointTime is correctly updated at each checkpoint
+   */
+  @Test(timeout = 300000)
+  public void testLastCheckpointTime() throws Exception {

Review Comment:
   This test is passing with your prod change as well for me

##########
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyCheckpoints.java:
##########
@@ -649,6 +649,50 @@ public void testCheckpointSucceedsWithLegacyOIVException() throws Exception {
     HATestUtil.waitForCheckpoint(cluster, 0, ImmutableList.of(12));
   }
 
+  /**
+   * Test that lastCheckpointTime is correctly updated at each checkpoint
+   */
+  @Test(timeout = 300000)
+  public void testLastCheckpointTime() throws Exception {
+    for (int i = 1; i < NUM_NNS; i++) {
+      cluster.shutdownNameNode(i);
+
+      // Make true checkpoint for DFS_NAMENODE_CHECKPOINT_PERIOD_KEY
+      cluster.getConfiguration(i).setInt(DFSConfigKeys.DFS_NAMENODE_CHECKPOINT_PERIOD_KEY, 3);
+      cluster.getConfiguration(i).setInt(DFSConfigKeys.DFS_NAMENODE_CHECKPOINT_TXNS_KEY, 1000);
+    }
+
+    doEdits(0, 10);
+    cluster.transitionToStandby(0);
+
+    // Standby NNs do checkpoint without active NN available.
+    for (int i = 1; i < NUM_NNS; i++) {
+      cluster.restartNameNode(i, false);
+    }
+    cluster.waitClusterUp();
+
+    cluster.transitionToActive(0);
+    cluster.transitionToStandby(1);
+
+    HATestUtil.waitForCheckpoint(cluster, 1, ImmutableList.of(12));
+
+    Thread.sleep(3000);
+    Long snnCheckpointTime1 = StandbyCheckpointer.getLastCheckpointTime();
+    long annCheckpointTime1 = nns[0].getFSImage().getStorage().getMostRecentCheckpointTime();
+
+    doEdits(11, 20);
+    nns[0].getRpcServer().rollEditLog();
+
+    HATestUtil.waitForCheckpoint(cluster, 1, ImmutableList.of(23));
+    Thread.sleep(3000);
+    Long snnCheckpointTime2 = StandbyCheckpointer.getLastCheckpointTime();
+    long annCheckpointTime2 = nns[0].getFSImage().getStorage().getMostRecentCheckpointTime();
+
+    // Make sure the interv

Review Comment:
   something is missing here



> Fix upload fsimage failure when checkpoint takes a long time
> ------------------------------------------------------------
>
>                 Key: HDFS-17815
>                 URL: https://issues.apache.org/jira/browse/HDFS-17815
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.5.0
>            Reporter: caozhiqiang
>            Assignee: caozhiqiang
>            Priority: Major
>              Labels: pull-request-available
>
> The capacity of our HDFS federation cluster is more than 500 PB, and one NS contains over 600 million files. A single checkpoint takes nearly two hours. We frequently saw checkpoints fail because the fsimage could not be uploaded to the active NameNode, leading to repeated checkpoints. We have dfs.recent.image.check.enabled=true configured. After debugging, the cause is that the standby NN updates lastCheckpointTime using the start time of the checkpoint rather than the end time. In our cluster, the standby NN's lastCheckpointTime is approximately 80 minutes ahead of the active NN's lastCheckpointTime.
> When the checkpoint interval on the standby NN exceeds dfs.namenode.checkpoint.period, the next checkpoint is performed. Because the active NN's lastCheckpointTime is later than the standby NN's, the interval computed on the active side is less than dfs.namenode.checkpoint.period, so the fsimage upload is rejected, causing the checkpoint to fail and be retried.
> ANN's log:
> {code:java}
> 2025-07-31 07:14:29,845 INFO [qtp231311211-8404] org.apache.hadoop.hdfs.server.namenode.ImageServlet: New txnid cnt is 126487459, expecting at least 300000000. now is 1753917269845, lastCheckpointTime is 1753875142580, timeDelta is 42127, expecting period at least 43200 unless too long since last upload.. {code}
> SNN's log:
> {code:java}
> last checkpoint start time:
> 2025-07-30 18:13:08,729 INFO [Standby State Checkpointer] org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Triggering checkpoint because it has been 48047 seconds since the last checkpoint, which exceeds the configured interval 43200
> last checkpoint end time:
> 2025-07-30 20:11:51,330 INFO [Standby State Checkpointer] org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Checkpoint finished successfully.
> this time checkpoint start time:
> 2025-07-31 06:13:51,681 INFO [Standby State Checkpointer] org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Triggering checkpoint because it has been 43242 seconds since the last checkpoint, which exceeds the configured interval 43200{code}
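
To make the described drift concrete, here is a minimal, simplified sketch (not the actual StandbyCheckpointer code; the class, method, and field names are illustrative): if the standby records lastCheckpointTime when the checkpoint starts, while the active NN's mostRecentCheckpointTime reflects when the uploaded image was saved, the two drift apart by roughly the duration of the checkpoint.

{code:java}
import java.util.concurrent.TimeUnit;

/**
 * Minimal sketch of the timing drift described above. Names are illustrative;
 * this is not the real StandbyCheckpointer implementation.
 */
public class CheckpointTimingSketch {

  /** Stands in for dfs.namenode.checkpoint.period (seconds). */
  private static final long CHECKPOINT_PERIOD_SEC = 43200;

  private long lastCheckpointTimeSec = nowSec();

  private static long nowSec() {
    return TimeUnit.NANOSECONDS.toSeconds(System.nanoTime());
  }

  /** One iteration of a simplified standby checkpointer loop. */
  void checkOnce() throws InterruptedException {
    if (nowSec() - lastCheckpointTimeSec >= CHECKPOINT_PERIOD_SEC) {
      // Problematic pattern: remember the *start* time as the checkpoint time.
      // If the checkpoint below runs for ~2 hours, the standby considers the
      // next period elapsed ~2 hours before the active NN does, because the
      // active NN's mostRecentCheckpointTime reflects when the image was
      // actually saved.
      lastCheckpointTimeSec = nowSec();

      doCheckpointAndUpload();

      // Fix described in this issue: record the time *after* the checkpoint
      // finishes instead, keeping the standby's and active NN's notions of the
      // last checkpoint time roughly aligned:
      // lastCheckpointTimeSec = nowSec();
    }
  }

  /** Stand-in for saving the fsimage and uploading it to the active NN. */
  private void doCheckpointAndUpload() throws InterruptedException {
    Thread.sleep(10);
  }
}
{code}

With the start-time behavior, the active NN's ImageServlet (per the ANN log above) sees both a txn count below its threshold and a timeDelta (42127 s) still below dfs.namenode.checkpoint.period (43200 s) when the upload arrives, and rejects it; recording the end time removes that gap.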