[ https://issues.apache.org/jira/browse/HDFS-17815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18013034#comment-18013034 ]
ASF GitHub Bot commented on HDFS-17815:
---------------------------------------

ayushtkn commented on code in PR #7845:
URL: https://github.com/apache/hadoop/pull/7845#discussion_r2265163645


##########
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyCheckpoints.java:
##########
@@ -649,6 +649,50 @@ public void testCheckpointSucceedsWithLegacyOIVException() throws Exception {
     HATestUtil.waitForCheckpoint(cluster, 0, ImmutableList.of(12));
   }
 
+  /**
+   * Test that lastCheckpointTime is correctly updated at each checkpoint
+   */
+  @Test(timeout = 300000)
+  public void testLastCheckpointTime() throws Exception {

Review Comment:
   This test is passing with your prod change as well for me

##########
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyCheckpoints.java:
##########
@@ -649,6 +649,50 @@ public void testCheckpointSucceedsWithLegacyOIVException() throws Exception {
     HATestUtil.waitForCheckpoint(cluster, 0, ImmutableList.of(12));
   }
 
+  /**
+   * Test that lastCheckpointTime is correctly updated at each checkpoint
+   */
+  @Test(timeout = 300000)
+  public void testLastCheckpointTime() throws Exception {
+    for (int i = 1; i < NUM_NNS; i++) {
+      cluster.shutdownNameNode(i);
+
+      // Make true checkpoint for DFS_NAMENODE_CHECKPOINT_PERIOD_KEY
+      cluster.getConfiguration(i).setInt(DFSConfigKeys.DFS_NAMENODE_CHECKPOINT_PERIOD_KEY, 3);
+      cluster.getConfiguration(i).setInt(DFSConfigKeys.DFS_NAMENODE_CHECKPOINT_TXNS_KEY, 1000);
+    }
+
+    doEdits(0, 10);
+    cluster.transitionToStandby(0);
+
+    // Standby NNs do checkpoint without active NN available.
+    for (int i = 1; i < NUM_NNS; i++) {
+      cluster.restartNameNode(i, false);
+    }
+    cluster.waitClusterUp();
+
+    cluster.transitionToActive(0);
+    cluster.transitionToStandby(1);
+
+    HATestUtil.waitForCheckpoint(cluster, 1, ImmutableList.of(12));
+
+    Thread.sleep(3000);
+    Long snnCheckpointTime1 = StandbyCheckpointer.getLastCheckpointTime();
+    long annCheckpointTime1 = nns[0].getFSImage().getStorage().getMostRecentCheckpointTime();
+
+    doEdits(11, 20);
+    nns[0].getRpcServer().rollEditLog();
+
+    HATestUtil.waitForCheckpoint(cluster, 1, ImmutableList.of(23));
+    Thread.sleep(3000);
+    Long snnCheckpointTime2 = StandbyCheckpointer.getLastCheckpointTime();
+    long annCheckpointTime2 = nns[0].getFSImage().getStorage().getMostRecentCheckpointTime();
+
+    // Make sure the interv

Review Comment:
   something is missing here



> Fix upload fsimage failure when checkpoint takes a long time
> ------------------------------------------------------------
>
>                 Key: HDFS-17815
>                 URL: https://issues.apache.org/jira/browse/HDFS-17815
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.5.0
>            Reporter: caozhiqiang
>            Assignee: caozhiqiang
>            Priority: Major
>              Labels: pull-request-available
>
> The capacity of our HDFS federation cluster is more than 500 PB, and one NS contains over 600 million files. A single checkpoint takes nearly two hours. We frequently saw checkpoints fail because the fsimage could not be uploaded to the active NameNode, leading to repeated checkpoints. We have dfs.recent.image.check.enabled=true configured. After debugging, the cause is that the standby NN updates lastCheckpointTime using the start time of the checkpoint rather than the end time. In our cluster, the standby NN's lastCheckpointTime is approximately 80 minutes ahead of the active NN's lastCheckpointTime.
> When the checkpoint interval on the standby NN exceeds dfs.namenode.checkpoint.period, the next checkpoint is performed. Because the active NN's lastCheckpointTime is later than the standby NN's, the interval computed on the active side is less than dfs.namenode.checkpoint.period, so the fsimage upload is rejected, causing the checkpoint to fail and be retried.
> ANN's log:
> {code:java}
> 2025-07-31 07:14:29,845 INFO [qtp231311211-8404] org.apache.hadoop.hdfs.server.namenode.ImageServlet: New txnid cnt is 126487459, expecting at least 300000000. now is 1753917269845, lastCheckpointTime is 1753875142580, timeDelta is 42127, expecting period at least 43200 unless too long since last upload.. {code}
> SNN's log:
> {code:java}
> last checkpoint start time:
> 2025-07-30 18:13:08,729 INFO [Standby State Checkpointer] org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Triggering checkpoint because it has been 48047 seconds since the last checkpoint, which exceeds the configured interval 43200
> last checkpoint end time:
> 2025-07-30 20:11:51,330 INFO [Standby State Checkpointer] org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Checkpoint finished successfully.
> this time checkpoint start time:
> 2025-07-31 06:13:51,681 INFO [Standby State Checkpointer] org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Triggering checkpoint because it has been 43242 seconds since the last checkpoint, which exceeds the configured interval 43200{code}
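
To make the described drift concrete, here is a minimal, simplified sketch (not the actual StandbyCheckpointer code; the class, method, and field names are illustrative): if the standby records lastCheckpointTime when the checkpoint starts, while the active NN's mostRecentCheckpointTime reflects when the uploaded image was saved, the two drift apart by roughly the duration of the checkpoint.

{code:java}
import java.util.concurrent.TimeUnit;

/**
 * Minimal sketch of the timing drift described above. Names are illustrative;
 * this is not the real StandbyCheckpointer implementation.
 */
public class CheckpointTimingSketch {

  /** Stands in for dfs.namenode.checkpoint.period (seconds). */
  private static final long CHECKPOINT_PERIOD_SEC = 43200;

  private long lastCheckpointTimeSec = nowSec();

  private static long nowSec() {
    return TimeUnit.NANOSECONDS.toSeconds(System.nanoTime());
  }

  /** One iteration of a simplified standby checkpointer loop. */
  void checkOnce() throws InterruptedException {
    if (nowSec() - lastCheckpointTimeSec >= CHECKPOINT_PERIOD_SEC) {
      // Problematic pattern: remember the *start* time as the checkpoint time.
      // If the checkpoint below runs for ~2 hours, the standby considers the
      // next period elapsed ~2 hours before the active NN does, because the
      // active NN's mostRecentCheckpointTime reflects when the image was
      // actually saved.
      lastCheckpointTimeSec = nowSec();

      doCheckpointAndUpload();

      // Fix described in this issue: record the time *after* the checkpoint
      // finishes instead, keeping the standby's and active NN's notions of the
      // last checkpoint time roughly aligned:
      // lastCheckpointTimeSec = nowSec();
    }
  }

  /** Stand-in for saving the fsimage and uploading it to the active NN. */
  private void doCheckpointAndUpload() throws InterruptedException {
    Thread.sleep(10);
  }
}
{code}

With the start-time behavior, the active NN's ImageServlet (per the ANN log above) sees both a txn count below its threshold and a timeDelta (42127 s) still below dfs.namenode.checkpoint.period (43200 s) when the upload arrives, and rejects it; recording the end time removes that gap.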