LiuGuH opened a new pull request, #8277:
URL: https://github.com/apache/hadoop/pull/8277

   HDFS-17886. Fix NameNode storage directory errors caused by doCheckpoint's updateStorageVersion() failing when the doCheckpoint thread is interrupted during a standby-to-active HA failover
   
   ### Description of PR
   
   As described in [HDFS-17886](https://issues.apache.org/jira/browse/HDFS-17886):
   
   When a NameNode HA failover occurs and the standby NameNode transitions to active, it interrupts the doCheckpoint thread. With an extremely small probability, updateStorageVersion() inside doCheckpoint then throws java.nio.channels.ClosedByInterruptException, which causes the storage directory to be marked as failed and removed from the available list.
   
   The relevant error log is as follows:
   
   ```
   2026-01-29 20:13:38,234 WARN org.apache.hadoop.hdfs.server.common.Storage: 
Error during write properties to the VERSION file to Storage Directory root= 
/data/hadoop/hdfs/namenode; location= null
   java.nio.channels.ClosedByInterruptException
           at 
java.base/java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:199)
           at 
java.base/sun.nio.ch.FileChannelImpl.endBlocking(FileChannelImpl.java:162)
           at 
java.base/sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:342)
           at 
org.apache.hadoop.hdfs.server.common.Storage.writeProperties(Storage.java:1284)
           at 
org.apache.hadoop.hdfs.server.common.Storage.writeProperties(Storage.java:1263)
           at 
org.apache.hadoop.hdfs.server.common.Storage.writeProperties(Storage.java:1254)
           at 
org.apache.hadoop.hdfs.server.namenode.NNStorage.writeAll(NNStorage.java:1169)
           at 
org.apache.hadoop.hdfs.server.namenode.FSImage.updateStorageVersion(FSImage.java:1106)
           at 
org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:1165)
           at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:227)
           at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1300(StandbyCheckpointer.java:64)
           at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:480)
           at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$600(StandbyCheckpointer.java:383)
           at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:403)
           at 
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:503)
           at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:399)
   2026-01-29 20:13:38,238 ERROR org.apache.hadoop.hdfs.server.common.Storage: 
Error reported on storage directory Storage Directory root= 
/data/hadoop/hdfs/namenode; location= null
   2026-01-29 20:13:38,238 WARN org.apache.hadoop.hdfs.server.common.Storage: 
About to remove corresponding storage: /data/hadoop/hdfs/namenode
   2026-01-29 20:13:38,245 ERROR 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Exception in 
doCheckpoint
   java.io.IOException: All the storage failed while writing properties to 
VERSION file
           at 
org.apache.hadoop.hdfs.server.namenode.NNStorage.writeAll(NNStorage.java:1175)
           at 
org.apache.hadoop.hdfs.server.namenode.FSImage.updateStorageVersion(FSImage.java:1106)
           at 
org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:1165)
           at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:227)
           at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1300(StandbyCheckpointer.java:64)
           at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:480)
           at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$600(StandbyCheckpointer.java:383)
           at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:403)
           at 
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:503)
           at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:399)
   ```
   
   Since java.nio.channels.ClosedByInterruptException does not indicate a disk error, the storage directory should not be removed from the available storage list.
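
   The failure mode and the fix direction can be sketched with a minimal, self-contained simulation (this is not the actual Hadoop patch; the class and method names are illustrative). It shows that a pending thread interrupt makes an interruptible FileChannel operation fail with ClosedByInterruptException, and that this case can be distinguished from genuine disk I/O errors so the directory is not reported as failed:

   ```java
   import java.io.IOException;
   import java.nio.channels.ClosedByInterruptException;
   import java.nio.channels.FileChannel;
   import java.nio.file.Files;
   import java.nio.file.Path;
   import java.nio.file.StandardOpenOption;

   public class InterruptedCheckpointDemo {

       /**
        * Simulates the doCheckpoint thread being interrupted (e.g. by an HA
        * failover) in the middle of writing the VERSION file, and classifies
        * the resulting exception.
        */
       static String writeVersionFile(boolean interruptFirst) throws Exception {
           Path tmp = Files.createTempFile("VERSION", ".tmp");
           final String[] result = new String[1];
           Thread checkpointer = new Thread(() -> {
               try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
                   if (interruptFirst) {
                       // Simulate the standby-to-active transition interrupting us.
                       Thread.currentThread().interrupt();
                   }
                   // Interruptible channel operation: with the interrupt status
                   // set, this throws ClosedByInterruptException (the same call
                   // that fails in Storage.writeProperties).
                   ch.position(0);
                   result[0] = "ok";
               } catch (ClosedByInterruptException e) {
                   // Thread interruption, not a bad disk: do NOT report a
                   // storage-directory failure for this case.
                   result[0] = "interrupted";
               } catch (IOException e) {
                   // A genuine I/O failure: report the storage directory as bad.
                   result[0] = "disk-error";
               }
           });
           checkpointer.start();
           checkpointer.join();
           Files.deleteIfExists(tmp);
           return result[0];
       }

       public static void main(String[] args) throws Exception {
           System.out.println(writeVersionFile(true));   // interrupted
           System.out.println(writeVersionFile(false));  // ok
       }
   }
   ```

   This mirrors the stack trace above: the exception originates from AbstractInterruptibleChannel.end(), so it signals thread interruption rather than a failing disk.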

