[ https://issues.apache.org/jira/browse/HDFS-13269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16567653#comment-16567653 ]
maobaolong commented on HDFS-13269:
-----------------------------------

It is indeed an item to be improved.

> After a too many open files exception occurs, the standby NN never does a checkpoint
> -------------------------------------------------------------------------------------
>
>                 Key: HDFS-13269
>                 URL: https://issues.apache.org/jira/browse/HDFS-13269
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 3.2.0
>            Reporter: maobaolong
>            Priority: Major
>
> Run saveNamespace via dfsadmin. The output is as follows:
> {code:java}
> saveNamespace: No image directories available!
> {code}
> The NameNode log shows:
> {code:java}
> [2018-01-13T10:32:19.903+08:00] [INFO] [Standby State Checkpointer] : Triggering checkpoint because there have been 10159265 txns since the last checkpoint, which exceeds the configured threshold 10000000
> [2018-01-13T10:32:19.903+08:00] [INFO] [Standby State Checkpointer] : Save namespace ...
> ...
> [2018-01-13T10:37:10.539+08:00] [WARN] [1985938863@qtp-61073295-1 - Acceptor0 HttpServer2$SelectChannelConnectorWithSafeStartup@HOST_A:50070] : EXCEPTION
> java.io.IOException: Too many open files
>     at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
>     at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
>     at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
>     at org.mortbay.jetty.nio.SelectChannelConnector$1.acceptChannel(SelectChannelConnector.java:75)
>     at org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:686)
>     at org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:192)
>     at org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)
>     at org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:708)
>     at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> [2018-01-13T10:37:15.421+08:00] [ERROR] [FSImageSaver for /data0/nn of type IMAGE_AND_EDITS] : Unable to save image for /data0/nn
> java.io.FileNotFoundException: /data0/nn/current/fsimage_0000000040247283317.md5.tmp (Too many open files)
>     at java.io.FileOutputStream.open0(Native Method)
>     at java.io.FileOutputStream.open(FileOutputStream.java:270)
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
>     at org.apache.hadoop.hdfs.util.AtomicFileOutputStream.<init>(AtomicFileOutputStream.java:58)
>     at org.apache.hadoop.hdfs.util.MD5FileUtils.saveMD5File(MD5FileUtils.java:157)
>     at org.apache.hadoop.hdfs.util.MD5FileUtils.saveMD5File(MD5FileUtils.java:149)
>     at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImage(FSImage.java:990)
>     at org.apache.hadoop.hdfs.server.namenode.FSImage$FSImageSaver.run(FSImage.java:1039)
>     at java.lang.Thread.run(Thread.java:745)
> [2018-01-13T10:37:15.421+08:00] [ERROR] [Standby State Checkpointer] : Error reported on storage directory Storage Directory /data0/nn
> [2018-01-13T10:37:15.421+08:00] [WARN] [Standby State Checkpointer] : About to remove corresponding storage: /data0/nn
> [2018-01-13T10:37:15.429+08:00] [ERROR] [Standby State Checkpointer] : Exception in doCheckpoint
> java.io.IOException: Failed to save in any storage directories while saving namespace.
>     at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:1176)
>     at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:1107)
>     at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:185)
>     at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1400(StandbyCheckpointer.java:62)
>     at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:353)
>     at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$700(StandbyCheckpointer.java:260)
>     at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:280)
>     at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
>     at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:276)
> ...
> [2018-01-13T15:52:33.783+08:00] [INFO] [Standby State Checkpointer] : Save namespace ...
> [2018-01-13T15:52:33.783+08:00] [ERROR] [Standby State Checkpointer] : Exception in doCheckpoint
> java.io.IOException: No image directories available!
>     at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:1152)
>     at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:1107)
>     at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:185)
>     at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1400(StandbyCheckpointer.java:62)
>     at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:353)
>     at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$700(StandbyCheckpointer.java:260)
>     at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:280)
>     at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
>     at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:276)
> {code}
> Looking at NNStorage#reportErrorsOnDirectory:
> {code:java}
> private void reportErrorsOnDirectory(StorageDirectory sd) {
>   LOG.error("Error reported on storage directory {}", sd);
>   if (LOG.isDebugEnabled()) {
>     String lsd = listStorageDirectories();
>     LOG.debug("current list of storage dirs:{}", lsd);
>   }
>   LOG.warn("About to remove corresponding storage: {}",
>       sd.getRoot().getAbsolutePath());
>   try {
>     sd.unlock();
>   } catch (Exception e) {
>     LOG.warn("Unable to unlock bad storage directory: {}",
>         sd.getRoot().getPath(), e);
>   }
>   if (getStorageDirs().remove(sd)) {
>     this.removedStorageDirs.add(sd);
>   }
>   if (LOG.isDebugEnabled()) {
>     String lsd = listStorageDirectories();
>     LOG.debug("at the end current list of storage dirs:{}", lsd);
>   }
> }
> {code}
> I think that when the FileNotFoundException (Too many open files) occurs, we should not remove the directory from storageDirs, because the condition can cure itself after several minutes.
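>
> A minimal sketch of that idea, assuming the causing exception could be plumbed into the error report (the IOException parameter and the isTransientError helper below are hypothetical additions, not part of the current NNStorage code):
> {code:java}
> // Hypothetical sketch only: reportErrorsOnDirectory does not currently
> // receive the causing exception; callers would need to pass it through.
> private void reportErrorsOnDirectory(StorageDirectory sd, IOException cause) {
>   LOG.error("Error reported on storage directory {}", sd, cause);
>   if (isTransientError(cause)) {
>     // fd exhaustion usually clears once leaked or idle descriptors are
>     // closed, so keep the directory and let the next checkpoint retry it.
>     LOG.warn("Transient error on storage {}; not removing it",
>         sd.getRoot().getAbsolutePath());
>     return;
>   }
>   // ... fall through to the existing unlock-and-remove logic above ...
> }
>
> // Hypothetical helper: the JDK surfaces EMFILE only as exception message
> // text ("Too many open files"), so message matching is the pragmatic check.
> private static boolean isTransientError(IOException e) {
>   String msg = e.getMessage();
>   return msg != null && msg.contains("Too many open files");
> }
> {code}
> With a check like this, the standby would keep /data0/nn and retry on the next checkpoint, rather than discarding the directory and failing forever with "No image directories available!".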