[ https://issues.apache.org/jira/browse/HDFS-13269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16567653#comment-16567653 ]

maobaolong commented on HDFS-13269:
-----------------------------------

It is indeed an item to be improved.

> After a "too many open files" exception occurs, the standby NN never does 
> a checkpoint
> -------------------------------------------------------------------------------
>
>                 Key: HDFS-13269
>                 URL: https://issues.apache.org/jira/browse/HDFS-13269
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 3.2.0
>            Reporter: maobaolong
>            Priority: Major
>
> Running saveNamespace from dfsadmin ({{hdfs dfsadmin -saveNamespace}}) prints the following:
> {code:java}
> saveNamespace: No image directories available!
> {code}
> The NameNode log shows:
> {code:java}
> [2018-01-13T10:32:19.903+08:00] [INFO] [Standby State Checkpointer] : Triggering checkpoint because there have been 10159265 txns since the last checkpoint, which exceeds the configured threshold 10000000
> [2018-01-13T10:32:19.903+08:00] [INFO] [Standby State Checkpointer] : Save namespace ...
> ...
> [2018-01-13T10:37:10.539+08:00] [WARN] [1985938863@qtp-61073295-1 - Acceptor0 HttpServer2$SelectChannelConnectorWithSafeStartup@HOST_A:50070] : EXCEPTION
> java.io.IOException: Too many open files
>         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
>         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
>         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
>         at org.mortbay.jetty.nio.SelectChannelConnector$1.acceptChannel(SelectChannelConnector.java:75)
>         at org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:686)
>         at org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:192)
>         at org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)
>         at org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:708)
>         at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> [2018-01-13T10:37:15.421+08:00] [ERROR] [FSImageSaver for /data0/nn of type IMAGE_AND_EDITS] : Unable to save image for /data0/nn
> java.io.FileNotFoundException: /data0/nn/current/fsimage_0000000040247283317.md5.tmp (Too many open files)
>         at java.io.FileOutputStream.open0(Native Method)
>         at java.io.FileOutputStream.open(FileOutputStream.java:270)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
>         at org.apache.hadoop.hdfs.util.AtomicFileOutputStream.<init>(AtomicFileOutputStream.java:58)
>         at org.apache.hadoop.hdfs.util.MD5FileUtils.saveMD5File(MD5FileUtils.java:157)
>         at org.apache.hadoop.hdfs.util.MD5FileUtils.saveMD5File(MD5FileUtils.java:149)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImage(FSImage.java:990)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage$FSImageSaver.run(FSImage.java:1039)
>         at java.lang.Thread.run(Thread.java:745)
> [2018-01-13T10:37:15.421+08:00] [ERROR] [Standby State Checkpointer] : Error reported on storage directory Storage Directory /data0/nn
> [2018-01-13T10:37:15.421+08:00] [WARN] [Standby State Checkpointer] : About to remove corresponding storage: /data0/nn
> [2018-01-13T10:37:15.429+08:00] [ERROR] [Standby State Checkpointer] : Exception in doCheckpoint
> java.io.IOException: Failed to save in any storage directories while saving namespace.
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:1176)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:1107)
>         at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:185)
>         at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1400(StandbyCheckpointer.java:62)
>         at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:353)
>         at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$700(StandbyCheckpointer.java:260)
>         at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:280)
>         at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
>         at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:276)
> ...
> [2018-01-13T15:52:33.783+08:00] [INFO] [Standby State Checkpointer] : Save namespace ...
> [2018-01-13T15:52:33.783+08:00] [ERROR] [Standby State Checkpointer] : Exception in doCheckpoint
> java.io.IOException: No image directories available!
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:1152)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:1107)
>         at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:185)
>         at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1400(StandbyCheckpointer.java:62)
>         at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:353)
>         at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$700(StandbyCheckpointer.java:260)
>         at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:280)
>         at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
>         at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:276)
> {code}
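> The second failure is a direct consequence of the first: once the standby's only image directory has been dropped, every later checkpoint fails on the guard at the top of FSImage#saveFSImageInAllDirs (FSImage.java:1152 in the trace above). Roughly, reconstructed from the stack trace rather than quoted from any particular release:
> {code:java}
> // Paraphrase of the guard in FSImage#saveFSImageInAllDirs; the exact code
> // may differ between versions. With the last image directory removed, the
> // count stays 0 and every checkpoint throws here until the NN is restarted.
> if (storage.getNumStorageDirs(NameNodeDirType.IMAGE) == 0) {
>   throw new IOException("No image directories available!");
> }
> {code}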
> Looking at NNStorage#reportErrorsOnDirectory, which removes the failed directory unconditionally:
> {code:java}
> private void reportErrorsOnDirectory(StorageDirectory sd) {
>   LOG.error("Error reported on storage directory {}", sd);
>   if (LOG.isDebugEnabled()) {
>     String lsd = listStorageDirectories();
>     LOG.debug("current list of storage dirs:{}", lsd);
>   }
>   LOG.warn("About to remove corresponding storage: {}", sd.getRoot()
>       .getAbsolutePath());
>   try {
>     sd.unlock();
>   } catch (Exception e) {
>     LOG.warn("Unable to unlock bad storage directory: {}", sd.getRoot()
>         .getPath(), e);
>   }
>   if (getStorageDirs().remove(sd)) {
>     this.removedStorageDirs.add(sd);
>   }
>   if (LOG.isDebugEnabled()) {
>     String lsd = listStorageDirectories();
>     LOG.debug("at the end current list of storage dirs:{}", lsd);
>   }
> }
> {code}
> I think that when the FileNotFoundException ("Too many open files") occurs, we 
> should not remove the storage directory, because the condition is transient 
> and can cure itself after a few minutes, once file descriptors are released.


