[ https://issues.apache.org/jira/browse/MAPREDUCE-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286869#comment-13286869 ]

Jason Lowe commented on MAPREDUCE-4298:
---------------------------------------

The relevant portion of the NM logfile:

{noformat}
[New I/O server worker #1-2]2012-05-25 21:30:36,719 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/1/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000833_0/file.out not found
[New I/O server worker #1-15]2012-05-25 21:30:36,719 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/1/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000833_0/file.out not found
[New I/O server worker #1-23]2012-05-25 21:30:36,719 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/1/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000833_0/file.out not found
[New I/O server worker #1-2]2012-05-25 21:30:36,720 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/9/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000831_0/file.out not found
[New I/O server worker #1-17]2012-05-25 21:30:36,720 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/1/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000833_0/file.out not found
[New I/O server worker #1-24]2012-05-25 21:30:36,721 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/10/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000030_0/file.out not found
[New I/O server worker #1-30]2012-05-25 21:30:36,722 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/1/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000314_0/file.out not found
[New I/O server worker #1-18]2012-05-25 21:30:36,722 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container container_1337019383242_29334_01_000148 succeeded
[ContainersLauncher #2677]2012-05-25 21:30:36,722 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1337019383242_29334_01_000148 transitioned from RUNNING to EXITED_WITH_SUCCESS
[AsyncDispatcher event handler]2012-05-25 21:30:36,722 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1337019383242_29334_01_000148
[AsyncDispatcher event handler]2012-05-25 21:30:36,727 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/9/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000831_0/file.out not found
[New I/O server worker #1-7]2012-05-25 21:30:36,727 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/0/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000263_0/file.out not found
[New I/O server worker #1-27]2012-05-25 21:30:36,728 INFO org.apache.hadoop.mapred.ShuffleHandler: /grid/0/tmp/yarn-local/usercache/xx/appcache/application_1337019383242_29334/output/attempt_1337019383242_29334_m_000263_0/file.out not found
[New I/O server worker #1-13]2012-05-25 21:30:36,728 FATAL org.apache.hadoop.conf.Configuration: error parsing conf file: java.io.FileNotFoundException: /grid/0/Releases/conf-xx-0.23.3.1204240816-20120514-000/hadoop/datanode/core-site.xml (Too many open files)
[AsyncDispatcher event handler]2012-05-25 21:30:36,728 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
[AsyncDispatcher event handler]java.lang.RuntimeException: java.io.FileNotFoundException: /grid/0/Releases/conf-xx-0.23.3.1204240816-20120514-000/hadoop/datanode/core-site.xml (Too many open files)
        at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1723)
        at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1588)
        at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:1534)
        at org.apache.hadoop.conf.Configuration.get(Configuration.java:581)
        at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:598)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1332)
        at org.apache.hadoop.fs.AbstractFileSystem.createFileSystem(AbstractFileSystem.java:142)
        at org.apache.hadoop.fs.AbstractFileSystem.get(AbstractFileSystem.java:233)
        at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:315)
        at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:313)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
        at org.apache.hadoop.fs.FileContext.getAbstractFileSystem(FileContext.java:313)
        at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:426)
        at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:406)
        at org.apache.hadoop.fs.FileContext.getLocalFSFileContext(FileContext.java:392)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:353)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:144)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:54)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:125)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:74)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.FileNotFoundException: /grid/0/Releases/conf-xx-0.23.3.1204240816-20120514-000/hadoop/datanode/core-site.xml (Too many open files)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:106)
        at java.io.FileInputStream.<init>(FileInputStream.java:66)
        at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:70)
        at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:161)
        at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:653)
        at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:186)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:772)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:235)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:284)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:180)
        at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1637)
        ... 22 more
2012-05-25 21:30:36,729 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye..
{noformat}

It's not yet clear exactly why the NM hit its open file ulimit, which was set to 8192. Jobs with 4000+ reducers were running on the cluster, potentially several such jobs simultaneously.
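For reference, a process's descriptor limit can be inspected programmatically; a minimal sketch in Python (generic illustration, not part of the NM code):

```python
import resource

# Query the current process's file-descriptor limits.
# soft is the enforced limit (e.g. 8192 on the node in question);
# hard is the ceiling the soft limit may be raised to.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard}")
```

With 4000+ reducers each opening a shuffle connection, a soft limit of 8192 leaves little headroom once the NM's own files, logs, and jars are counted.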

I checked the ShuffleHandler, which uses Netty, and did not see any limit on the number of connections Netty will accept. I wrote a quick Python script that simply connects to the shuffle port as fast as it can, keeping all the sockets open, and shortly afterwards the Netty boss thread had crashed with this in the NM logfile:

{noformat}
Exception in thread "ShuffleHandler Netty Boss #0" java.lang.InternalError
        at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:755)
        at sun.misc.URLClassPath.getResource(URLClassPath.java:169)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:194)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at sun.misc.Launcher$ExtClassLoader.findClass(Launcher.java:229)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
        at java.util.ResourceBundle$RBClassLoader.loadClass(ResourceBundle.java:435)
        at java.util.ResourceBundle$Control.newBundle(ResourceBundle.java:2289)
        at java.util.ResourceBundle.loadBundle(ResourceBundle.java:1364)
        at java.util.ResourceBundle.findBundle(ResourceBundle.java:1328)
        at java.util.ResourceBundle.findBundle(ResourceBundle.java:1282)
        at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1224)
        at java.util.ResourceBundle.getBundle(ResourceBundle.java:705)
        at java.util.logging.Level.getLocalizedName(Level.java:223)
        at java.util.logging.SimpleFormatter.format(SimpleFormatter.java:64)
        at java.util.logging.StreamHandler.publish(StreamHandler.java:179)
        at java.util.logging.ConsoleHandler.publish(ConsoleHandler.java:88)
        at java.util.logging.Logger.log(Logger.java:481)
        at java.util.logging.Logger.doLog(Logger.java:503)
        at java.util.logging.Logger.logp(Logger.java:703)
        at org.jboss.netty.logging.JdkLogger.warn(JdkLogger.java:86)
        at org.jboss.netty.logging.InternalLoggerFactory$1.warn(InternalLoggerFactory.java:133)
        at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink$Boss.run(NioServerSocketPipelineSink.java:260)
        at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.jboss.netty.util.internal.IoWorkerRunnable.run(IoWorkerRunnable.java:46)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.util.zip.ZipException: error in opening zip file
        at java.util.zip.ZipFile.open(Native Method)
        at java.util.zip.ZipFile.<init>(ZipFile.java:127)
        at java.util.jar.JarFile.<init>(JarFile.java:135)
        at java.util.jar.JarFile.<init>(JarFile.java:72)
        at sun.misc.URLClassPath$JarLoader.getJarFile(URLClassPath.java:646)
        at sun.misc.URLClassPath$JarLoader.access$600(URLClassPath.java:540)
        at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:607)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.misc.URLClassPath$JarLoader.ensureOpen(URLClassPath.java:599)
        at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:753)
        ... 31 more
{noformat}
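The script was nothing elaborate; something along these lines reproduces the effect (a hypothetical reconstruction — the host and port below are placeholders, and the actual shuffle port depends on the cluster configuration):

```python
import socket

def open_connections(host, port, limit=20000):
    """Open TCP connections to host:port as fast as possible and keep
    every socket alive, until `limit` is reached or a connect fails."""
    socks = []
    try:
        for _ in range(limit):
            socks.append(socket.create_connection((host, port)))
    except OSError as e:
        # Either this process's own fd limit or the remote end failing.
        print(f"stopped after {len(socks)} connections: {e}")
    return socks

# Usage (placeholder host/port):
# conns = open_connections("nm-host.example.com", 8080)
```

Because nothing on the server side bounds accepted connections, each client socket pins a file descriptor in the NM until one side gives up.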

So in this case Netty probably got an EMFILE error when trying to accept a new connection after hitting the descriptor limit, and when it tried to log a warning it ended up invoking the classloader, which in turn was unable to get a new file descriptor to open the jar file.

Given the large number of reducers running, in theory they could all hit a single nodemanager at nearly the same time while waiting for the last map to finish. If for some reason the reducers' shuffle connections lingered long enough, enough reducers in a short enough time span would exhaust the nodemanager's file descriptors. I can't tell from the logs whether that is exactly what happened in this case, but it is concerning that there appears to be no limit on the number of connections the shuffle handler will accept before we hit hard OS limits that can crash the NM in various ways.
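One mitigation would be for the shuffle handler to cap concurrent accepted connections well below the process descriptor limit and shed load past that point. The idea, sketched generically in Python rather than as Netty code (the class and method names here are invented for illustration, not taken from the ShuffleHandler):

```python
import threading

class ConnectionLimiter:
    """Track concurrent connections and refuse new ones past a cap,
    so the server sheds load instead of exhausting file descriptors."""

    def __init__(self, max_connections):
        self.max_connections = max_connections
        self.active = 0
        self.lock = threading.Lock()

    def try_acquire(self):
        # Called when a connection arrives; False means "close it now".
        with self.lock:
            if self.active >= self.max_connections:
                return False
            self.active += 1
            return True

    def release(self):
        # Called when a connection closes.
        with self.lock:
            self.active = max(0, self.active - 1)
```

In the Netty-based ShuffleHandler the equivalent would presumably be a channel handler that counts open channels and immediately closes any channel accepted beyond a configured maximum.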
                
> NodeManager crashed after running out of file descriptors
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-4298
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4298
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.3
>            Reporter: Jason Lowe
>
> A node on one of our clusters fell over because it ran out of open file descriptors.  Log details with stack traceback to follow.
