[
https://issues.apache.org/jira/browse/TEZ-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15493600#comment-15493600
]
Michael Prim commented on TEZ-3435:
-----------------------------------
[~hitesh] we are running Hadoop 2.6.0, but as part of a Cloudera installation,
so there are some patches on top compared to a vanilla 2.6.0 Hadoop
installation.
The release notes can be found at:
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.7.1.releasenotes.html
and this is the tarball of code
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.7.1.tar.gz
> WebUIService thread tries to use blacklisted disk, dies, and kills AM
> ---------------------------------------------------------------------
>
> Key: TEZ-3435
> URL: https://issues.apache.org/jira/browse/TEZ-3435
> Project: Apache Tez
> Issue Type: Bug
> Components: UI
> Affects Versions: 0.8.4
> Reporter: Michael Prim
> Priority: Critical
>
> We recently hit an issue where certain Tez jobs died when scheduled on a node
> with a broken disk. The disk was already marked as broken and excluded by the
> YARN NodeManager; other applications worked fine on that node, only Tez jobs
> died.
> The errors were ClassNotFoundExceptions for basic Hadoop classes that should
> be available everywhere. After some investigation we found out that the
> WebUIService thread, spawned by the AM, tries to use that broken disk.
> See the stack trace below; disk3 was excluded by the NodeManager.
> {code}
> [WARN] [ServiceThread:org.apache.tez.dag.app.web.WebUIService] |mortbay.log|: Failed to read file: /volumes/disk3/yarn/nm/filecache/9017/hadoop-mapreduce-client-core-2.6.0.jar
> java.util.zip.ZipException: error in opening zip file
> at java.util.zip.ZipFile.open(Native Method)
> at java.util.zip.ZipFile.<init>(ZipFile.java:219)
> at java.util.zip.ZipFile.<init>(ZipFile.java:149)
> at java.util.jar.JarFile.<init>(JarFile.java:166)
> at java.util.jar.JarFile.<init>(JarFile.java:130)
> at org.mortbay.jetty.webapp.TagLibConfiguration.configureWebApp(TagLibConfiguration.java:174)
> at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1279)
> at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
> at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
> at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
> at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
> at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
> at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
> at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
> at org.mortbay.jetty.Server.doStart(Server.java:224)
> at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
> at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:900)
> at org.apache.hadoop.yarn.webapp.WebApps$Builder.start(WebApps.java:273)
> at org.apache.tez.dag.app.web.WebUIService.serviceStart(WebUIService.java:94)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at org.apache.tez.dag.app.DAGAppMaster$ServiceWithDependency.start(DAGAppMaster.java:1827)
> at org.apache.tez.dag.app.DAGAppMaster$ServiceThread.run(DAGAppMaster.java:1848)
> {code}
> This led to the ClassNotFoundExceptions and killed the AM.
> Interestingly enough, the DAGAppMaster was aware of the broken disk and did
> exclude it from its localDirs, which contain only the remaining disks of the
> node.
> {code}
> [INFO] [main] |app.DAGAppMaster|: Creating DAGAppMaster for
> applicationId=application_1472223062609_42648, attemptNum=1,
> AMContainerId=container_1472223062609_42648_01_000001, jvmPid=2538,
> userFromEnv=muhammad, cliSessionOption=true,
> pwd=/volumes/disk1/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648/container_1472223062609_42648_01_000001,
> localDirs=/volumes/disk1/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk10/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk4/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk5/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk6/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk7/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk8/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk9/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,
> logDirs=/var/log/hadoop-yarn/container/application_1472223062609_42648/container_1472223062609_42648_01_000001
> {code}
> This is actually quite an issue, as in a huge data center you always have
> some broken disks, and by chance your AM may be scheduled on one of these
> nodes.
> Summary: From my point of view it looks as if the WebUIService thread somehow
> does not properly take into account the local directories that are excluded
> by the node manager.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)