[
https://issues.apache.org/jira/browse/TEZ-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15493600#comment-15493600
]
Michael Prim commented on TEZ-3435:
-----------------------------------
[~hitesh] we are running Hadoop 2.6.0, but as part of a Cloudera installation,
so there are some patches on top compared to a vanilla 2.6.0 Hadoop
installation.
The release notes can be found at:
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.7.1.releasenotes.html
and this is the tarball of code
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.7.1.tar.gz
> WebUIService thread tries to use blacklisted disk, dies, and kills AM
> ---------------------------------------------------------------------
>
> Key: TEZ-3435
> URL: https://issues.apache.org/jira/browse/TEZ-3435
> Project: Apache Tez
> Issue Type: Bug
> Components: UI
> Affects Versions: 0.8.4
> Reporter: Michael Prim
> Priority: Critical
>
> We recently hit an issue where certain Tez jobs died when scheduled on a node
> with a broken disk. The disk was already marked as broken and excluded by the
> YARN NodeManager; other applications worked fine on that node, only Tez jobs
> died.
> The errors were ClassNotFoundExceptions for basic Hadoop classes that should
> be available everywhere. After some investigation we found out that the
> WebUIService thread, spawned by the AM, tries to use that broken disk.
> See the stack trace below; disk3 was excluded by the NodeManager.
> {code}
> [WARN] [ServiceThread:org.apache.tez.dag.app.web.WebUIService] |mortbay.log|: Failed to read file: /volumes/disk3/yarn/nm/filecache/9017/hadoop-mapreduce-client-core-2.6.0.jar
> java.util.zip.ZipException: error in opening zip file
> at java.util.zip.ZipFile.open(Native Method)
> at java.util.zip.ZipFile.<init>(ZipFile.java:219)
> at java.util.zip.ZipFile.<init>(ZipFile.java:149)
> at java.util.jar.JarFile.<init>(JarFile.java:166)
> at java.util.jar.JarFile.<init>(JarFile.java:130)
> at org.mortbay.jetty.webapp.TagLibConfiguration.configureWebApp(TagLibConfiguration.java:174)
> at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1279)
> at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
> at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
> at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
> at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
> at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
> at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
> at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
> at org.mortbay.jetty.Server.doStart(Server.java:224)
> at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
> at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:900)
> at org.apache.hadoop.yarn.webapp.WebApps$Builder.start(WebApps.java:273)
> at org.apache.tez.dag.app.web.WebUIService.serviceStart(WebUIService.java:94)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at org.apache.tez.dag.app.DAGAppMaster$ServiceWithDependency.start(DAGAppMaster.java:1827)
> at org.apache.tez.dag.app.DAGAppMaster$ServiceThread.run(DAGAppMaster.java:1848)
> {code}
> This led to the ClassNotFoundExceptions and killed the AM.
> Interestingly enough, the DAGAppMaster was aware of the broken disk and did
> exclude it from its localDirs, which contain only the remaining disks of the
> node.
> {code}
> [INFO] [main] |app.DAGAppMaster|: Creating DAGAppMaster for
> applicationId=application_1472223062609_42648, attemptNum=1,
> AMContainerId=container_1472223062609_42648_01_000001, jvmPid=2538,
> userFromEnv=muhammad, cliSessionOption=true,
> pwd=/volumes/disk1/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648/container_1472223062609_42648_01_000001,
> localDirs=/volumes/disk1/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk10/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk4/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk5/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk6/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk7/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk8/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk9/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,
> logDirs=/var/log/hadoop-yarn/container/application_1472223062609_42648/container_1472223062609_42648_01_000001
> {code}
> This is actually quite an issue, as in a huge data center you always have
> some broken disks, and by chance your AM may be scheduled on one of these
> nodes.
> Summary: From my point of view it looks as if the WebUIService thread somehow
> does not properly take into account the local directories that are excluded
> by the node manager.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)