[
https://issues.apache.org/jira/browse/TEZ-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15493567#comment-15493567
]
Hitesh Shah commented on TEZ-3435:
----------------------------------
Thanks for reporting the issue, [~mprim]. Would you mind letting us know whether
you are running hadoop 2.6.0 or a different version?
To provide more context on this issue:
- the tarball configured in tez.lib.uris is downloaded to one of the disks in
the yarn local dirs on the nodemanager
- it is uncompressed into a dir ( located on one of the yarn local dirs ), and
a symlink pointing to that dir is created in the container's working dir
- the container's working dir is added to tez's classpath ( see the sketch
below ).
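Here is a rough, illustrative sketch of this wiring (not actual Tez code; the
FileSystem/Path inputs and the symlink name "tezlib" are made-up examples) of
how an ARCHIVE-type local resource such as the tez.lib.uris tarball is declared
for a container via the YARN API:
{code}
import java.io.IOException;
import java.util.Collections;
import java.util.Map;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.LocalResourceType;
import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
import org.apache.hadoop.yarn.util.ConverterUtils;

public class TezLibLocalResourceSketch {
  /**
   * Builds the local-resource map entry for a tez.lib.uris style tarball.
   * The NM fetches the archive into one of its yarn.nodemanager.local-dirs
   * (e.g. .../filecache/...), unpacks it there, and creates a symlink named
   * after the map key in the container's working dir, which the launch
   * command can then put on the classpath.
   */
  static Map<String, LocalResource> tezLibResource(FileSystem fs, Path tarball)
      throws IOException {
    FileStatus st = fs.getFileStatus(tarball);
    LocalResource lr = LocalResource.newInstance(
        ConverterUtils.getYarnUrlFromPath(st.getPath()), // where the NM fetches from
        LocalResourceType.ARCHIVE,                       // unpacked after download
        LocalResourceVisibility.PUBLIC,                  // shared NM filecache
        st.getLen(), st.getModificationTime());
    // The map key becomes the symlink name in the container's working dir.
    return Collections.singletonMap("tezlib", lr);
  }
}
{code}
In the stack trace reported below, it is exactly such a localized jar under
/volumes/disk3/yarn/nm/filecache/ that could no longer be read.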
The problem in this scenario seems to be that even though the YARN NM
blacklisted the disk, it still ends up using the local resources that were
already downloaded to that disk when launching a container, instead of
re-downloading them onto a healthy disk and launching the container with those.
\cc [~wangda] [~djp] [~jlowe] in case they are aware of any YARN jiras that
call out this issue or if this has already been fixed in a later release.
> WebUIService thread tries to use blacklisted disk, dies, and kills AM
> ---------------------------------------------------------------------
>
> Key: TEZ-3435
> URL: https://issues.apache.org/jira/browse/TEZ-3435
> Project: Apache Tez
> Issue Type: Bug
> Components: UI
> Affects Versions: 0.8.4
> Reporter: Michael Prim
> Priority: Critical
>
> We recently hit an issue where certain TEZ jobs died when scheduled on a node
> that had a broken disk. The disk was already marked as broken and excluded by
> the YARN node manager. Other applications worked fine on that node; only TEZ
> jobs died.
> The errors were ClassNotFound exceptions for basic hadoop classes, which
> should be available everywhere. After some investigation we found out that
> the WebUIService thread, spawned by the AM, tries to utilize that broken disk.
> See the stack trace below; disk3 was excluded by the node manager.
> {code}
> [WARN] [ServiceThread:org.apache.tez.dag.app.web.WebUIService] |mortbay.log|: Failed to read file: /volumes/disk3/yarn/nm/filecache/9017/hadoop-mapreduce-client-core-2.6.0.jar
> java.util.zip.ZipException: error in opening zip file
>     at java.util.zip.ZipFile.open(Native Method)
>     at java.util.zip.ZipFile.<init>(ZipFile.java:219)
>     at java.util.zip.ZipFile.<init>(ZipFile.java:149)
>     at java.util.jar.JarFile.<init>(JarFile.java:166)
>     at java.util.jar.JarFile.<init>(JarFile.java:130)
>     at org.mortbay.jetty.webapp.TagLibConfiguration.configureWebApp(TagLibConfiguration.java:174)
>     at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1279)
>     at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
>     at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
>     at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>     at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
>     at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
>     at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>     at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
>     at org.mortbay.jetty.Server.doStart(Server.java:224)
>     at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>     at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:900)
>     at org.apache.hadoop.yarn.webapp.WebApps$Builder.start(WebApps.java:273)
>     at org.apache.tez.dag.app.web.WebUIService.serviceStart(WebUIService.java:94)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>     at org.apache.tez.dag.app.DAGAppMaster$ServiceWithDependency.start(DAGAppMaster.java:1827)
>     at org.apache.tez.dag.app.DAGAppMaster$ServiceThread.run(DAGAppMaster.java:1848)
> {code}
> This led to the ClassNotFound exceptions and killed the AM.
> Interestingly enough, the DAGAppMaster was aware of this broken disk and did
> exclude it from the localDirs, which contain only the remaining disks of the
> node.
> {code}
> [INFO] [main] |app.DAGAppMaster|: Creating DAGAppMaster for
> applicationId=application_1472223062609_42648, attemptNum=1,
> AMContainerId=container_1472223062609_42648_01_000001, jvmPid=2538,
> userFromEnv=muhammad, cliSessionOption=true,
> pwd=/volumes/disk1/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648/container_1472223062609_42648_01_000001,
> localDirs=/volumes/disk1/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk10/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk4/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk5/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk6/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk7/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk8/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk9/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,
> logDirs=/var/log/hadoop-yarn/container/application_1472223062609_42648/container_1472223062609_42648_01_000001
> {code}
> This is actually quite an issue, as in a huge data center you always have
> some broken disks, and by chance your AM may be scheduled on one of these
> nodes.
> Summary: From my point of view it looks as if the WebUIService thread somehow
> does not properly take into account the local directories that are excluded
> by the node manager.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)