Michael Prim created TEZ-3435:
---------------------------------
Summary: WebUIService thread tries to use blacklisted disk, dies, and kills AM
Key: TEZ-3435
URL: https://issues.apache.org/jira/browse/TEZ-3435
Project: Apache Tez
Issue Type: Bug
Components: UI
Affects Versions: 0.8.4
Reporter: Michael Prim
Priority: Critical
We recently hit an issue where certain Tez jobs died when scheduled on a node
with a broken disk. The disk was already marked as broken and excluded by the
YARN NodeManager; other applications worked fine on that node, only Tez jobs
died.
The errors were ClassNotFoundExceptions for basic Hadoop classes, which should
be available everywhere. After some investigation we found that the
WebUIService thread, spawned by the AM, tries to use the broken disk. See the
stack trace below; disk3 was excluded by the NodeManager.
{code}
[WARN] [ServiceThread:org.apache.tez.dag.app.web.WebUIService] |mortbay.log|: Failed to read file: /volumes/disk3/yarn/nm/filecache/9017/hadoop-mapreduce-client-core-2.6.0.jar
java.util.zip.ZipException: error in opening zip file
    at java.util.zip.ZipFile.open(Native Method)
    at java.util.zip.ZipFile.<init>(ZipFile.java:219)
    at java.util.zip.ZipFile.<init>(ZipFile.java:149)
    at java.util.jar.JarFile.<init>(JarFile.java:166)
    at java.util.jar.JarFile.<init>(JarFile.java:130)
    at org.mortbay.jetty.webapp.TagLibConfiguration.configureWebApp(TagLibConfiguration.java:174)
    at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1279)
    at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
    at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
    at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
    at org.mortbay.jetty.Server.doStart(Server.java:224)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:900)
    at org.apache.hadoop.yarn.webapp.WebApps$Builder.start(WebApps.java:273)
    at org.apache.tez.dag.app.web.WebUIService.serviceStart(WebUIService.java:94)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
    at org.apache.tez.dag.app.DAGAppMaster$ServiceWithDependency.start(DAGAppMaster.java:1827)
    at org.apache.tez.dag.app.DAGAppMaster$ServiceThread.run(DAGAppMaster.java:1848)
{code}
This led to the ClassNotFoundExceptions and killed the AM. Interestingly, the
DAGAppMaster was aware of the broken disk and excluded it from localDirs, which
contains only the remaining healthy disks of the node.
{code}
[INFO] [main] |app.DAGAppMaster|: Creating DAGAppMaster for
applicationId=application_1472223062609_42648, attemptNum=1,
AMContainerId=container_1472223062609_42648_01_000001, jvmPid=2538,
userFromEnv=muhammad, cliSessionOption=true,
pwd=/volumes/disk1/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648/container_1472223062609_42648_01_000001,
localDirs=/volumes/disk1/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk10/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk4/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk5/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk6/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk7/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk8/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk9/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,
logDirs=/var/log/hadoop-yarn/container/application_1472223062609_42648/container_1472223062609_42648_01_000001
{code}
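The localDirs above are handed to the container through YARN's LOCAL_DIRS environment variable, which lists only the healthy directories as comma-separated paths. As a minimal sketch (the class name `LocalDirs` is hypothetical, not from the Tez codebase), a component that derives its paths from this variable, rather than constructing them independently, automatically respects the NodeManager's disk exclusion:

```java
// Hypothetical sketch: parse YARN's LOCAL_DIRS environment variable,
// which contains only the directories the NodeManager considers healthy.
public class LocalDirs {
    public static String[] parse(String localDirsEnv) {
        // A missing or empty variable yields no usable directories.
        if (localDirsEnv == null || localDirsEnv.isEmpty()) {
            return new String[0];
        }
        // YARN separates the healthy local directories with commas.
        return localDirsEnv.split(",");
    }

    public static void main(String[] args) {
        for (String dir : parse(System.getenv("LOCAL_DIRS"))) {
            System.out.println(dir);
        }
    }
}
```

A component that instead rebuilds paths from a cached or configured disk list can end up pointing at a directory the NodeManager has already blacklisted, which appears to be what happens here.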
This is quite an issue: in a large data center there are always some broken
disks, and by chance your AM may be scheduled on one of these nodes.
Summary: From my point of view, the WebUIService thread does not properly take
into account the local directories that the NodeManager has excluded.
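One defensive approach (a sketch only, not a proposed patch; the class name `UsableDirFilter` is hypothetical) would be to filter candidate directories down to those that actually exist and are readable before a service such as the web UI touches them, similar in spirit to the NodeManager's own disk health checks:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: drop directories that live on broken or missing
// disks before handing them to a service, mirroring the disk exclusion
// the NodeManager already performs for localDirs.
public class UsableDirFilter {
    public static List<String> usableDirs(List<String> candidates) {
        List<String> usable = new ArrayList<>();
        for (String dir : candidates) {
            File f = new File(dir);
            // Paths on a failed disk typically either no longer exist
            // or can no longer be read.
            if (f.isDirectory() && f.canRead()) {
                usable.add(dir);
            }
        }
        return usable;
    }

    public static void main(String[] args) {
        // The JVM temp dir should pass; the second, simulated broken
        // path should be filtered out.
        List<String> dirs = Arrays.asList(
                System.getProperty("java.io.tmpdir"),
                "/volumes/disk3-simulated-broken/yarn/nm/filecache");
        System.out.println(usableDirs(dirs));
    }
}
```

This only guards against disks that are visibly unreadable at check time; the real fix would be for WebUIService to consume the already-filtered localDirs rather than re-checking disks itself.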
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)