[ https://issues.apache.org/jira/browse/TEZ-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hitesh Shah resolved TEZ-3435.
------------------------------
    Resolution: Not A Bug

Resolving this based on the discussion. Thanks for reporting the issue [~mprim].

> WebUIService thread tries to use blacklisted disk, dies, and kills AM
> ---------------------------------------------------------------------
>
>                 Key: TEZ-3435
>                 URL: https://issues.apache.org/jira/browse/TEZ-3435
>             Project: Apache Tez
>          Issue Type: Bug
>          Components: UI
>    Affects Versions: 0.8.4
>            Reporter: Michael Prim
>            Priority: Critical
>
> We recently hit an issue in which certain Tez jobs died when scheduled on a
> node that had a broken disk. The disk was already marked as broken and
> excluded by the YARN node manager. Other applications worked fine on that
> node; only Tez jobs died.
> The errors were ClassNotFound exceptions for basic Hadoop classes, which
> should be available everywhere. After some investigation we found out that
> the WebUIService thread, spawned by the AM, tries to use that broken disk.
> See the stack trace below; disk3 was excluded by the node manager.
> {code}
> [WARN] [ServiceThread:org.apache.tez.dag.app.web.WebUIService] |mortbay.log|: Failed to read file: /volumes/disk3/yarn/nm/filecache/9017/hadoop-mapreduce-client-core-2.6.0.jar
> java.util.zip.ZipException: error in opening zip file
>       at java.util.zip.ZipFile.open(Native Method)
>       at java.util.zip.ZipFile.<init>(ZipFile.java:219)
>       at java.util.zip.ZipFile.<init>(ZipFile.java:149)
>       at java.util.jar.JarFile.<init>(JarFile.java:166)
>       at java.util.jar.JarFile.<init>(JarFile.java:130)
>       at org.mortbay.jetty.webapp.TagLibConfiguration.configureWebApp(TagLibConfiguration.java:174)
>       at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1279)
>       at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
>       at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
>       at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>       at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
>       at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
>       at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>       at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
>       at org.mortbay.jetty.Server.doStart(Server.java:224)
>       at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
>       at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:900)
>       at org.apache.hadoop.yarn.webapp.WebApps$Builder.start(WebApps.java:273)
>       at org.apache.tez.dag.app.web.WebUIService.serviceStart(WebUIService.java:94)
>       at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>       at org.apache.tez.dag.app.DAGAppMaster$ServiceWithDependency.start(DAGAppMaster.java:1827)
>       at org.apache.tez.dag.app.DAGAppMaster$ServiceThread.run(DAGAppMaster.java:1848)
> {code}
> This led to the ClassNotFound exceptions and killed the AM.
> Interestingly enough, the DAGAppMaster was aware of this broken disk and
> excluded it from localDirs, which contains only the remaining disks of the
> node.
> {code}
> [INFO] [main] |app.DAGAppMaster|: Creating DAGAppMaster for applicationId=application_1472223062609_42648, attemptNum=1, AMContainerId=container_1472223062609_42648_01_000001, jvmPid=2538, userFromEnv=muhammad, cliSessionOption=true, pwd=/volumes/disk1/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648/container_1472223062609_42648_01_000001, localDirs=/volumes/disk1/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk10/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk4/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk5/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk6/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk7/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk8/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648,/volumes/disk9/yarn/nm/usercache/muhammad/appcache/application_1472223062609_42648, logDirs=/var/log/hadoop-yarn/container/application_1472223062609_42648/container_1472223062609_42648_01_000001
> {code}
> This is quite an issue: in a large data center there are always some broken
> disks, and by chance your AM may be scheduled on one of these nodes.
> Summary: From my point of view, it looks as if the WebUIService thread does
> not properly take into account the local directories that are excluded by
> the node manager.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)