[ 
https://issues.apache.org/jira/browse/TEZ-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102478#comment-15102478
 ] 

Jason Lowe commented on TEZ-3036:
---------------------------------

My apologies, I misread the heap dump info.  A NoSuchMethodError was being 
propagated up rather than a NoSuchMethodException (which is the cause of the 
error).  The hang occurs with no notification when an Error is thrown rather 
than an Exception.  If an Exception is thrown then it will be logged, but I 
think the AM will still hang.
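
To illustrate why an Error can slip past logging that an Exception would hit, here is a minimal sketch.  It is not the Tez code: the class and method names are hypothetical, and it only assumes that the startup path catches Exception rather than Throwable.

```java
public class ErrorVsExceptionDemo {
    // Hypothetical stand-in for a service start that fails at link time.
    static void startService() {
        throw new NoSuchMethodError("hypothetical missing method");
    }

    // Handler that only catches Exception: an Error is not an Exception,
    // so it propagates out of this method without being logged here.
    static String startCatchingException() {
        try {
            startService();
            return "started";
        } catch (Exception e) {
            return "logged: " + e.getClass().getSimpleName();
        }
    }

    // Handler that catches Throwable: both Errors and Exceptions are seen.
    static String startCatchingThrowable() {
        try {
            startService();
            return "started";
        } catch (Throwable t) {
            return "logged: " + t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(startCatchingThrowable());
        try {
            startCatchingException();
        } catch (Error e) {
            System.out.println("escaped catch (Exception): "
                + e.getClass().getSimpleName());
        }
    }
}
```

Whether catching Throwable is the right fix for Tez is a separate question; the sketch only shows why nothing would be logged in the Error case.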

Here are the relevant portions of the stack trace when this occurs:
{noformat}
"ServiceThread:org.apache.tez.dag.app.rm.TaskSchedulerEventHandler" #34 prio=5 os_prio=0 tid=0x00007f2be161c000 nid=0x65a4 in Object.wait() [0x00007f2bb87b5000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000f55ddd58> (a org.apache.tez.dag.app.DAGAppMaster$ServiceWithDependency)
        at org.apache.tez.dag.app.DAGAppMaster$ServiceWithDependency.start(DAGAppMaster.java:1655)
        - locked <0x00000000f55ddd58> (a org.apache.tez.dag.app.DAGAppMaster$ServiceWithDependency)
        at org.apache.tez.dag.app.DAGAppMaster$ServiceThread.run(DAGAppMaster.java:1693)

"main" #1 prio=5 os_prio=0 tid=0x00007f2be0019800 nid=0x653b in Object.wait() [0x00007f2be57e1000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000f6857998> (a org.apache.tez.dag.app.DAGAppMaster$ServiceThread)
        at java.lang.Thread.join(Thread.java:1245)
        - locked <0x00000000f6857998> (a org.apache.tez.dag.app.DAGAppMaster$ServiceThread)
        at java.lang.Thread.join(Thread.java:1319)
        at org.apache.tez.dag.app.DAGAppMaster.startServices(DAGAppMaster.java:1730)
        at org.apache.tez.dag.app.DAGAppMaster.serviceStart(DAGAppMaster.java:1799)
        - locked <0x00000000a0326928> (a org.apache.tez.dag.app.DAGAppMaster)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        - locked <0x00000000a0326ae8> (a java.lang.Object)
        at org.apache.tez.dag.app.DAGAppMaster$6.run(DAGAppMaster.java:2369)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
        at org.apache.tez.dag.app.DAGAppMaster.initAndStartAppMaster(DAGAppMaster.java:2365)
        at org.apache.tez.dag.app.DAGAppMaster.main(DAGAppMaster.java:2173)
{noformat}
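
The shape of the trace (one thread waiting on a dependency, main joined on that thread) can be reproduced with a small sketch.  This is an assumption-laden illustration, not the DAGAppMaster implementation: Gate, markFailed and run are hypothetical names, and the wait is bounded to 500 ms so that the demo terminates instead of hanging.

```java
import java.util.concurrent.atomic.AtomicReference;

public class DependencyWaitDemo {
    // Hypothetical dependency gate; not Tez's ServiceWithDependency.
    static final class Gate {
        private boolean ready;
        private Throwable failure;

        synchronized void markReady() { ready = true; notifyAll(); }
        synchronized void markFailed(Throwable t) { failure = t; notifyAll(); }

        // Waits until ready or failed; returns false if the timeout expires.
        synchronized boolean await(long timeoutMs) throws InterruptedException {
            long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
            while (!ready && failure == null) {
                long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMs <= 0) {
                    return false;
                }
                wait(remainingMs);
            }
            if (failure != null) {
                throw new IllegalStateException("dependency failed", failure);
            }
            return true;
        }
    }

    // Runs a failing dependency and a dependent that waits on it.
    static String run(boolean reportFailure) throws InterruptedException {
        Gate gate = new Gate();
        AtomicReference<String> outcome = new AtomicReference<>("no result");
        Thread dependency = new Thread(() -> {
            try {
                throw new NoSuchMethodError("hypothetical missing method");
            } catch (Throwable t) {
                if (reportFailure) {
                    gate.markFailed(t); // wake waiters with the cause
                }
                // otherwise the failure is dropped and nobody is notified
            }
        });
        Thread dependent = new Thread(() -> {
            try {
                outcome.set(gate.await(500) ? "started" : "timed out waiting");
            } catch (IllegalStateException e) {
                outcome.set("failed fast: "
                    + e.getCause().getClass().getSimpleName());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        dependency.start();
        dependent.start();
        dependency.join();
        dependent.join(); // like main in the trace, blocked on the waiter
        return outcome.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(false));
        System.out.println(run(true));
    }
}
```

Without the bounded wait, the first case would block indefinitely, which matches the symptom described here; recording the failure and notifying waiters lets the dependent stop promptly with the cause attached.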


> Tez AM can hang on startup with no indication of error
> ------------------------------------------------------
>
>                 Key: TEZ-3036
>                 URL: https://issues.apache.org/jira/browse/TEZ-3036
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Jason Lowe
>            Priority: Critical
>
> I've seen a couple of instances where the Tez AM fails to complete the 
> startup sequence.  It never gets around to registering with the 
> ResourceManager, so the RM eventually times out the attempt and starts 
> another.  The subsequent attempts do the same.  There are no indications in 
> the logs that anything is wrong; it just seems to get stuck during startup 
> and is killed a bit over 10 minutes later.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
