[ 
https://issues.apache.org/jira/browse/TEZ-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15097107#comment-15097107
 ] 

Jason Lowe commented on TEZ-3036:
---------------------------------

In this particular instance the hang occurred because TaskSchedulerEventHandler 
was still waiting on its dependencies to finish before it could start, but one 
of the dependencies failed without any log message, backtrace, or any 
indication there was an issue.  There was a ClassNotFoundException thrown 
during the WebUIService startup due to a conflict with one of the jars provided 
by the user.  When the error bubbled up through AbstractService it looks like 
the state change listener callbacks were skipped.  The exception was caught by 
ServiceThread but then silently stored away into a local, suppressing the error.
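
To illustrate the kind of change that would have surfaced the error, here is a
minimal sketch of a start thread that logs a failure at the point it is caught
instead of only stashing it. StartThread, its constructor, and getError() are
hypothetical names for illustration; this is not the actual Tez ServiceThread
code.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for a thread that starts one service.
// Illustrative only; not the Tez ServiceThread source.
class StartThread extends Thread {
    private static final Logger LOG = Logger.getLogger(StartThread.class.getName());

    private final Runnable startAction;
    private volatile Throwable error;

    StartThread(String name, Runnable startAction) {
        super(name);
        this.startAction = startAction;
    }

    @Override
    public void run() {
        try {
            startAction.run();
        } catch (Throwable t) {
            // Log immediately, so the failure is visible even if no one
            // ever manages to join this thread and inspect the stored error.
            LOG.log(Level.SEVERE, "Service start failed in thread " + getName(), t);
            error = t;
        }
    }

    // Returns the failure captured during run(), or null if start succeeded.
    Throwable getError() {
        return error;
    }
}
```

With this shape the stored error is still available to whoever joins the
thread, but the log no longer depends on the join ever completing.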

I think ServiceWithDependency relies on the fact that stateChanged will be 
called when one of the dependencies fails.  Since that wasn't called, 
TaskSchedulerEventHandler hung around waiting for the dependencies to finish 
without realizing one of them had already failed.  The error stored in 
ServiceThread was never shown because it isn't logged until all ServiceThreads 
have been joined.  Since one of the threads was mistakenly waiting on an 
already failed service, the join never completes and the error will never be 
logged.  End result is a hang with no indication of any error.
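
The waiting side can be sketched the same way. Below is a minimal example,
under the assumption of a single dependency, of a gate that wakes the waiter
on failure as well as on success and that supports a bounded wait.
DependencyGate and its methods are hypothetical names, not Tez or Hadoop APIs.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical gate for one dependency; illustrative only.
class DependencyGate {
    private final CountDownLatch done = new CountDownLatch(1);
    private volatile Throwable failure;

    // Called when the dependency has started successfully.
    void dependencyStarted() {
        done.countDown();
    }

    // Called when the dependency fails; the waiter is released here too,
    // which is the notification that was missing in the hang described above.
    void dependencyFailed(Throwable cause) {
        failure = cause;
        done.countDown();
    }

    // Returns true if the dependency started, false if the wait timed out.
    // Throws IllegalStateException if the dependency reported a failure.
    boolean awaitStart(long timeout, TimeUnit unit) throws InterruptedException {
        if (!done.await(timeout, unit)) {
            return false;
        }
        if (failure != null) {
            throw new IllegalStateException("Dependency failed to start", failure);
        }
        return true;
    }
}
```

If neither callback ever fires, awaitStart returns false after the timeout
rather than blocking forever, which at least gives the caller a chance to log
that it is still waiting.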

> Tez AM can hang on startup with no indication of error
> ------------------------------------------------------
>
>                 Key: TEZ-3036
>                 URL: https://issues.apache.org/jira/browse/TEZ-3036
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Jason Lowe
>            Priority: Critical
>
> I've seen a couple of instances where the Tez AM fails to complete the 
> startup sequence.  It never gets around to registering with the 
> ResourceManager, so the RM eventually times out the attempt and starts 
> another.  The subsequent attempts do the same.  There are no indications in 
> the logs that anything is wrong; rather, it just seems to get stuck during 
> startup and is then killed a bit over 10 minutes later.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
