[ 
https://issues.apache.org/jira/browse/TEZ-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15109283#comment-15109283
 ] 

Siddharth Seth commented on TEZ-3036:
-------------------------------------

Is this mainly caused by the way AbstractService etc. handles Exceptions? It 
just throws them back to the invoker without any kind of state notification. 
The lack of notification is consistent with the method name, since the state 
didn't actually change. Maybe YARN AbstractServices need an ERROR state to 
handle this better?

The change invokes stateChanged in case of an error. One potential issue here 
is that AbstractService catches only Exceptions, so 
"dependency.getFailureCause" would end up being null in case of an Error. 
However, the ServiceThread would have caught the Throwable and invoked 
stateChanged. Is the intent to depend on the UncaughtExceptionHandler for this?
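
To illustrate the concern, here is a minimal, self-contained sketch. It does 
not use the real YARN AbstractService; SketchService, Body and recordedCause 
are hypothetical names, and the only assumption carried over is that the 
failure cause is recorded inside a catch (Exception) block.

```java
public class FailureCauseSketch {

    /** Hypothetical unit of startup work that may throw anything. */
    public interface Body {
        void run() throws Exception;
    }

    /** Hypothetical stand-in for a service; not the real AbstractService. */
    static class SketchService {
        private final Body body;
        private Throwable failureCause; // stays null unless the catch block runs

        SketchService(Body body) {
            this.body = body;
        }

        void start() throws Exception {
            try {
                body.run();
            } catch (Exception e) {
                // Only Exceptions are recorded; an Error bypasses this block.
                failureCause = e;
                throw e;
            }
        }

        Throwable getFailureCause() {
            return failureCause;
        }
    }

    /** Starts a sketch service and returns whatever failure cause it recorded. */
    public static Throwable recordedCause(Body body) {
        SketchService service = new SketchService(body);
        try {
            service.start();
        } catch (Throwable t) {
            // Swallowed on purpose: only the recorded cause is of interest here.
        }
        return service.getFailureCause();
    }

    public static void main(String[] args) {
        Throwable fromException = recordedCause(() -> {
            throw new IllegalStateException("simulated failure");
        });
        Throwable fromError = recordedCause(() -> {
            throw new Error("simulated error");
        });
        System.out.println("cause after Exception: " + fromException);
        System.out.println("cause after Error: " + fromError);
    }
}
```

In this sketch the Exception case records a cause, while the Error case 
leaves it null, so a listener that relies only on the recorded cause would 
have nothing to report.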

Not related to this patch: ServiceThread.error needs to be volatile.
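
A minimal sketch of what that would look like, assuming the field is written 
by the service thread and read from another thread. WorkerThread and 
runAndCollectError are hypothetical names; this is not the actual Tez 
ServiceThread code.

```java
public class VolatileErrorSketch {

    /** Hypothetical worker thread; not the real Tez ServiceThread. */
    static class WorkerThread extends Thread {
        private final Runnable work;
        // volatile: a write from this thread is visible to other threads
        // that read the field without any additional synchronization.
        private volatile Throwable error;

        WorkerThread(Runnable work) {
            this.work = work;
        }

        @Override
        public void run() {
            try {
                work.run();
            } catch (Throwable t) {
                error = t; // written on the worker thread
            }
        }

        Throwable getError() {
            return error; // typically read from a different thread
        }
    }

    /** Runs the work on a worker thread and returns the Throwable it caught, if any. */
    public static Throwable runAndCollectError(Runnable work) {
        WorkerThread worker = new WorkerThread(work);
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for worker", e);
        }
        return worker.getError();
    }

    public static void main(String[] args) {
        Throwable caught = runAndCollectError(() -> {
            throw new IllegalStateException("simulated failure");
        });
        System.out.println("worker error: " + caught);
    }
}
```

The join() call here already makes the write visible to the caller. The 
volatile modifier matters for readers that poll the field while the thread is 
still running, or that are notified through some unsynchronized path.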

> Tez AM can hang on startup with no indication of error
> ------------------------------------------------------
>
>                 Key: TEZ-3036
>                 URL: https://issues.apache.org/jira/browse/TEZ-3036
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: TEZ-3036.001.patch
>
>
> I've seen a couple of instances where the Tez AM fails to complete the 
> startup sequence.  It never gets around to registering with the 
> ResourceManager, so the RM eventually times out the attempt and starts 
> another.  The subsequent attempts do the same.  There are no indications in 
> the logs that anything is wrong; rather, it just seems to get stuck during 
> startup and is killed a bit over 10 minutes later.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
