[ 
https://issues.apache.org/jira/browse/TEZ-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15109712#comment-15109712
 ] 

Siddharth Seth commented on TEZ-3036:
-------------------------------------

bq. Eventually the starts complete, and when the main thread joins with the 
service thread that caused the error we'll log it and exit. Not the most ideal 
solution, but it works in practice. I'm definitely open to other ideas on how 
to approach this.
Thanks for the explanation. I'm fine with the patch going in as is. Could you 
please add some comments around this before committing the patch, though? Also, 
please change ServiceThread.error to volatile.
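
To illustrate the volatile suggestion, here is a minimal sketch of the pattern being discussed. It is not the code from TEZ-3036.001.patch: the class ServiceThreadSketch, the startAll helper and the Runnable-based constructor are assumptions made up for this example; only the idea of a service thread that records its error for the main thread to read after joining comes from the discussion above.

```java
import java.util.List;

public class ServiceThreadSketch {

    /** Hypothetical stand-in for the patch's ServiceThread; not the Tez class. */
    public static class ServiceThread extends Thread {
        private final Runnable startAction;

        // volatile: a write made on the service thread is visible to any other
        // thread that reads the field, including one that has not joined yet.
        private volatile Throwable error;

        public ServiceThread(String name, Runnable startAction) {
            super(name);
            this.startAction = startAction;
        }

        @Override
        public void run() {
            try {
                startAction.run();
            } catch (Throwable t) {
                // Record the failure instead of letting the thread die silently.
                error = t;
            }
        }

        public Throwable getError() {
            return error;
        }
    }

    /**
     * Starts every thread, joins each one in order, and returns the first
     * recorded error, or null if all starts succeeded.
     */
    public static Throwable startAll(List<ServiceThread> threads) {
        for (ServiceThread t : threads) {
            t.start();
        }
        Throwable first = null;
        for (ServiceThread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                // Restore the interrupt flag and report the interruption.
                Thread.currentThread().interrupt();
                return e;
            }
            if (first == null && t.getError() != null) {
                first = t.getError();
            }
        }
        return first;
    }
}
```

A caller would log the Throwable returned by startAll and exit when it is non-null. Thread.join already gives the joining thread a consistent view of the field, so volatile mainly protects readers that look at the error before or without joining.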

I was thinking along the lines of skipping the stateChange from AbstractService 
altogether, since it appears to be broken, and implementing a custom interface 
that is invoked instead. That isn't a simple change, though. Fixing this in 
YARN itself would be ideal, with a change in stateChange behaviour or the 
addition of an error notification. However, that won't be usable for a while.
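
A rough sketch of what such a custom notification interface might look like follows. Everything in it is hypothetical: ServiceErrorListener, NotifyingService and FirstErrorRecorder are names invented for this example and exist in neither YARN nor Tez, and the sketch deliberately avoids the Hadoop service classes so that it stands alone.

```java
import java.util.concurrent.atomic.AtomicReference;

public class ErrorNotificationSketch {

    /** Hypothetical callback, invoked directly when a service fails to start. */
    public interface ServiceErrorListener {
        void onServiceError(String serviceName, Throwable cause);
    }

    /** Hypothetical service wrapper that reports start failures to a listener. */
    public static class NotifyingService {
        private final String name;
        private final Runnable startAction;
        private final ServiceErrorListener listener;

        public NotifyingService(String name, Runnable startAction,
                                ServiceErrorListener listener) {
            this.name = name;
            this.startAction = startAction;
            this.listener = listener;
        }

        public void start() {
            try {
                startAction.run();
            } catch (Throwable t) {
                // Notify explicitly rather than relying on a state-change event.
                listener.onServiceError(name, t);
            }
        }
    }

    /** Listener that remembers only the first error reported to it. */
    public static class FirstErrorRecorder implements ServiceErrorListener {
        private final AtomicReference<Throwable> first = new AtomicReference<>();

        @Override
        public void onServiceError(String serviceName, Throwable cause) {
            // compareAndSet keeps the first error even if several threads report.
            first.compareAndSet(null, cause);
        }

        public Throwable getFirstError() {
            return first.get();
        }
    }
}
```

The point of the design is that the failure path no longer depends on stateChange being delivered: the component that starts the services gets the error through its own callback and can log it and exit.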


> Tez AM can hang on startup with no indication of error
> ------------------------------------------------------
>
>                 Key: TEZ-3036
>                 URL: https://issues.apache.org/jira/browse/TEZ-3036
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: TEZ-3036.001.patch
>
>
> I've seen a couple of instances where the Tez AM fails to complete the 
> startup sequence.  It never gets around to registering with the 
> ResourceManager, so the RM eventually times out the attempt and starts 
> another.  The subsequent attempts do the same.  There are no indications in 
> the logs that anything is wrong, rather it just seems to get stuck during 
> startup then a bit over 10 minutes later is killed.



