[ https://issues.apache.org/jira/browse/TEZ-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15109623#comment-15109623 ]
Jason Lowe commented on TEZ-3036:
---------------------------------
Yes, it's an artifact of the AbstractService exception handling. The state
goes to STARTED before the service's start logic actually runs, so it's still
in the STARTED state when that logic fails.
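
To make the ordering concrete, here is a minimal sketch of a lifecycle that flips its state before running the start logic. It is a simplified, hypothetical model and not the Hadoop AbstractService source: the class name LifecycleSketch, its fields, and the assumption that only Exception (not Error) is caught are all illustrative.

```java
// Simplified, hypothetical model of a service lifecycle. It is NOT the Hadoop
// AbstractService source; names and structure are assumptions for illustration.
class LifecycleSketch {
    enum State { INITED, STARTED, STOPPED }

    private volatile State state = State.INITED;
    private volatile Throwable failureCause; // stays null unless the catch block runs
    private final Throwable toThrow;         // what the simulated start logic throws, or null

    LifecycleSketch(Throwable toThrow) {
        this.toThrow = toThrow;
    }

    State getState() {
        return state;
    }

    Throwable getFailureCause() {
        return failureCause;
    }

    void start() {
        state = State.STARTED;               // the state flips before any start logic runs
        try {
            serviceStart();
        } catch (Exception e) {              // assumption: only Exception is handled here
            failureCause = e;
            state = State.STOPPED;
            throw new IllegalStateException(e);
        }
        // An Error skips the catch block: the state stays STARTED and the cause stays null.
    }

    private void serviceStart() throws Exception {
        if (toThrow instanceof Error) {
            throw (Error) toThrow;
        }
        if (toThrow instanceof Exception) {
            throw (Exception) toThrow;
        }
    }
}
```

In this model, a start that dies with an Error leaves getState() at STARTED and getFailureCause() at null, which is the combination an observer of the service would see.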
bq. Is the intent to depend on the UncaughtExceptionHandler for this?
No, if we rely on the UncaughtExceptionHandler then it still deadlocks per my
previous comment. The shutdown hook gets invoked, and then that hangs waiting
for a lock that's held while we're still starting services. Bubbling up the
uncaught exception will hang the startup (as seen here), so we're deadlocked.
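
The shape of that hang can be sketched roughly as follows. This is an illustrative model only, not Tez code: StartupDeadlockSketch, startupLock, and dependencyStarted are hypothetical names, and the "hook" is simulated with a timed tryLock so the example terminates instead of actually deadlocking.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of the hang: a starter thread holds a lock while waiting
// for a dependency that will never report in, and a hook-like caller needs that lock.
class StartupDeadlockSketch {

    /** Returns true if the simulated hook got the lock within the timeout. */
    static boolean hookCanAcquireLock(long timeoutMillis) {
        ReentrantLock startupLock = new ReentrantLock();
        CountDownLatch lockHeld = new CountDownLatch(1);
        // Never counted down: stands in for a dependency whose start thread died.
        CountDownLatch dependencyStarted = new CountDownLatch(1);

        Thread starter = new Thread(() -> {
            startupLock.lock();
            try {
                lockHeld.countDown();
                dependencyStarted.await(); // startup is stuck here while holding the lock
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                startupLock.unlock();
            }
        }, "starter");
        starter.setDaemon(true);
        starter.start();

        try {
            lockHeld.await();
            // Stand-in for the shutdown hook: a real hook would block here indefinitely.
            boolean acquired = startupLock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
            if (acquired) {
                startupLock.unlock();
            }
            starter.interrupt(); // clean up the sketch so it does not leak a blocked thread
            starter.join();
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

With an untimed lock() in place of tryLock, neither side could make progress: the starter waits on the dependency, and the hook waits on the starter.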
This patch "works" because it allows the services to make progress. Even
though the failure cause is unset when a throwable occurs, the state is still
STARTED so it convinces the dependent services to start as well. Eventually
the starts complete, and when the main thread joins with the service thread
that caused the error, we'll log it and exit. It's not an ideal solution, but
it works in practice. I'm definitely open to other ideas on how to approach
this.
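
For illustration, the join-then-report step could look something like the sketch below. It is not taken from the patch: ServiceThreadSketch and startAndJoin are hypothetical names, and the sketch only assumes that each service thread records its own throwable for the main thread to inspect after joining.

```java
// Hypothetical sketch of "record the error in the service thread, report it
// from the main thread after join". Names are illustrative, not from the patch.
class ServiceThreadSketch extends Thread {
    private final Runnable startAction;
    private volatile Throwable error; // set only if the start action throws

    ServiceThreadSketch(String name, Runnable startAction) {
        super(name);
        this.startAction = startAction;
    }

    @Override
    public void run() {
        try {
            startAction.run();
        } catch (Throwable t) {
            error = t; // remember the failure instead of letting it escape the thread
        }
    }

    Throwable getError() {
        return error;
    }

    /** Starts every thread, joins them all, and returns the first recorded error (or null). */
    static Throwable startAndJoin(ServiceThreadSketch... threads) {
        for (ServiceThreadSketch t : threads) {
            t.start();
        }
        Throwable first = null;
        for (ServiceThreadSketch t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return e;
            }
            if (first == null && t.getError() != null) {
                first = t.getError();
            }
        }
        return first;
    }
}
```

A caller would check the return value of startAndJoin, log it if it is non-null, and exit, so the failure surfaces only after every start has had the chance to complete.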
> Tez AM can hang on startup with no indication of error
> ------------------------------------------------------
>
> Key: TEZ-3036
> URL: https://issues.apache.org/jira/browse/TEZ-3036
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Critical
> Attachments: TEZ-3036.001.patch
>
>
> I've seen a couple of instances where the Tez AM fails to complete the
> startup sequence. It never gets around to registering with the
> ResourceManager, so the RM eventually times out the attempt and starts
> another. The subsequent attempts do the same. There are no indications in
> the logs that anything is wrong; it just seems to get stuck during startup
> and is then killed a bit over 10 minutes later.