[ https://issues.apache.org/jira/browse/HADOOP-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881687#comment-15881687 ]

Daniel Templeton commented on HADOOP-10584:
-------------------------------------------
Resetting the counts isn't the answer. I can now reproduce this issue reliably
by setting a breakpoint in {{processWatchEvent()}} and shutting down ZK before
continuing. The issue is a race condition between the events from the ZK
client and creating/statting the ZK node. If the disconnected update event
comes first, all is well. If not, the elector retries a few times and then
fails the RM.

To echo earlier comments, why does ZK connection loss necessitate stopping the
RM in this case? It doesn't in any other case. My proposal would be to remove
the fatal error completely. We could instead either transition to standby
explicitly or just ignore the error (and hence the retries) on connection loss
and wait for the ZK event to trigger the transition. I kinda like the latter.
Any opinion?
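
To make the two orderings concrete, here is a minimal sketch of the race and
of the "ignore connection loss" option. It is a toy model, not the real
ActiveStandbyElector: the class ElectorSketch, its methods, the State enum and
the retry limit of 3 are all hypothetical names and values chosen for
illustration, and I'm assuming the disconnected event is what moves us to
standby.

```java
/**
 * Toy model of the race described above. Hypothetical: this is NOT the
 * real ActiveStandbyElector, and no ZooKeeper classes are used.
 */
class ElectorSketch {
    enum State { ACTIVE, STANDBY, FAILED }

    private final int maxRetries;
    private final boolean ignoreConnectionLoss;
    private State state = State.ACTIVE;
    private int retries = 0;

    ElectorSketch(int maxRetries, boolean ignoreConnectionLoss) {
        this.maxRetries = maxRetries;
        this.ignoreConnectionLoss = ignoreConnectionLoss;
    }

    /** A create/stat on the lock node came back with connection loss. */
    void onConnectionLossResult() {
        if (state != State.ACTIVE) {
            return; // already standby or failed, nothing to do
        }
        if (ignoreConnectionLoss) {
            return; // proposed: no retry, wait for the disconnected event
        }
        if (++retries > maxRetries) {
            state = State.FAILED; // modeled current behavior: fatal error
        }
    }

    /** The ZK client delivered its disconnected event. */
    void onDisconnectedEvent() {
        if (state == State.ACTIVE) {
            state = State.STANDBY; // assumption: the event triggers standby
        }
    }

    State state() {
        return state;
    }

    /** Replays one ordering of the race and returns the final state. */
    static State simulate(boolean ignoreConnectionLoss, boolean eventFirst) {
        ElectorSketch elector = new ElectorSketch(3, ignoreConnectionLoss);
        if (eventFirst) {
            elector.onDisconnectedEvent();
        }
        // Four failed operations: one more than the modeled retry limit.
        for (int i = 0; i < 4; i++) {
            elector.onConnectionLossResult();
        }
        if (!eventFirst) {
            elector.onDisconnectedEvent();
        }
        return elector.state();
    }
}
```

In this model, simulate(false, false) is the bad ordering (failed operations
arrive before the event) and ends in FAILED, while simulate(true, false) ends
in STANDBY because connection loss is never treated as fatal. With the event
first, both variants end in STANDBY.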
> ActiveStandbyElector goes down if ZK quorum become unavailable
> --------------------------------------------------------------
>
> Key: HADOOP-10584
> URL: https://issues.apache.org/jira/browse/HADOOP-10584
> Project: Hadoop Common
> Issue Type: Bug
> Components: ha
> Affects Versions: 2.4.0
> Reporter: Karthik Kambatla
> Priority: Critical
> Attachments: hadoop-10584-prelim.patch, rm.log
>
>
> ActiveStandbyElector retries operations a few times. If the ZK quorum
> itself is down, it goes down and the daemons have to be brought up
> again.
> Instead, it should log the fact that it is unable to talk to ZK, call
> becomeStandby on its client, and continue to attempt connecting to ZK.