[
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581590#comment-14581590
]
Vinayakumar B commented on HADOOP-10251:
----------------------------------------
Which version of Hadoop are you using?
I ask because I can see the logs below (DEBUG excluded):
{noformat}2015-06-10 02:57:56,073 INFO
org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode
at zdh195/10.43.156.195:9000 to active state
2015-06-10 02:57:56,092 INFO org.apache.hadoop.ha.ZKFailoverController:
Successfully became active. Successfully transitioned NameNode at
zdh195/10.43.156.195:9000 to active state
2015-06-10 02:57:57,082 ERROR org.apache.hadoop.ha.ZKFailoverController: Local
service NameNode at zdh195/10.43.156.195:9000 has changed the serviceState to
active. Expected was standby. Quitting election marking fencing
necessary.{noformat}
Immediately after {{becomeActive()}}, the ERROR log shows that the expected
state is {{standby}}, even though {{serviceState}} is changed to {{active}} in
{{becomeActive()}} immediately after the log line above.
IMO, this is possible only if {{volatile}} is missing from the declaration of
{{serviceState}}:
{code}private volatile HAServiceState serviceState =
HAServiceState.INITIALIZING;{code}
Do you have this in your code?
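
To illustrate why the {{volatile}} matters, here is a minimal standalone sketch. It is not the ZKFailoverController code: the class {{ServiceStateVisibility}} and the methods {{matchesExpected()}} and {{readerSeesActive()}} are hypothetical names used only for this example. One thread sets the state the way {{becomeActive()}} does, while a second thread (standing in for whichever thread later compares the reported state with the expected one) polls it. With {{volatile}}, the second thread is guaranteed to see the write; without it, the Java memory model gives no such guarantee.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the controller; only the field declaration
// is taken from the comment above.
class ServiceStateVisibility {
    enum HAServiceState { INITIALIZING, ACTIVE, STANDBY }

    // volatile: a write by one thread is visible to later reads by others.
    private volatile HAServiceState serviceState = HAServiceState.INITIALIZING;

    public void becomeActive()  { serviceState = HAServiceState.ACTIVE; }
    public void becomeStandby() { serviceState = HAServiceState.STANDBY; }

    // Compares a state reported by the service with the locally expected one.
    public boolean matchesExpected(HAServiceState reported) {
        return reported == serviceState;
    }

    // Returns true if a polling thread observes ACTIVE within the timeout.
    public static boolean readerSeesActive(long timeoutMillis) {
        ServiceStateVisibility zkfc = new ServiceStateVisibility();
        zkfc.becomeStandby();
        CountDownLatch seen = new CountDownLatch(1);

        Thread reader = new Thread(() -> {
            // Each iteration performs a fresh volatile read of serviceState.
            while (!zkfc.matchesExpected(HAServiceState.ACTIVE)) {
                Thread.yield();
            }
            seen.countDown();
        });
        reader.setDaemon(true);
        reader.start();

        zkfc.becomeActive(); // the write happens on the calling thread
        try {
            return seen.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("reader saw ACTIVE: " + readerSeesActive(5000));
    }
}
```

If the field were not {{volatile}}, the reader would be allowed to keep seeing the stale {{STANDBY}} value, which would match the "Expected was standby" message in the log above.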
> Both NameNodes could be in STANDBY State if SNN network is unstable
> -------------------------------------------------------------------
>
> Key: HADOOP-10251
> URL: https://issues.apache.org/jira/browse/HADOOP-10251
> Project: Hadoop Common
> Issue Type: Bug
> Components: ha
> Affects Versions: 2.2.0
> Reporter: Vinayakumar B
> Assignee: Vinayakumar B
> Priority: Critical
> Fix For: 2.5.0
>
> Attachments: HADOOP-10251.patch, HADOOP-10251.patch,
> HADOOP-10251.patch, HADOOP-10251.patch, HADOOP-10251.patch
>
>
> The following corner-case scenario happened in one of our clusters.
> 1. NN1 was Active and NN2 was Standby.
> 2. The network on the NN2 machine was slow.
> 3. NN1 got shut down.
> 4. The NN2 ZKFC got the notification and tried to check for the old active for
> fencing. (This took a little more time, again due to the slow network.)
> 5. In between, NN1 was restarted by our automatic monitoring, and ZKFC made
> it Active.
> 6. The NN2 ZKFC then got NN1 as the old active and gracefully fenced NN1 to
> STANDBY.
> 7. Before writing the ActiveBreadCrumb to ZK, the NN2 ZKFC got a session
> timeout and shut down before making NN2 Active.
> *Now the cluster has both NameNodes in STANDBY.*
> The NN1 ZKFC still thinks that its NameNode is in the Active state.
> The NN2 ZKFC is waiting for election.