[ https://issues.apache.org/jira/browse/HADOOP-10668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ming Ma updated HADOOP-10668:
-----------------------------
Attachment: HADOOP-10668.patch
The likely culprit is the check that decides whether a node has reached the
expected state. {{ZKFailoverController}} keeps its own {{serviceState}}, and
the HA service, here {{DummyHAService}}, keeps its own state separately.
{{MiniZKFCCluster}}'s {{waitForHAState}} looks only at the {{DummyHAService}}
state to decide that a transition has completed. When fencing is involved,
however, the to-be-elected active calls the old active's
{{transitionToStandby}} method directly, so {{DummyHAService}}'s state can be
set to standby before {{ZKFailoverController}}'s state is updated.
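The ordering problem described above can be reproduced in isolation. The
following is only a sketch under stated assumptions, not code from the patch:
{{HaStateWaitSketch}}, {{Node}}, {{State}}, {{waitForServiceOnly}} and
{{waitForBoth}} are hypothetical names, and the ZK callback is simulated with
a delayed thread. It shows that a wait on the service state alone can return
while the controller's view is still stale, whereas a wait on both states
cannot.

```java
import java.util.function.BooleanSupplier;

public class HaStateWaitSketch {

    /** Hypothetical stand-in for the HA state; not Hadoop's own enum. */
    public enum State { ACTIVE, STANDBY }

    /** Hypothetical node: the service's state plus its controller's view of it. */
    public static final class Node {
        // Flipped directly by the peer during fencing.
        public volatile State serviceState = State.ACTIVE;
        // Flipped only later, when the (simulated) ZK callback runs.
        public volatile State controllerState = State.ACTIVE;
    }

    /** Polls until the condition holds; returns false on timeout or interrupt. */
    static boolean waitFor(BooleanSupplier condition, long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    /** Looks at the service state only, like the check described in the comment. */
    public static boolean waitForServiceOnly(Node n, State expected, long timeoutMs) {
        return waitFor(() -> n.serviceState == expected, timeoutMs);
    }

    /** Assumed stricter check: service and controller must both report the state. */
    public static boolean waitForBoth(Node n, State expected, long timeoutMs) {
        return waitFor(
            () -> n.serviceState == expected && n.controllerState == expected,
            timeoutMs);
    }

    public static void main(String[] args) throws InterruptedException {
        Node oldActive = new Node();

        // Fencing: the new active sets the old active's service state directly.
        oldActive.serviceState = State.STANDBY;

        // Simulated ZK callback that updates the controller's state later.
        Thread zkCallback = new Thread(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            oldActive.controllerState = State.STANDBY;
        });
        zkCallback.start();

        boolean serviceOnly = waitForServiceOnly(oldActive, State.STANDBY, 1000);
        System.out.println("service-only wait returned " + serviceOnly
            + " while controller state was " + oldActive.controllerState);

        boolean both = waitForBoth(oldActive, State.STANDBY, 1000);
        System.out.println("wait on both states returned " + both
            + ", controller state is now " + oldActive.controllerState);

        zkCallback.join();
    }
}
```

Whether the real {{waitForHAState}} should check both states, or synchronize
in some other way, is a design decision for the patch; the sketch only
illustrates why checking one state is not sufficient.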
The patch does not change the fact that {{ZKFailoverController}}'s state is
updated only when it receives a notification through the ZK callback. So even
with the fix, the following error may still appear in the log. That is fine:
{{ZKFailoverController}}'s state will eventually change to standby.
{noformat}
2015-01-12 15:08:16,497 ERROR ha.ZKFailoverController
(ZKFailoverController.java:verifyChangedServiceState(828)) - Local service
DummyHAService #1 has changed the serviceState to standby. Expected was active.
Quitting election marking fencing necessary.
{noformat}
> TestZKFailoverControllerStress#testExpireBackAndForth occasionally fails
> ------------------------------------------------------------------------
>
> Key: HADOOP-10668
> URL: https://issues.apache.org/jira/browse/HADOOP-10668
> Project: Hadoop Common
> Issue Type: Test
> Components: test
> Affects Versions: 3.0.0
> Reporter: Ted Yu
> Labels: test
> Attachments: HADOOP-10668.patch
>
>
> From
> https://builds.apache.org/job/PreCommit-HADOOP-Build/4018//testReport/org.apache.hadoop.ha/TestZKFailoverControllerStress/testExpireBackAndForth/
> :
> {code}
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
>     at org.apache.zookeeper.server.DataTree.getData(DataTree.java:648)
>     at org.apache.zookeeper.server.ZKDatabase.getData(ZKDatabase.java:371)
>     at org.apache.hadoop.ha.MiniZKFCCluster.expireActiveLockHolder(MiniZKFCCluster.java:199)
>     at org.apache.hadoop.ha.MiniZKFCCluster.expireAndVerifyFailover(MiniZKFCCluster.java:234)
>     at org.apache.hadoop.ha.TestZKFailoverControllerStress.testExpireBackAndForth(TestZKFailoverControllerStress.java:84)
> {code}