[ 
https://issues.apache.org/jira/browse/HADOOP-8245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245991#comment-13245991
 ] 

Todd Lipcon commented on HADOOP-8245:
-------------------------------------

For problem #1, the solution is the same as is already done in some other test 
cases. We just need to add a workaround to clear the ZK MBeans before running 
the tearDown method. It's a hack, but in the absence of a fix for 
ZOOKEEPER-1438, it's about all we can do.
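
For illustration only, here is a rough sketch of what "clearing the ZK MBeans" could look like using nothing but the JDK's JMX API. It is not necessarily what the other test cases or the patch do. The class name, the method name, and the JMX domain string are all assumptions made for this example.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

/** Hypothetical helper, not part of Hadoop or ZooKeeper. */
public class ZkMBeanCleaner {
    // Assumption: the JMX domain under which ZooKeeper registers its beans.
    public static final String ASSUMED_ZK_DOMAIN = "org.apache.ZooKeeperService";

    /** Unregisters every MBean in the given domain and returns how many were removed. */
    public static int clearDomain(String domain) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        try {
            // "<domain>:*" matches all object names registered in that domain.
            Set<ObjectName> names = server.queryNames(new ObjectName(domain + ":*"), null);
            int removed = 0;
            for (ObjectName name : names) {
                server.unregisterMBean(name);
                removed++;
            }
            return removed;
        } catch (JMException e) {
            // Surface JMX failures as unchecked so a test setup method need not declare them.
            throw new IllegalStateException("could not clear MBeans in " + domain, e);
        }
    }
}
```

A test could call {{ZkMBeanCleaner.clearDomain(ZkMBeanCleaner.ASSUMED_ZK_DOMAIN)}} just before its tearDown logic runs, so that beans left over from an earlier ZK instance do not trip the JMX checks.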

I spent some time investigating problem #2. The bug is as follows:
- these test cases create a new ActiveStandbyElector, and call 
{{ActiveStandbyElector.ensureBaseNode()}} on it before running the main body of 
the tests. Although they don't call {{joinElection()}}, the creation of the 
elector does create a {{zkClient}} object with an associated Watcher.
- in the {{testZooKeeperFailure}} test case, we shut down and restart ZK. This 
causes the above Watcher instance to fire its Disconnected and then Connected 
events. There was a bug in the handling of the Connected event that would cause 
it to re-monitor the lock znode regardless of whether it was previously in the 
election.
- So, when ZK comes back up, there were not two but *three* electors racing for 
the lock. However, two of the electors actually corresponded to the same dummy 
service. In some cases this race would be resolved in such a way that the test 
timed out.

I don't think this is a problem in practice, since the "formatZK" call runs in 
its own JVM in the current code. However, it's worth fixing to get the tests to 
not be flaky, and to have a more reasonable behavior. There are several fixes 
to be done:
- Add extra asserts for {{wantToBeInElection}} to catch cases where we might 
accidentally re-join the election when we weren't supposed to be in it.
- Fix the handling of the "Connected" event to only re-join if the elector 
wants to be in the election.
- Cause exceptions thrown by watcher callbacks to be propagated back as fatal 
errors.
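
To make the three fixes concrete, here is a toy model of the intended behavior. It is a sketch and not the actual patch: it uses no ZooKeeper classes, and apart from {{wantToBeInElection}} and {{joinElection()}}, every name in it ({{ElectorSketch}}, {{Event}}, {{monitorLockNode}}, {{process}}, the getters) is made up for the example.

```java
/** Toy model of an elector's watcher handling; hypothetical, not Hadoop code. */
public class ElectorSketch {
    public enum Event { DISCONNECTED, CONNECTED }

    private boolean wantToBeInElection = false;
    private int monitorCalls = 0;
    private Throwable fatalError = null;

    public void joinElection() {
        wantToBeInElection = true;
        monitorLockNode();
    }

    public void quitElection() {
        wantToBeInElection = false;
    }

    // Stand-in for (re-)monitoring the lock znode.
    private void monitorLockNode() {
        // Fix 1: assert that we only ever monitor the lock while in the election.
        if (!wantToBeInElection) {
            throw new IllegalStateException("monitoring lock node while not in election");
        }
        monitorCalls++;
    }

    // Stand-in for the watcher callback that receives connection events.
    public void process(Event event) {
        try {
            // Fix 2: on reconnect, re-join only if the elector wants to be in the election.
            if (event == Event.CONNECTED && wantToBeInElection) {
                monitorLockNode();
            }
        } catch (RuntimeException e) {
            // Fix 3: record callback exceptions as fatal errors instead of dropping them.
            fatalError = e;
        }
    }

    public int getMonitorCalls() { return monitorCalls; }

    public Throwable getFatalError() { return fatalError; }
}
```

With this shape, an elector that was only created (for example, to ensure the base node exists) ignores a Disconnected/Connected cycle, while one that has called {{joinElection()}} re-monitors the lock after reconnecting.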

Will post a patch momentarily.

                
> Fix flakiness in TestZKFailoverController
> -----------------------------------------
>
>                 Key: HADOOP-8245
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8245
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: auto-failover, ha
>    Affects Versions: Auto Failover (HDFS-3042)
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Minor
>
> When I loop TestZKFailoverController, I occasionally see two types of 
> failures:
> 1) the ZK JMXEnv issue (ZOOKEEPER-1438)
> 2) TestZKFailoverController.testZooKeeperFailure fails with a timeout
> This JIRA is for fixes for these issues.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
