GitHub user hanm opened a pull request:

    https://github.com/apache/zookeeper/pull/315

    ZOOKEEPER-2577: Fix flaky testDuringLeaderSync test.

    testDuringLeaderSync uses the presence of intermediate zoo.cfg.dynamic file 
to decide if the reconfig operation was succeeded or not. This is not a problem 
and is logically correct, however in tests that provisions QuorumPeer directly 
through MainThread, the configFile will be null and the resulted intermediate 
dynamic reconfig file will has a name of "null.cfg.dynamic". This causes 
problem because multiple test cases use MainThread to provision QuorumPeer so 
these tests will share a single "null.cfg.dynamic" file, as opposed to the 
zoo.cfg.dynamic file in their specific test folder when configFile was not 
null. Since we support running concurrent ant unit tests in apache build 
infrastructure, it is highly likely that multiple tests that rely on this 
shared null.cfg.dynamic file will execute simultaneously which create all sorts 
of failure cases (this also explains why if one tries to reproduce the test 
failures in a standalone fashion the test will always pass, with the absence 
 of the file sharing.).
    
    This is problematic in multiple test cases and in particular cause this 
testDuringLeaderSync fail fairly frequently. There are multiple solutions to 
this problem:
    
    * Implement retry with timeout logic when detecting the presence of 
intermediate config files in testDuringLeaderSync. This will fix this specific 
test but the fix is non-deterministic and other tests might still fail because 
of sharing of null.cfg.dynamic file.
    
    * Always make sure to to feed a real config file when provision QuorumPeer. 
This however is an overkill as some cases a pure artificial QuorumPeer w/o real 
config file fit the use case well.
    
    * The approach used in this patch: when generating intermediate config 
files if configFile is null, generate a random unique file name. This code path 
is supposed to only active when executing unit tests. 
    
    Testing done: running this patch on apache jenkins for the past week with 
500 runs. Not only this test is fixed but overall stability of entirely unit 
tests are improved.
    
    Please review @shralex @arshadmohammad 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hanm/zookeeper working

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zookeeper/pull/315.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #315
    
----
commit f3b813924323a5fb64f9242303f3f5dd40d91a83
Author: Michael Han <[email protected]>
Date:   2017-07-23T06:03:46Z

    ZOOKEEPER-2577: Fix flaky testDuringLeaderSync test caused by null config 
file name.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

Reply via email to