[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585524#comment-14585524
 ] 

Hadoop QA commented on ZOOKEEPER-2212:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12739553/0001-ZOOKEEPER-2212-distributed-race-condition-related-to.patch
  against trunk revision 1685200.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    -1 patch.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2769//console

This message is automatically generated.

> distributed race condition related to QV version
> ------------------------------------------------
>
>                 Key: ZOOKEEPER-2212
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2212
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>            Reporter: Akihiro Suda
>            Assignee: Alexander Shraer
>            Priority: Critical
>         Attachments: 
> 0001-ZOOKEEPER-2212-distributed-race-condition-related-to.patch
>
>
> When a joiner is listed as an observer in an initial config,
> the joiner should become a non-voting follower (not an observer) until 
> reconfig is triggered. 
> [(Link)|http://zookeeper.apache.org/doc/trunk/zookeeperReconfig.html#sc_reconfig_general]
> I found a distributed race condition in which an observer remains an 
> observer and cannot become a non-voting follower.
> This race condition happens when an observer receives an UPTODATE Quorum 
> Packet from leader:2888/tcp *after* receiving an FLE Notification Packet 
> from leader:3888/tcp whose n.config version is larger than the observer's 
> own.
> h4. Detail
>  * Problem: An observer cannot become a non-voting follower
>  * Cause: Cannot restart FLE
>  * Cause: In {{QuorumPeer.run()}}, cannot shut down {{Observer}} 
> [(Link)|https://github.com/apache/zookeeper/blob/98a3cabfa279833b81908d72f1c10ee9f598a045/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L1014]
>  * Cause: In {{QuorumPeer.run()}}, cannot return from 
> {{Observer.observeLeader()}} 
> [(Link)|https://github.com/apache/zookeeper/blob/98a3cabfa279833b81908d72f1c10ee9f598a045/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L1010]
>  * Cause: In {{Observer.observeLeader()}}, {{Learner.syncWithLeader()}} does 
> not throw an exception of "changes proposed in reconfig" 
> [(Link)|https://github.com/apache/zookeeper/blob/98a3cabfa279833b81908d72f1c10ee9f598a045/src/java/main/org/apache/zookeeper/server/quorum/Observer.java#L79]
>  * Cause: In {{switch(qp.getType()) case UPTODATE}} of 
> {{Learner.syncWithLeader()}} 
> [(Link)|https://github.com/apache/zookeeper/blob/98a3cabfa279833b81908d72f1c10ee9f598a045/src/java/main/org/apache/zookeeper/server/quorum/Learner.java#L492-507],
>  {{QuorumPeer.processReconfig()}} 
> [(Link)|https://github.com/apache/zookeeper/blob/98a3cabfa279833b81908d72f1c10ee9f598a045/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L1644]
>  returns false with a log message like 
> ["2 setQuorumVerifier called with known or old config 4294967296. Current version: 4294967296"|https://github.com/osrg/earthquake/blob/v0.1/example/zk-found-bug.ether/example-output/3.REPRODUCED/zk2.log]
> [(Link)|https://github.com/apache/zookeeper/blob/98a3cabfa279833b81908d72f1c10ee9f598a045/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L1369].
>  * Cause: The observer has already received a Notification 
> Packet ({{n.config.version=4294967296}}) and invoked 
> {{QuorumPeer.processReconfig()}} 
> [(Link)|https://github.com/apache/zookeeper/blob/98a3cabfa279833b81908d72f1c10ee9f598a045/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L291-304]
>    
> h4. How I found this bug
> I found this bug using [Earthquake|http://osrg.github.io/earthquake/], our 
> open-source dynamic model checker for real implementations of distributed 
> systems.
> Earthquake permutes C/Java function calls, Ethernet packets, and injected 
> fault events in various orders to find implementation-level bugs in the 
> distributed system.
> When Earthquake finds a bug, Earthquake automatically records [the event 
> history|https://github.com/osrg/earthquake/blob/v0.1/example/zk-found-bug.ether/example-output/3.REPRODUCED/json]
>  and helps the user to analyze which permutation of events triggers the bug.
> I analyzed Earthquake's event histories and found that the bug is triggered 
> when an observer receives an UPTODATE *after* receiving a specific kind of 
> FLE packet.
> h4. How to reproduce this bug
> You can also easily reproduce the bug using Earthquake.
> I published a Docker image 
> [osrg/earthquake-zookeeper-2212|https://registry.hub.docker.com/u/osrg/earthquake-zookeeper-2212/]
>  on Docker Hub:
> {code}
>     host$ sudo modprobe openvswitch
>     host$ docker run --privileged -t -i --rm osrg/earthquake-zookeeper-2212
>     guest$ ./000-prepare.sh
>     [INFO] Starting Earthquake Ethernet Switch
>     [INFO] Starting Earthquake Orchestrator
>     [INFO] Starting Earthquake Ethernet Inspector
>     [IMPORTANT] Please kill the processes (switch=1234, orchestrator=1235, and inspector=1236) after you finished all of the experiments
>     [IMPORTANT] Please continue to 100-run-experiment.sh..
>     guest$ ./100-run-experiment.sh
>     [IMPORTANT] THE BUG WAS REPRODUCED!
>     guest$ kill -9 1234 1235 1236
> {code}
> Note that {{--privileged}} is needed, as this container uses Docker-in-Docker.
> For further information about reproducing this bug, please refer to 
> https://github.com/osrg/earthquake/blob/v0.1/example/zk-found-bug.ether
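
The ordering dependence described under "Detail" in the quoted report can be illustrated with a small model. This is only a sketch, not ZooKeeper code: `PeerModel`, `Event`, and `restartsElection` are hypothetical names, the initial config version of 0 is an assumption, and the model covers just the two behaviors the report names (the "known or old config" version check, and the UPTODATE path that raises "changes proposed in reconfig" only when the config actually changed).

```java
import java.util.List;

final class QvRaceSketch {
    /** Hypothetical stand-in for a peer's view of the quorum config; not the real QuorumPeer. */
    static final class PeerModel {
        long configVersion;

        PeerModel(long initialVersion) {
            this.configVersion = initialVersion;
        }

        /** Returns true only if the incoming version is newer than the stored one. */
        boolean processReconfig(long incomingVersion) {
            if (incomingVersion <= configVersion) {
                return false; // corresponds to the "known or old config" log message
            }
            configVersion = incomingVersion;
            return true;
        }
    }

    /** The two packet kinds from the report, both carrying the same new config version. */
    enum Event { FLE_NOTIFICATION, UPTODATE }

    /** Returns true if, in this model, the observer would leave observing and restart FLE. */
    static boolean restartsElection(List<Event> order, long initialVersion, long newVersion) {
        PeerModel peer = new PeerModel(initialVersion);
        boolean restart = false;
        for (Event e : order) {
            boolean changed = peer.processReconfig(newVersion);
            // Per the report, only the UPTODATE path throws "changes proposed in reconfig",
            // and only when processReconfig() reports a change.
            if (e == Event.UPTODATE && changed) {
                restart = true;
            }
        }
        return restart;
    }

    public static void main(String[] args) {
        long newVersion = 4294967296L; // version seen in the quoted log message
        System.out.println("UPTODATE first: "
                + restartsElection(List.of(Event.UPTODATE, Event.FLE_NOTIFICATION), 0L, newVersion));
        System.out.println("FLE notification first: "
                + restartsElection(List.of(Event.FLE_NOTIFICATION, Event.UPTODATE), 0L, newVersion));
        // prints:
        // UPTODATE first: true
        // FLE notification first: false
    }
}
```

In the second ordering the notification has already advanced the stored version, so the later UPTODATE sees no change and nothing forces the observer out of its loop. This matches the stuck state the report describes.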



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
