[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587329#comment-14587329
 ] 

Akihiro Suda commented on ZOOKEEPER-2212:
-----------------------------------------

Thank you for the information!

The failing test seems to be under discussion in [ZOOKEEPER-2080].



> distributed race condition related to QV version
> ------------------------------------------------
>
>                 Key: ZOOKEEPER-2212
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2212
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.5.0
>            Reporter: Akihiro Suda
>            Assignee: Akihiro Suda
>            Priority: Critical
>             Fix For: 3.5.1, 3.6.0
>
>         Attachments: 
> 0001-ZOOKEEPER-2212-distributed-race-condition-related-to.patch, 
> ZOOKEEPER-2212-v2.patch, ZOOKEEPER-2212-v3.patch
>
>
> When a joiner is listed as an observer in an initial config,
> the joiner should become a non-voting follower (not an observer) until 
> reconfig is triggered. 
> [(Link)|http://zookeeper.apache.org/doc/trunk/zookeeperReconfig.html#sc_reconfig_general]
> I found a distributed race condition in which an observer remains an 
> observer and cannot become a non-voting follower.
> This race condition happens when an observer receives an UPTODATE Quorum 
> Packet from leader:2888/tcp *after* receiving a Notification FLE Packet 
> from leader:3888/tcp whose n.config version is larger than the observer's 
> own.
> h4. Detail
>  * Problem: An observer cannot become a non-voting follower
>  * Cause: Cannot restart FLE
>  * Cause: In {{QuorumPeer.run()}}, cannot shut down {{Observer}} 
> [(Link)|https://github.com/apache/zookeeper/blob/98a3cabfa279833b81908d72f1c10ee9f598a045/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L1014]
>  * Cause: In {{QuorumPeer.run()}}, cannot return from 
> {{Observer.observeLeader()}} 
> [(Link)|https://github.com/apache/zookeeper/blob/98a3cabfa279833b81908d72f1c10ee9f598a045/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L1010]
>  * Cause: In {{Observer.observeLeader()}}, {{Learner.syncWithLeader()}} does 
> not throw an exception of "changes proposed in reconfig" 
> [(Link)|https://github.com/apache/zookeeper/blob/98a3cabfa279833b81908d72f1c10ee9f598a045/src/java/main/org/apache/zookeeper/server/quorum/Observer.java#L79]
>  * Cause: In {{switch(qp.getType()) case UPTODATE}} of 
> {{Learner.syncWithLeader()}} 
> [(Link)|https://github.com/apache/zookeeper/blob/98a3cabfa279833b81908d72f1c10ee9f598a045/src/java/main/org/apache/zookeeper/server/quorum/Learner.java#L492-507],
>  {{QuorumPeer.processReconfig()}} 
> [(Link)|https://github.com/apache/zookeeper/blob/98a3cabfa279833b81908d72f1c10ee9f598a045/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L1644] returns
>  false with a log message like ["2 setQuorumVerifier called with known or old 
> config 4294967296. Current version: 
> 4294967296"|https://github.com/osrg/earthquake/blob/v0.1/example/zk-found-bug.ether/example-output/3.REPRODUCED/zk2.log]
>  
> [(Link)|https://github.com/apache/zookeeper/blob/98a3cabfa279833b81908d72f1c10ee9f598a045/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L1369].
>  * Cause: The observer has already received a Notification 
> Packet ({{n.config.version=4294967296}}) and invoked 
> {{QuorumPeer.processReconfig()}} 
> [(Link)|https://github.com/apache/zookeeper/blob/98a3cabfa279833b81908d72f1c10ee9f598a045/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L291-304]
>    
> h4. How I found this bug
> I found this bug using [Earthquake|http://osrg.github.io/earthquake/], our 
> open-source dynamic model checker for real implementations of distributed 
> systems.
> Earthquake permutes C/Java function calls, Ethernet packets, and injected 
> fault events in various orders to find implementation-level bugs in the 
> distributed system.
> When Earthquake finds a bug, Earthquake automatically records [the event 
> history|https://github.com/osrg/earthquake/blob/v0.1/example/zk-found-bug.ether/example-output/3.REPRODUCED/json]
>  and helps the user to analyze which permutation of events triggers the bug.
> I analyzed Earthquake's event histories and found that the bug is triggered 
> when an observer receives an UPTODATE *after* receiving a specific kind of 
> FLE packet.
> h4. How to reproduce this bug
> You can also easily reproduce the bug using Earthquake.
> I published a Docker image 
> [osrg/earthquake-zookeeper-2212|https://registry.hub.docker.com/u/osrg/earthquake-zookeeper-2212/]
>  on Docker Hub:
> {code}
>     host$ sudo modprobe openvswitch
>     host$ docker run --privileged -t -i --rm osrg/earthquake-zookeeper-2212
>     guest$ ./000-prepare.sh
>     [INFO] Starting Earthquake Ethernet Switch
>     [INFO] Starting Earthquake Orchestrator
>     [INFO] Starting Earthquake Ethernet Inspector
>     [IMPORTANT] Please kill the processes (switch=1234, orchestrator=1235, 
> and inspector=1236) after you finished all of the experiments
>     [IMPORTANT] Please continue to 100-run-experiment.sh..
>     guest$ ./100-run-experiment.sh
>     [IMPORTANT] THE BUG WAS REPRODUCED!
>     guest$ kill -9 1234 1235 1236
> {code}
> Note that {{--privileged}} is needed, as this container uses Docker-in-Docker.
> For further information about reproducing this bug, please refer to 
> https://github.com/osrg/earthquake/blob/v0.1/example/zk-found-bug.ether

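For readers following the quoted Detail section, the ordering dependency can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the ZooKeeper code: `ToyPeer`, `restartsOnUpToDate`, and the starting config version 0 are hypothetical, and the toy `processReconfig` keeps only the "accept strictly newer versions" check implied by the log message quoted above.

```java
// Toy model of the race; none of these types are ZooKeeper classes.
class QvRaceSketch {
    /** Hypothetical stand-in for a peer that tracks its config (QV) version. */
    static class ToyPeer {
        long configVersion;

        ToyPeer(long initialVersion) {
            this.configVersion = initialVersion;
        }

        /** Returns true only when the incoming version is strictly newer. */
        boolean processReconfig(long incomingVersion) {
            if (incomingVersion <= configVersion) {
                return false; // "known or old config": nothing changes
            }
            configVersion = incomingVersion;
            return true;
        }
    }

    /**
     * Returns true if the toy observer sees a config change while handling
     * UPTODATE, which is what would let it leave observeLeader() and
     * restart leader election.
     */
    static boolean restartsOnUpToDate(boolean notificationFirst) {
        long newVersion = 4294967296L;      // version from the quoted log message
        ToyPeer observer = new ToyPeer(0L); // assumed older starting version
        if (notificationFirst) {
            // FLE Notification (leader:3888) already applied the new config.
            observer.processReconfig(newVersion);
        }
        // UPTODATE (leader:2888) carries the same config version.
        return observer.processReconfig(newVersion);
    }

    public static void main(String[] args) {
        System.out.println("UPTODATE first: restart=" + restartsOnUpToDate(false));
        System.out.println("Notification first: restart=" + restartsOnUpToDate(true));
    }
}
```

In this model the restart happens only when UPTODATE is the first packet to deliver the new version. If the Notification arrives first, the second call returns false and the toy observer stays an observer, which mirrors the symptom in the report.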


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
