Akihiro Suda created ZOOKEEPER-2212:
---------------------------------------

             Summary: distributed race condition related to QV version
                 Key: ZOOKEEPER-2212
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2212
             Project: ZooKeeper
          Issue Type: Bug
          Components: quorum
            Reporter: Akihiro Suda


When a joiner is listed as an observer in an initial config,
the joiner should become a non-voting follower (not an observer) until reconfig 
is triggered. 
[(Link)|http://zookeeper.apache.org/doc/trunk/zookeeperReconfig.html#sc_reconfig_general]

I found a distributed race condition in which an observer remains an 
observer and cannot become a non-voting follower.

This race condition happens when an observer receives an UPTODATE Quorum Packet 
from leader:2888/tcp *after* receiving, from leader:3888/tcp, an FLE 
Notification Packet whose n.config version is larger than the observer's own.

h4. Detail
 * Problem: An observer cannot become a non-voting follower
 * Cause: Cannot restart FLE
 * Cause: In {{QuorumPeer.run()}}, cannot shut down {{Observer}} 
[(Link)|https://github.com/apache/zookeeper/blob/98a3cabfa279833b81908d72f1c10ee9f598a045/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L1014]
 * Cause: In {{QuorumPeer.run()}}, cannot return from 
{{Observer.observeLeader()}} 
[(Link)|https://github.com/apache/zookeeper/blob/98a3cabfa279833b81908d72f1c10ee9f598a045/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L1010]
 * Cause: In {{Observer.observeLeader()}}, {{Learner.syncWithLeader()}} does 
not throw an exception of "changes proposed in reconfig" 
[(Link)|https://github.com/apache/zookeeper/blob/98a3cabfa279833b81908d72f1c10ee9f598a045/src/java/main/org/apache/zookeeper/server/quorum/Observer.java#L79]
 * Cause: In {{Learner.syncWithLeader()}}, {{QuorumPeer.processReconfig()}} 
returns false with a log message like ["2 setQuorumVerifier called with known 
or old config 4294967296. Current version: 
4294967296"|https://github.com/osrg/earthquake/blob/v0.1/example/zk-found-bug.ether/example-output/3.REPRODUCED/zk2.log].
 * Cause: The observer has already received a Notification 
Packet ({{n.config.version=4294967296}}) and invoked 
{{QuorumPeer.processReconfig()}} 
[(Link)|https://github.com/apache/zookeeper/blob/98a3cabfa279833b81908d72f1c10ee9f598a045/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L291-304]
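
The cause chain above can be illustrated with a minimal, self-contained sketch. This is not ZooKeeper's code: the {{Peer}} class, the {{restartsElection()}} helper, and the observer's initial version of 0 are hypothetical stand-ins, and the version check only models the "known or old config" behaviour seen in the log message. It assumes the Notification path updates the stored version without restarting FLE, as described above.

```java
public class QvVersionRaceSketch {

    /** Hypothetical stand-in for a peer's view of the quorum config version. */
    public static final class Peer {
        private long configVersion;

        public Peer(long initialVersion) {
            this.configVersion = initialVersion;
        }

        /**
         * Models the "known or old config" check: the config is adopted,
         * and true is returned, only when the incoming version is strictly
         * newer than the one already stored.
         */
        public boolean processReconfig(long incomingVersion) {
            if (incomingVersion <= configVersion) {
                return false; // known or old config: nothing changes
            }
            configVersion = incomingVersion;
            return true;
        }
    }

    /**
     * Returns true if the modelled observer would leave observeLeader()
     * (and so restart FLE) when it handles UPTODATE.
     *
     * @param notificationFirst whether the FLE Notification carrying the
     *                          newer config arrives before UPTODATE
     */
    public static boolean restartsElection(boolean notificationFirst) {
        long initialVersion = 0L;          // assumption: any older version
        long leaderVersion = 4294967296L;  // version seen in the log message
        Peer observer = new Peer(initialVersion);

        if (notificationFirst) {
            // Notification path: the version is updated here, but in this
            // model the result does not lead to a restart.
            observer.processReconfig(leaderVersion);
        }

        // UPTODATE path: only a "true" result would raise the
        // "changes proposed in reconfig" exception.
        return observer.processReconfig(leaderVersion);
    }

    public static void main(String[] args) {
        System.out.println("UPTODATE only:      restart=" + restartsElection(false));
        System.out.println("Notification first: restart=" + restartsElection(true));
    }
}
```

In this model, the restart happens only when UPTODATE is the first message to carry the newer version. When the Notification arrives first, the second call sees an already-known version and returns false, which corresponds to the stuck observer.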
   
h4. How I found this bug
I found this bug using [Earthquake|http://osrg.github.io/earthquake/], our 
open-source dynamic model checker for real implementations of distributed 
systems.

Earthquake permutes C/Java function calls, Ethernet packets, and injected 
fault events in various orders so as to find implementation-level bugs in the 
distributed system.

When Earthquake finds a bug, Earthquake automatically records [the event 
history|https://github.com/osrg/earthquake/blob/v0.1/example/zk-found-bug.ether/example-output/3.REPRODUCED/json]
 and helps the user to analyze which permutation of events triggers the bug.

I analyzed Earthquake's event histories and found that the bug is triggered 
when an observer receives an UPTODATE *after* receiving a specific kind of FLE 
packet.

h4. How to reproduce this bug
You can also easily reproduce the bug using Earthquake.

{code}
    host$ sudo modprobe openvswitch
    host$ docker run --privileged -t -i --rm osrg/earthquake-zookeeper-2212
    guest$ ./000-prepare.sh
    [INFO] Starting Earthquake Ethernet Switch
    [INFO] Starting Earthquake Orchestrator
    [INFO] Starting Earthquake Ethernet Inspector
    [IMPORTANT] Please kill the processes (switch=1234, orchestrator=1235, and 
inspector=1236) after you finished all of the experiments
    [IMPORTANT] Please continue to 100-run-experiment.sh..
    guest$ ./100-run-experiment.sh
    [IMPORTANT] THE BUG WAS REPRODUCED!
    guest$ kill -9 1234 1235 1236
{code}

Note that {{--privileged}} is needed, as this container uses Docker-in-Docker.

For further information about reproducing this bug, please refer to 
https://github.com/osrg/earthquake/blob/v0.1/example/zk-found-bug.ether




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
