Re: Review Request 39325: Fixed race between coordinator election and recovery in replicated log.

Neil Conway Wed, 14 Oct 2015 18:22:15 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39325/
-----------------------------------------------------------


(Updated Oct. 15, 2015, 1:21 a.m.)


Review request for mesos, Jie Yu, Joris Van Remoortere, and Timothy Chen.


Changes
-------

Fixed Linux compile errors in test code.


Bugs: MESOS-3280
    https://issues.apache.org/jira/browse/MESOS-3280


Repository: mesos


Description
-------

MESOS-3280. The basic problem is that replicas silently ignore inbound Promise
and Write requests if they have not finished the recovery protocol yet (because
they can't safely vote on such requests). Hence, if we try to do a Paxos round
while a quorum of nodes have not finished recovering, the Paxos round will never
complete. In particular, this might happen during coordinator election:
coordinator election (which is implemented as performing a full Paxos round)
starts as soon as the candidate coordinator replica has finished the recovery
protocol. If several nodes start concurrently, a quorum of those nodes might
still be executing the recovery protocol, and hence the coordinator will never
be elected.

To address this, add "ignored" responses to the Promise and Write sub-protocols:
if a proposer sees a quorum of "ignored" responses to a promise or write request
it has issued, it knows the request will never succeed.  When used for
coordinator election, the current coding will retry immediately (without a
backoff).

Note that replicas will still silently drop promise/write requests if another
kind of problem occurs (e.g., an I/O error prevents reading/writing log
data). We might consider changing this, although it will require some thought:
e.g., if a replica's disk is broken, sending an "ignored" message on every
request might flood the network.

CODE REVIEW TO DISCUSS / FIX:

* Test mock is incredibly ugly: it works, but we clearly need a better approach
  before committing this. I've been chatting with @tnachen to find a better
  approach but haven't got anything that works yet.

* Should we add a backoff when retrying after a failed coordinator election?

* Should we also send back an "ignored" response if an I/O error occurs?


Diffs (updated)
-----

  src/log/consensus.cpp 59f80d02d1d3c11683631f3fc5f6e923b5ebdf96 
  src/log/coordinator.cpp 5500bca77f3020e0051010c5c178a20a3c7ad44a 
  src/log/replica.hpp 33d3f1d9e89035936c67739898e73a06b391fcd0 
  src/log/replica.cpp 75d39ff56822bf00fce9daf5c1e3befb75f2e039 
  src/messages/log.proto d73b33f865963292af580945659ad0e800f2a204 
  src/state/log.cpp a75a605a4b0edb8863a3378e2133df7d6eb1cc3d 
  src/tests/log_tests.cpp f2dd47cfbe73fb18c360a637db009b7d391a782e 
  src/tests/slave_tests.cpp 10a4fa7eaa8e868ccc6d60ac220d66a4f0a523b4 

Diff: https://reviews.apache.org/r/39325/diff/


Testing
-------

"make check" passes, including a new test that uses a newly constructed mock to 
ensure we're testing the message schedule described above.

I also wrote a script stops and starts mesos-master in a loop, removing the 
replicated log each time. Without the patch, this occasionally fails with a 
"registry fetch" timeout; with the patch, you can observe several scenarios 
where coordinator election is reborted and retried because a quorum of ignored 
responses is seen. Note that in some cases, we need to retry coordinator 
election up to ~70 times (!), because we don't currently use a backoff; that 
should probably be fixed, per comments above. But the important point is that 
election eventually succeeds and we don't hang.


Thanks,

Neil Conway

Re: Review Request 39325: Fixed race between coordinator election and recovery in replicated log.

Reply via email to