Re: Review Request 39325: Fixed race between coordinator election and recovery in replicated log.

Neil Conway Mon, 19 Oct 2015 16:34:42 -0700


> On Oct. 15, 2015, 12:31 a.m., Jie Yu wrote:
> > src/log/replica.cpp, line 209
> > <https://reviews.apache.org/r/39325/diff/2/?file=1098524#file1098524line209>
> >
> >     We typically do the cleanup in a separate patch to reduce the size for 
> > the main patch (easiser for reviewers to review).


Sounds good!


> On Oct. 15, 2015, 12:31 a.m., Jie Yu wrote:
> > src/messages/log.proto, line 148
> > <https://reviews.apache.org/r/39325/diff/2/?file=1098525#file1098525line148>
> >
> >     I am just thinking about the rolling upgrade case. What happens if the 
> > old coordinator receives a response from a new replica. According to the 
> > current semantics, it'll be treated as a NACK. Is that OK? Will that 
> > unnecessarily demote the old coordinator?
> >     
> >     If you think that's OK, please add a comment about why it is OK (e.g., 
> > it'll eventually succeed when all replicas/coordinator are updated). If 
> > that's the case, we definitely need to call out this in upgrades.md.
> >     
> >     Ditto for the 'ignored' field in WriteResponse.
> 
> Neil Conway wrote:
>     As I see it, we have three options when the candidate coordinator is old 
> (<= 0.25) but one or more replicas are new:
>     
>     1. New replica sends "ignored = true, okay = false", which means that the 
> coordinator will view this as a NACK and potentially get demoted.
>     2. New replica sends "ignored = true, okay = true": clearly this is wrong
>     3. New replica only sends an "ignored" response if the master is 
> sufficiently new; otherwise, it keeps the old behavior of silently ignoring 
> the promise/write request
>     
>     Obviously #2 is a non-starter; I think #1 should be fine, and is 
> preferrable to the complexity of having replicas track the coordinator's 
> version.

I implemented #1. I thought about how to discuss this in upgrades.md, but it is 
such an implementation detail that I'm not sure it is worth going into detail. 
Happy to take another look if you disagree.


- Neil


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39325/#review102710
-----------------------------------------------------------


On Oct. 19, 2015, 11:29 p.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/39325/
> -----------------------------------------------------------
> 
> (Updated Oct. 19, 2015, 11:29 p.m.)
> 
> 
> Review request for mesos, Jie Yu, Joris Van Remoortere, and Timothy Chen.
> 
> 
> Bugs: MESOS-3280
>     https://issues.apache.org/jira/browse/MESOS-3280
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> MESOS-3280. The basic problem is that replicas silently ignore inbound Promise
> and Write requests if they have not finished the recovery protocol yet 
> (because
> they can't safely vote on such requests). Hence, if we try to do a Paxos round
> while a quorum of nodes have not finished recovering, the Paxos round will 
> never
> complete. In particular, this might happen during coordinator election:
> coordinator election (which is implemented as performing a full Paxos round)
> starts as soon as the candidate coordinator replica has finished the recovery
> protocol. If several nodes start concurrently, a quorum of those nodes might
> still be executing the recovery protocol, and hence the coordinator will never
> be elected.
> 
> To address this, add "ignored" responses to the Promise and Write 
> sub-protocols:
> if a proposer sees a quorum of "ignored" responses to a promise or write 
> request
> it has issued, it knows the request will never succeed.  When used for
> coordinator election, the current coding will retry immediately (without a
> backoff).
> 
> Note that replicas will still silently drop promise/write requests if another
> kind of problem occurs (e.g., an I/O error prevents reading/writing log
> data). We might consider changing this, although it will require some thought:
> e.g., if a replica's disk is broken, sending an "ignored" message on every
> request might flood the network.
> 
> CODE REVIEW TO DISCUSS / FIX:
> 
> * Test mock is incredibly ugly: it works, but we clearly need a better 
> approach
>   before committing this. I've been chatting with @tnachen to find a better
>   approach but haven't got anything that works yet.
> 
> * Should we add a backoff when retrying after a failed coordinator election?
> 
> * Should we also send back an "ignored" response if an I/O error occurs?
> 
> 
> Diffs
> -----
> 
>   src/log/consensus.cpp 59f80d02d1d3c11683631f3fc5f6e923b5ebdf96 
>   src/log/coordinator.cpp 5500bca77f3020e0051010c5c178a20a3c7ad44a 
>   src/log/replica.hpp 33d3f1d9e89035936c67739898e73a06b391fcd0 
>   src/log/replica.cpp 75d39ff56822bf00fce9daf5c1e3befb75f2e039 
>   src/messages/log.proto d73b33f865963292af580945659ad0e800f2a204 
>   src/tests/log_tests.cpp f2dd47cfbe73fb18c360a637db009b7d391a782e 
> 
> Diff: https://reviews.apache.org/r/39325/diff/
> 
> 
> Testing
> -------
> 
> "make check" passes, including a new test that uses a newly constructed mock 
> to ensure we're testing the message schedule described above.
> 
> I also wrote a script stops and starts mesos-master in a loop, removing the 
> replicated log each time. Without the patch, this occasionally fails with a 
> "registry fetch" timeout; with the patch, you can observe several scenarios 
> where coordinator election is reborted and retried because a quorum of 
> ignored responses is seen. Note that in some cases, we need to retry 
> coordinator election up to ~70 times (!), because we don't currently use a 
> backoff; that should probably be fixed, per comments above. But the important 
> point is that election eventually succeeds and we don't hang.
> 
> 
> Thanks,
> 
> Neil Conway
> 
>

Re: Review Request 39325: Fixed race between coordinator election and recovery in replicated log.

Reply via email to