[
https://issues.apache.org/jira/browse/MESOS-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14909536#comment-14909536
]
Neil Conway commented on MESOS-3280:
------------------------------------
To follow up on the bug in the auto-initialization code (first bullet above),
there's a race condition between the log recovery (auto-initialization)
protocol and the coordinator election protocol:
* to elect the coordinator, we try to pass an implicit promise (note that
there's no retry mechanism)
* to recover the log, we do a two-phase broadcast (see RecoverProtocolProcess),
where each node goes from EMPTY => STARTING => VOTING
* if a node in EMPTY or STARTING state receives a promise request, it silently
ignores it (see the sketch after this list)
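To make the interaction concrete, here is a minimal standalone sketch of a replica's
auto-initialization state machine and what happens when a promise request arrives
before the replica reaches VOTING. The names (Replica, onPromiseRequest(), etc.) are
illustrative only, not the actual Mesos classes:
{code}
#include <iostream>
#include <optional>

enum class Status { EMPTY, STARTING, VOTING };

struct Replica {
  Status status = Status::EMPTY;

  // Phase 1 of the recovery broadcast: EMPTY => STARTING.
  void onRecoverPhase1() { if (status == Status::EMPTY) status = Status::STARTING; }

  // Phase 2 of the recovery broadcast: STARTING => VOTING.
  void onRecoverPhase2() { if (status == Status::STARTING) status = Status::VOTING; }

  // Implicit promise request from a would-be coordinator. Only a VOTING
  // replica replies; otherwise the request is dropped, and since the
  // coordinator does not retry, the election simply never completes.
  std::optional<bool> onPromiseRequest() {
    if (status != Status::VOTING) {
      return std::nullopt;  // silently ignored: no reply is sent
    }
    return true;            // would reply with a promise
  }
};

int main() {
  Replica r;
  r.onRecoverPhase1();                 // EMPTY => STARTING
  if (!r.onPromiseRequest()) {
    std::cout << "promise request ignored; election stalls" << std::endl;
  }
  r.onRecoverPhase2();                 // STARTING => VOTING
  std::cout << std::boolalpha << r.onPromiseRequest().value() << std::endl;
}
{code}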
Moreover, AFAICS there is no synchronization between starting the log recovery
protocol and doing coordinator election: coordinator election happens as soon
as we detect we're the Zk leader (RegistrarProcess::recover(), which calls
LogWriterProcess::start(), which tries to be elected as the coordinator),
whereas log recovery/auto-init actually starts earlier (in main() in
master/main.cpp). We wait on the `recovering` promise *locally* before starting
coordinator election at the Zk leader, but that doesn't mean that log recovery
has finished at a quorum of other nodes.
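As a rough illustration of why waiting on the local `recovering` future is not
enough: the coordinator needs promises from a quorum, but the remote replicas may
still be in STARTING and will silently drop the (unretried) request. Again, this is
a simplified model, not the real code paths:
{code}
#include <cstddef>
#include <iostream>
#include <vector>

enum class Status { EMPTY, STARTING, VOTING };

int main() {
  const size_t quorum = 2;  // 3 masters, quorum of 2

  // The Zk leader's local replica has finished recovery, but the remote
  // replicas have not yet completed the STARTING => VOTING transition.
  std::vector<Status> replicas = {Status::VOTING, Status::STARTING, Status::STARTING};

  // One implicit-promise round: only VOTING replicas reply at all.
  size_t promises = 0;
  for (Status s : replicas) {
    if (s == Status::VOTING) {
      promises++;
    }
  }

  if (promises < quorum) {
    // No retry follows, so the coordinator is never elected even though the
    // remote replicas would have become VOTING shortly afterwards.
    std::cout << "election failed: " << promises << "/" << quorum
              << " promises" << std::endl;
  }
}
{code}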
I'll attach a patch with a test case that makes the race condition more likely
by having 2 of the 3 nodes sleep before transitioning from STARTING => VOTING.
I'll also attach a log of an execution that shows the problem; note that you
need to annotate the replicated log code with a bunch of extra LOG()s to see
when messages are ignored (this could also be improved).
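The kind of annotation I mean is roughly the following; the call site shown is a
hypothetical stand-in for the replica code, only the glog usage is real:
{code}
#include <cstdint>

#include <glog/logging.h>

// Hypothetical stand-in for the replica's promise handler; the real call
// sites live in the replicated log code and are not shown here.
void onPromiseRequest(bool voting, uint64_t proposal) {
  if (!voting) {
    // The extra annotation: make the silent drop visible in the logs.
    LOG(INFO) << "Ignoring promise request for proposal " << proposal
              << " because the replica is not in VOTING status";
    return;
  }
  LOG(INFO) << "Replying to promise request for proposal " << proposal;
}

int main(int argc, char** argv) {
  google::InitGoogleLogging(argv[0]);
  FLAGS_logtostderr = true;  // print to stderr instead of log files
  onPromiseRequest(false, 1);
  onPromiseRequest(true, 2);
  return 0;
}
{code}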
There are a few different ways we can fix the problem: e.g., by adding a retry to
the coordinator election protocol, or by ensuring we have a quorum of VOTING
nodes before trying to elect a coordinator (the latter approach seems like it
would be quite racy, though). I'll propose a fix shortly.
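For the retry option, the shape I have in mind is roughly the following (a sketch
only, with made-up helper names; the real change would live in the coordinator
election path):
{code}
#include <chrono>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

enum class Status { EMPTY, STARTING, VOTING };

// Made-up helper: run one implicit-promise round and report whether a quorum
// of replicas replied (non-VOTING replicas stay silent).
bool tryElect(const std::vector<Status>& replicas, size_t quorum) {
  size_t promises = 0;
  for (Status s : replicas) {
    if (s == Status::VOTING) {
      promises++;
    }
  }
  return promises >= quorum;
}

int main() {
  std::vector<Status> replicas = {Status::VOTING, Status::STARTING, Status::STARTING};
  const size_t quorum = 2;

  auto backoff = std::chrono::milliseconds(100);
  for (int attempt = 1; attempt <= 5; attempt++) {
    if (tryElect(replicas, quorum)) {
      std::cout << "elected on attempt " << attempt << std::endl;
      break;
    }
    std::cout << "attempt " << attempt << " failed; retrying" << std::endl;
    std::this_thread::sleep_for(backoff);
    backoff *= 2;  // back off between retries

    // Simulate the remote replicas finishing recovery in the meantime.
    replicas[attempt % replicas.size()] = Status::VOTING;
  }
}
{code}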
Note that there's another possible problem here: depending on the order in
which messages in the log recovery protocol are observed, a node might actually
transition from EMPTY => STARTING => RECOVERING, at which point it will do the
catchup protocol. From talking with [~jieyu], this seems unexpected and may be
problematic. I haven't found a reproducible test case yet, but I'll follow up
with Jie.
> Master fails to access replicated log after network partition
> -------------------------------------------------------------
>
> Key: MESOS-3280
> URL: https://issues.apache.org/jira/browse/MESOS-3280
> Project: Mesos
> Issue Type: Bug
> Components: master, replicated log
> Affects Versions: 0.23.0
> Environment: Zookeeper version 3.4.5--1
> Reporter: Bernd Mathiske
> Assignee: Neil Conway
> Labels: mesosphere
> Attachments: rep-log-startup-race-test-1.patch
>
>
> In a 5 node cluster with 3 masters and 2 slaves, and ZK on each node, when a
> network partition is forced, all the masters apparently lose access to their
> replicated log. The leading master halts, for unknown reasons but presumably
> related to replicated log access. The other masters fail to recover from the
> replicated log, also for unknown reasons. This could have to do with the ZK
> setup, but it might also be a Mesos bug.
> This was observed in a Chronos test drive scenario described in detail here:
> https://github.com/mesos/chronos/issues/511
> With setup instructions here:
> https://github.com/mesos/chronos/issues/508