[ https://issues.apache.org/jira/browse/MESOS-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957482#comment-14957482 ]
Neil Conway commented on MESOS-3280:
------------------------------------
Fix for the race condition is here: https://reviews.apache.org/r/39325/
Note that the testing mock needs to be rethought (I'm working on how to do this
properly), and a few details still need discussion (e.g., whether to use a
backoff when retrying a failed coordinator election).
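
To make the backoff question concrete, here is a minimal sketch of retrying a
failed operation with capped exponential backoff and jitter. It is not taken
from the patch and is not the Mesos implementation: `retry_with_backoff`,
`attempt`, and all parameter names and default values are hypothetical, chosen
only for illustration.

```python
import random
import time


def retry_with_backoff(attempt, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call attempt() until it returns a truthy value or attempts run out.

    Returns (result, delays): result is the first truthy return value of
    attempt(), or None if every attempt failed; delays lists the waits
    (in seconds) taken between attempts.
    """
    delays = []
    for n in range(max_attempts):
        result = attempt()
        if result:
            return result, delays
        if n == max_attempts - 1:
            break  # No point in waiting after the final attempt.
        # The upper bound doubles on each failure but never exceeds max_delay.
        cap = min(max_delay, base_delay * (2 ** n))
        # Jitter: wait a random fraction of the bound so that several
        # retrying nodes do not all retry at the same instant.
        delay = rng() * cap
        delays.append(delay)
        sleep(delay)
    return None, delays


if __name__ == "__main__":
    # Hypothetical stand-in for an election that fails twice, then succeeds.
    outcomes = iter([False, False, True])
    result, delays = retry_with_backoff(lambda: next(outcomes),
                                        sleep=lambda seconds: None)
    print(result, len(delays))
```

The alternative is to retry immediately, which is simpler but can cause
repeated contention if several candidates keep retrying at the same time.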
> Master fails to access replicated log after network partition
> -------------------------------------------------------------
>
> Key: MESOS-3280
> URL: https://issues.apache.org/jira/browse/MESOS-3280
> Project: Mesos
> Issue Type: Bug
> Components: master, replicated log
> Affects Versions: 0.23.0
> Environment: Zookeeper version 3.4.5--1
> Reporter: Bernd Mathiske
> Assignee: Neil Conway
> Labels: mesosphere
> Attachments: rep-log-race-cond-logs.tar.gz,
> rep-log-startup-race-test-1.patch
>
>
> In a 5-node cluster with 3 masters and 2 slaves, and ZK running on each node,
> forcing a network partition apparently causes all the masters to lose access
> to their replicated log. The leading master halts for unknown reasons,
> presumably related to replicated log access. The other masters fail to recover
> from the replicated log, also for unknown reasons. This could be caused by the
> ZK setup, but it might also be a Mesos bug.
> This was observed in a Chronos test drive scenario described in detail here:
> https://github.com/mesos/chronos/issues/511
> Setup instructions are here:
> https://github.com/mesos/chronos/issues/508