Joseph Wu created MESOS-5576:
--------------------------------
Summary: Masters may drop the first message they send between
masters after a network partition
Key: MESOS-5576
URL: https://issues.apache.org/jira/browse/MESOS-5576
Project: Mesos
Issue Type: Bug
Components: leader election, master, replicated log
Affects Versions: 0.28.2
Environment: Observed in an OpenStack environment where each master
lives on a separate VM.
Reporter: Joseph Wu
We observed the following situation in a cluster of five masters:
|| Time || Master 1 || Master 2 || Master 3 || Master 4 || Master 5 ||
| 0 | Follower | Follower | Follower | Follower | Leader |
| 1 | Follower | Follower | Follower | Follower || Partitioned from cluster by downing this VM's network ||
| 2 || Elected Leader by ZK | Voting | Voting | Voting | Suicides due to lost leadership |
| 3 | Performs consensus | Replies to leader | Replies to leader | Replies to leader | Still down |
| 4 | Performs writing | Acks to leader | Acks to leader | Acks to leader | Still down |
| 5 | Leader | Follower | Follower | Follower | Still down |
| 6 | Leader | Follower | Follower | Follower | Comes back up |
| 7 | Leader | Follower | Follower | Follower | Follower |
| 8 || Partitioned in the same way as Master 5 | Follower | Follower | Follower | Follower |
| 9 | Suicides due to lost leadership || Elected Leader by ZK | Follower | Follower | Follower |
| 10 | Still down | Performs consensus | Replies to leader | Replies to leader || Doesn't get the message! ||
| 11 | Still down | Performs writing | Acks to leader | Acks to leader || Acks to leader ||
| 12 | Still down | Leader | Follower | Follower | Follower |
Master 2 (the leader at times 10 and 11 in the table above) sends a series of
messages to the recently-restarted Master 5. The first message is dropped, but
subsequent messages are delivered.
This appears to be due to a stale link between the masters. Before leader
election, the replicated log actors create a network watcher, which adds links
to masters that join the ZK group:
https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/network.hpp#L157-L159
This link (Master 2 -> 5) does not appear to break when Master 5 goes down,
perhaps because of how the network partition was induced (in the hypervisor
layer, rather than in the VM itself).
When Master 2 tries to send a {{PromiseRequest}} to Master 5, we do not
observe the [expected log
message|https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/replica.cpp#L493-L494].
Instead, we see a log line in Master 2:
{code}
process.cpp:2040] Failed to shutdown socket with fd 27: Transport endpoint is not connected
{code}
The broken link is then removed by the libprocess {{socket_manager}}, and the
following {{WriteRequest}} from Master 2 to Master 5 succeeds via a new socket.
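The drop-then-succeed pattern described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not libprocess code: {{Peer}}, {{ToySocketManager}}, {{ScenarioResult}}, {{replayScenario}} and the "incarnation" counter are all hypothetical names invented for illustration. The model assumes that a cached link bound to a peer's previous incarnation fails on the first write, that the failed write loses the message and discards the link, and that the next send opens a fresh socket.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a remote master; not a Mesos or libprocess type.
struct Peer {
  int incarnation = 1;  // bumped every time the peer restarts
  int received = 0;     // number of messages actually delivered
};

// Toy model of a socket manager that caches one outbound link per peer.
class ToySocketManager {
public:
  // Cache a link bound to the peer's current incarnation.
  void link(const std::string& name, const Peer& peer) {
    links_[name] = peer.incarnation;
  }

  // Returns true if the message was delivered to the peer.
  bool send(const std::string& name, Peer& peer) {
    auto it = links_.find(name);
    if (it != links_.end() && it->second != peer.incarnation) {
      // Stale link: the write fails, the message is lost, and the
      // broken link is discarded so that a later send can reconnect.
      links_.erase(it);
      return false;
    }
    if (it == links_.end()) {
      // No cached link: open a "new socket" to the current incarnation.
      links_[name] = peer.incarnation;
    }
    ++peer.received;
    return true;
  }

private:
  std::map<std::string, int> links_;  // peer name -> incarnation linked to
};

struct ScenarioResult {
  bool firstDelivered;
  bool secondDelivered;
  int received;
};

// Replays the sequence from the report: link, peer restart that goes
// unnoticed by the sender, then two consecutive sends.
ScenarioResult replayScenario() {
  Peer restarted;
  ToySocketManager manager;
  manager.link("restarted-master", restarted);
  ++restarted.incarnation;  // restart; the cached link is now stale
  bool first = manager.send("restarted-master", restarted);
  bool second = manager.send("restarted-master", restarted);
  return {first, second, restarted.received};
}
```

In this model, {{replayScenario()}} reports the first send as lost and the second as delivered, so the peer receives exactly one of the two messages. Whether the real {{socket_manager}} behaves this way in every partition scenario is not established by this report.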