Joseph Wu created MESOS-5576:
--------------------------------

             Summary: Masters may drop the first message they send between 
masters after a network partition
                 Key: MESOS-5576
                 URL: https://issues.apache.org/jira/browse/MESOS-5576
             Project: Mesos
          Issue Type: Bug
          Components: leader election, master, replicated log
    Affects Versions: 0.28.2
         Environment: Observed in an OpenStack environment where each master 
lives on a separate VM.
            Reporter: Joseph Wu


We observed the following situation in a cluster of five masters:
|| Time || Master 1 || Master 2 || Master 3 || Master 4 || Master 5 ||
| 0 | Follower | Follower | Follower | Follower | Leader |
| 1 | Follower | Follower | Follower | Follower || Partitioned from cluster by downing this VM's network ||
| 2 || Elected Leader by ZK | Voting | Voting | Voting | Suicides due to lost leadership |
| 3 | Performs consensus | Replies to leader | Replies to leader | Replies to leader | Still down |
| 4 | Performs writing | Acks to leader | Acks to leader | Acks to leader | Still down |
| 5 | Leader | Follower | Follower | Follower | Still down |
| 6 | Leader | Follower | Follower | Follower | Comes back up |
| 7 | Leader | Follower | Follower | Follower | Follower |
| 8 || Partitioned in the same way as Master 5 | Follower | Follower | Follower | Follower |
| 9 | Suicides due to lost leadership || Elected Leader by ZK | Follower | Follower | Follower |
| 10 | Still down | Performs consensus | Replies to leader | Replies to leader || Doesn't get the message! ||
| 11 | Still down | Performs writing | Acks to leader | Acks to leader || Acks to leader ||
| 12 | Still down | Leader | Follower | Follower | Follower |

Master 1 sends a series of messages to the recently-restarted Master 5.  The 
first message is dropped, but subsequent messages are not dropped.

This appears to be due to a stale link between the masters.  Before leader 
election, the replicated log actors create a network watcher, which adds links 
to masters that join the ZK group:
https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/network.hpp#L157-L159
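
The effect of such a watcher can be illustrated with a toy model. This is only a sketch under stated assumptions, not the Mesos or libprocess code: {{ToyNetwork}}, {{set}}, {{socketsOpened}} and {{isMember}} are hypothetical names, and the sketch assumes that linking to a peer that already has a link reuses the existing socket instead of opening a new one.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Toy stand-in for a network watcher; NOT the Mesos or libprocess API.
class ToyNetwork {
public:
  // Called whenever the (hypothetical) ZK group membership changes.
  void set(const std::set<std::string>& members) {
    pids.clear();
    for (const std::string& pid : members) {
      link(pid);
      pids.insert(pid);
    }
  }

  // Number of sockets opened so far for a peer (0 if never linked).
  int socketsOpened(const std::string& pid) const {
    auto it = opened.find(pid);
    return it == opened.end() ? 0 : it->second;
  }

  bool isMember(const std::string& pid) const { return pids.count(pid) > 0; }

private:
  // Assumption: linking to an already-linked peer reuses the old socket,
  // so a link that went stale is never refreshed by a later membership change.
  void link(const std::string& pid) {
    if (links.insert(pid).second) {
      ++opened[pid];
    }
  }

  std::set<std::string> pids;         // current group members
  std::set<std::string> links;        // peers with a (possibly stale) link
  std::map<std::string, int> opened;  // sockets opened per peer
};
```

In this model, a peer that leaves the group and later rejoins is a member again, but still has only one socket opened for it. That is the shape of the stale link suspected here.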

This link does not appear to break (Master 1 -> 5) when Master 5 goes down, 
perhaps due to how the network partition was induced (in the hypervisor layer, 
rather than in the VM itself).

When Master 1 tries to send a {{PromiseRequest}} to Master 5, we do not 
observe the [expected log message|https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/replica.cpp#L493-L494].

Instead, we see a log line in Master 1:
{code}
process.cpp:2040] Failed to shutdown socket with fd 27: Transport endpoint is 
not connected
{code}

The broken link is removed by the libprocess {{socket_manager}} and the 
following {{WriteRequest}} from Master 1 to Master 5 succeeds via a new socket.
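
The drop-then-recover pattern described above can be reproduced with a small toy model. This is a sketch only, not the libprocess {{socket_manager}}: {{ToySocket}}, {{ToySocketManager}}, {{silentlyPartition}} and the other names are hypothetical. It assumes that a failed write on a stale socket discards the message without retrying and removes the socket.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy model only; names are hypothetical and NOT the libprocess API.
struct ToySocket {
  int fd;
  bool peerAlive;  // false models a link whose peer vanished silently
};

class ToySocketManager {
public:
  // Registers a persistent link; the socket is healthy when created.
  void link(const std::string& peer) {
    if (sockets.count(peer) == 0) {
      sockets[peer] = ToySocket{nextFd++, true};
    }
  }

  // Models the peer going away without the local side noticing.
  void silentlyPartition(const std::string& peer) {
    auto it = sockets.find(peer);
    if (it != sockets.end()) {
      it->second.peerAlive = false;
    }
  }

  // Returns true if the message was delivered.
  bool send(const std::string& peer, const std::string& message) {
    auto it = sockets.find(peer);
    if (it == sockets.end()) {
      link(peer);  // no socket yet: open a fresh one
      it = sockets.find(peer);
    }
    if (!it->second.peerAlive) {
      // Assumption: the write fails, the message is dropped without a
      // retry, and the stale socket is removed.
      sockets.erase(it);
      dropped.push_back(message);
      return false;
    }
    delivered.push_back(message);
    return true;
  }

  std::vector<std::string> delivered;
  std::vector<std::string> dropped;

private:
  std::map<std::string, ToySocket> sockets;
  int nextFd = 3;
};
```

In this model, linking to a peer, partitioning it silently, and then sending a {{PromiseRequest}} followed by a {{WriteRequest}} drops the first message and delivers the second over a new socket.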


