[ 
https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-5193:
-----------------------------------
    Attachment: full.log

I've attached an interleaved version of the log where each line is prefixed 
with the node number. You can see the recovery failures of node3 then node1 
then node2 towards the end.

Interestingly, I took a look with [~jieyu] and it appears there may have been 
some message loss or connectivity issues:

(1) When node3 gets elected, node2 appears to be offline. node3 broadcasts an 
implicit promise request to node3 (itself) and node1. *This message is not 
received by node1 for some reason.*

(2) After node3 dies, node1 broadcasts an implicit promise request to node1 
(itself) and node2. *This message is not received by node2 for some reason.*

After this point, only node2 remains, and we do not have quorum.
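
To make the quorum arithmetic concrete, here is a minimal sketch, assuming the 
quorum is a strict majority of the configured masters (3 masters with 
--quorum=2, as in the report). The function names are hypothetical and are not 
part of Mesos:

```python
def majority_quorum(masters: int) -> int:
    """Smallest number of masters that forms a strict majority."""
    return masters // 2 + 1


def has_quorum(alive: int, quorum: int) -> bool:
    """True if enough masters are alive to reach the configured quorum."""
    return alive >= quorum


masters = 3
quorum = majority_quorum(masters)  # 2, matching --quorum=2 in the report

# Walk through the sequence described above: all three masters up,
# then node3 gone, then node1 gone as well (only node2 left).
for alive in (3, 2, 1):
    print(f"alive={alive} quorum={quorum} ok={has_quorum(alive, quorum)}")
```

With a single surviving master the check fails, which is the state described 
above once only node2 remains.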

{quote}
Although, once a master process gets killed the service gets terminated as well.
{quote}

Can you fix that so that the masters are restarted? That is a requirement for 
running HA masters, otherwise we cannot maintain a quorum.
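
As a rough illustration of "restart on exit", here is a minimal sketch of a 
supervising loop. The supervise() helper and its parameters are hypothetical; 
in practice an init system or process supervisor would normally do this job, 
and the command shown in the comment is just the one from the report:

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=None, delay=1.0):
    """Run cmd and start it again whenever it exits.

    max_restarts=None restarts forever; otherwise the function returns the
    total number of launches after the restart budget is used up.
    """
    launches = 0
    while True:
        launches += 1
        code = subprocess.call(cmd)  # blocks until the process exits
        print(f"process exited with code {code}", file=sys.stderr)
        if max_restarts is not None and launches > max_restarts:
            return launches
        time.sleep(delay)  # brief pause to avoid a tight crash loop


# Hypothetical usage with the command line from the report:
# supervise(["/usr/sbin/mesos-master", "--work_dir=/tmp/mesos_dir",
#            "--zk=zk://node1:2181,node2:2181,node3:2181/mesos",
#            "--quorum=2"])
```

The point is only that a master which exits must come back without manual 
intervention, so that the remaining masters can regain a quorum.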

> Recovery failed: Failed to recover registrar on reboot of mesos master
> ----------------------------------------------------------------------
>
>                 Key: MESOS-5193
>                 URL: https://issues.apache.org/jira/browse/MESOS-5193
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.22.0, 0.27.0
>            Reporter: Priyanka Gupta
>              Labels: master, mesosphere
>         Attachments: full.log, node1.log, node1_after_work_dir.log, 
> node2.log, node2_after_work_dir.log, node3.log, node3_after_work_dir.log
>
>
> Hi all, 
> We are using a 3 node cluster with mesos master, mesos slave and zookeeper on 
> all of them. We are using chronos on top of it. The problem is when we reboot 
> the mesos master leader, the other nodes try to get elected as leader but 
> fail with recovery registrar issue. 
> "Recovery failed: Failed to recover registrar: Failed to perform fetch within 
> 1mins"
> The next node then tries to become the leader but again fails with the same error. 
> I am not sure about the issue. We are currently using mesos 0.22 and also 
> tried to upgrade to mesos 0.27 as well but the problem continues to happen. 
>  /usr/sbin/mesos-master --work_dir=/tmp/mesos_dir 
> --zk=zk://node1:2181,node2:2181,node3:2181/mesos --quorum=2
> Can you please help us resolve this issue as it's a production system.
> Thanks,
> Priyanka



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
