[
https://issues.apache.org/jira/browse/MESOS-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15043346#comment-15043346
]
Shuai Lin commented on MESOS-1806:
----------------------------------
There are two situations to handle:
* Etcd servers won't accept requests from clients during the leader election
phase. So when there is a leader re-election among the etcd servers, the
request from the current master to renew the timestamp of the {{v2/keys/mesos}}
node would fail, and the current code would immediately retry with the next
server, which would refuse the request as well. Thus the master would exit
because all servers failed its requests. The same happens with slaves -- the
detector would fail after requests to all the etcd servers are refused. To solve this,
we can add logic to wait for a while before trying the next server.
* If the current master somehow fails to update the {{v2/keys/mesos}} node
in time, that node would expire, the detector would detect this, and the master
would exit due to loss of leadership. This is correct behavior, but the current
TTL is kind of small: only 5 seconds, and the current master is set to update
the node at 80% of the TTL, i.e. 4 seconds, so the chance of this problem is not
that low, e.g. if there is a transient network problem. This can be mitigated by
increasing the TTL to 10 seconds, or by letting the current master try to update
the node at 60% of the TTL.
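A rough sketch of both ideas in Python, only to illustrate the logic (this is not the actual Mesos C++ code; the function names, the {{send}} callback, and the default retry count and wait time are assumptions made up for the example):

```python
import time


def request_with_wait(servers, send, rounds=3, wait_secs=1.0, sleep=time.sleep):
    """Try each server in turn, pausing after every refused request.

    `send(server)` is a hypothetical callback that returns True on success.
    Returns the server that accepted the request, or None if every attempt
    in every round was refused.
    """
    for _ in range(rounds):
        for server in servers:
            if send(server):
                return server
            # Give a leader election time to finish instead of hitting the
            # next server immediately.
            sleep(wait_secs)
    return None


def refresh_slack(ttl_secs, refresh_fraction):
    """Seconds left between the scheduled refresh and the node expiring."""
    return ttl_secs * (1.0 - refresh_fraction)
```

With the current settings, {{refresh_slack(5, 0.8)}} is about 1 second, so a single delayed request can let the node expire. Either proposed change (a 10 second TTL refreshed at 80%, or a 5 second TTL refreshed at 60%) leaves about 2 seconds.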
[~cmaloney] [~benjaminhindman] What do you think?
> Substituting etcd for Zookeeper
> -------------------------------
>
> Key: MESOS-1806
> URL: https://issues.apache.org/jira/browse/MESOS-1806
> Project: Mesos
> Issue Type: Task
> Components: leader election
> Reporter: Ed Ropple
> Assignee: Shuai Lin
> Priority: Minor
>
> <adam_mesos> eropple: Could you also file a new JIRA for Mesos to drop ZK
> in favor of etcd or ReplicatedLog? Would love to get some momentum going on
> that one.
> --
> Consider it filed. =)