[
https://issues.apache.org/jira/browse/MESOS-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033932#comment-15033932
]
Neil Conway commented on MESOS-3548:
------------------------------------
Hi Elouan,
That's awesome that you're interested in this area! We're working on setting up
a special-interest group for federation, and we'll be sure to include you.
> Investigate federations of Mesos masters
> ----------------------------------------
>
> Key: MESOS-3548
> URL: https://issues.apache.org/jira/browse/MESOS-3548
> Project: Mesos
> Issue Type: Improvement
> Reporter: Neil Conway
> Labels: federation, mesosphere, multi-dc
>
> In a large Mesos installation, the operator might want to ensure that even if
> the Mesos masters are inaccessible or have failed, new tasks can still be
> scheduled (across multiple different frameworks). HA masters are only a
> partial solution here: the masters might still be inaccessible due to a
> correlated failure (e.g., ZooKeeper misconfiguration/human error).
>
> To address this, we could introduce the notion of "hierarchies" or
> "federations" of Mesos masters. In a Mesos installation with 10k machines,
> the operator might configure 10 Mesos masters (each of which might be HA) to
> manage 1k machines each. Then an additional "meta-master" would manage the
> allocation of cluster resources to the 10 masters. Hence, the failure of any
> individual master would impact 1k machines at most. The meta-master might not
> have a lot of work to do: e.g., it might be limited to occasionally
> reallocating cluster resources among the 10 masters, or ensuring that newly
> added cluster resources are allocated among the masters as appropriate.
> Hence, the failure of the meta-master would not prevent any of the individual
> masters from scheduling new tasks. A single framework instance probably
> wouldn't be able to use more resources than have been assigned to a single
> master, but that seems like a reasonable restriction.
>
> This feature might also be a good fit for a multi-datacenter deployment of
> Mesos: each Mesos master instance would manage a single DC. Naturally,
> reducing the traffic between frameworks and the meta-master would be
> important for performance reasons in a configuration like this.
>
> Operationally, this might be simpler if Mesos processes were self-hosting
> ([MESOS-3547]).
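
The partitioning described in the issue can be sketched roughly as follows.
This is only an illustration of the failure-isolation argument, not Mesos
code: assign_machines, schedulable_machines, the round-robin policy, and the
machine/master names are all hypothetical. It assumes 10k machines split
across 10 masters and shows that losing one master affects at most 1k
machines, while scheduling on the rest needs no meta-master involvement.

```python
from collections import defaultdict


def assign_machines(machine_ids, master_names):
    """Hypothetical meta-master policy: spread machines round-robin over masters."""
    assignment = defaultdict(list)
    for i, machine in enumerate(machine_ids):
        master = master_names[i % len(master_names)]
        assignment[master].append(machine)
    return dict(assignment)


def schedulable_machines(assignment, failed_masters):
    """Machines whose master is still up.

    Only the per-master assignment is consulted, so this keeps working
    even if the (hypothetical) meta-master itself is unavailable.
    """
    return [
        machine
        for master, machines in assignment.items()
        if master not in failed_masters
        for machine in machines
    ]


# Numbers taken from the issue description: 10k machines, 10 masters.
machines = [f"machine-{i}" for i in range(10_000)]
masters = [f"master-{i}" for i in range(10)]

assignment = assign_machines(machines, masters)

# Worst-case number of machines affected by a single master failure.
blast_radius = max(len(group) for group in assignment.values())

# Machines that can still accept new tasks after one master fails.
remaining = schedulable_machines(assignment, {"master-3"})
```

With these inputs each master ends up with 1,000 machines, so a single master
failure leaves 9,000 machines schedulable. A real allocator would presumably
weigh capacity and locality (e.g., one master per datacenter) rather than use
round-robin.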