Neil Conway created MESOS-3548:
----------------------------------

             Summary: Support federations of Mesos masters
                 Key: MESOS-3548
                 URL: https://issues.apache.org/jira/browse/MESOS-3548
             Project: Mesos
          Issue Type: Improvement
            Reporter: Neil Conway


In a large Mesos installation, the operator might want to ensure that even if 
the Mesos masters are inaccessible or have failed, new tasks can still be 
scheduled (across multiple different frameworks). HA masters are only a partial 
solution here: the masters might still be inaccessible due to a correlated 
failure (e.g., ZooKeeper misconfiguration/human error).

To address this, we could introduce the notion of "hierarchies" or "federations" 
of Mesos masters. In a Mesos installation with 10k machines, the operator might 
configure 10 Mesos masters (each of which might be HA) to manage 1k machines 
each. Then an additional "meta-master" would manage the allocation of cluster 
resources to the 10 masters. Hence, the failure of any individual master would 
impact 1k machines at most. The meta-master might not have a lot of work to do: 
e.g., it might be limited to occasionally reallocating cluster resources among 
the 10 masters, or ensuring that newly added cluster resources are allocated 
among the masters as appropriate. Hence, the failure of the meta-master would 
not prevent any of the individual masters from scheduling new tasks. A single 
framework instance probably wouldn't be able to use more resources than have 
been assigned to a single master, but that seems like a reasonable restriction.
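
As a rough illustration of the failure-isolation argument above, here is a 
minimal Python sketch. It is not based on any existing Mesos API: the Master 
and MetaMaster classes, the assign() and schedulable_machines() methods, and 
the "fewest machines first" placement rule are all hypothetical names and 
choices made up for this example. It only models the bookkeeping: a 
meta-master divides 10k machines among 10 masters, and the loss of one master 
takes at most 1k machines out of scheduling.

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    """One (possibly HA) Mesos master and the machines delegated to it (hypothetical model)."""
    name: str
    machines: set = field(default_factory=set)
    healthy: bool = True


class MetaMaster:
    """Hypothetical coordinator whose only job is dividing machines among masters."""

    def __init__(self, masters):
        self.masters = list(masters)

    def assign(self, new_machines):
        # Assumed policy: give each new machine to the master that
        # currently manages the fewest machines.
        for machine in new_machines:
            target = min(self.masters, key=lambda m: len(m.machines))
            target.machines.add(machine)

    def schedulable_machines(self):
        # Count machines whose own master is still up. This does not
        # depend on the meta-master being reachable at scheduling time.
        return sum(len(m.machines) for m in self.masters if m.healthy)


masters = [Master(f"master-{i}") for i in range(10)]
meta = MetaMaster(masters)
meta.assign(f"machine-{i}" for i in range(10_000))

# Simulate the failure of a single master: only its 1k machines are affected.
masters[0].healthy = False
print(meta.schedulable_machines())
```

In this model the meta-master is only involved when machines are added or 
reassigned, which is why its failure would not block the individual masters 
from scheduling new tasks.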

This feature might also be a good fit for a multi-datacenter deployment of 
Mesos: each Mesos master instance would manage a single DC. Naturally, reducing 
the traffic between frameworks and the meta-master would be important for 
performance reasons in a configuration like this.

Operationally, this might be simpler if Mesos processes were self-hosting 
([MESOS-3547]).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
