Hi Maxime,

Thanks for the feedback!

The proposed approach is deliberately simple. The "Discussion"
section of the design doc describes the rationale for starting with a
very simple scheme. In short:

(a) we want to assign clear semantics to the levels of the hierarchy
(regions are far away from each other and inter-region network links
have high latency; racks are close together and inter-rack network
links have low latency).

(b) we don't want to make life too difficult for framework authors.

(c) most server software (e.g., HDFS, Kafka, Cassandra) only
understands a simple hierarchy -- in many cases, just a single level
("racks"), or occasionally two levels ("racks" and "DCs").

Can you elaborate on the use cases you see for a more complex
hierarchy of fault domains? I'd be happy to chat off-list if you'd
prefer.

Thanks!

Neil

On Tue, Apr 18, 2017 at 1:33 AM, Maxime Brugidou
<maxime.brugi...@gmail.com> wrote:
> Hi Neil,
>
> I really like the idea of incorporating the concept of fault domains in
> Mesos, however I feel like the implementation proposed is a bit narrow to be
> actually useful for most users.
>
> I feel like we could make the fault domain definition more generic. For
> example, in our setup we would like to have something like Region >
> Building > Cage > Pod > Rack. Fault domains would be hierarchically
> arranged (meaning a domain at a lower level can only be included in one
> domain at the level above).
>
> As a concrete example, we could have the mesos masters be aware of the fault
> domain hierarchy (with a config map for example), and slaves would just need
> to declare their lowest-level domain (for example their rack id). Then
> frameworks could use this domain hierarchy at will. If they need to "spread"
> their tasks for a very highly available setup, they could first spread using
> the highest fault domain (like the region), then if they have enough tasks
> to launch they could spread within each sub-domain recursively until they
> run out of tasks to spread. We do not need to artificially limit the
> number of levels in the hierarchy or the names of the levels;
> schedulers do not need to know the names, just the hierarchy.
>
> Then, to provide the other feature of "remote" slaves that you describe, we
> could configure the mesos master to only send offers from a "default" local
> fault domain, and frameworks would need to advertise a certain capability to
> receive offers for other remote fault domains.
>
> I feel we could implement this by identifying a fault domain with a simple
> list of ids like ["US-WEST-1", "Building 2", "Cage 3", "POD 12", "Rack 3"]
> or ["US-EAST-2", "Building 1"]. Slaves would advertise their
> lowest-level fault domain, and schedulers could treat it as a
> hierarchical list of arbitrary depth.
>
> Thanks,
> Maxime
>
> On Mon, Apr 17, 2017 at 6:45 PM Neil Conway <neil.con...@gmail.com> wrote:
>>
>> Folks,
>>
>> I'd like to enhance Mesos to support a first-class notion of "fault
>> domains" -- i.e., identifying the "rack" and "region" (DC) where a
>> Mesos agent or master is located. The goal is to enable two main
>> features:
>>
>> (1) To make it easier to write "rack-aware" Mesos frameworks that are
>> portable to different Mesos clusters.
>>
>> (2) To improve the experience of configuring Mesos with a set of
>> masters and agents in one DC, and another pool of "remote" agents in a
>> different DC.
>>
>> For more information, please see the design doc:
>>
>>
>> https://docs.google.com/document/d/1gEugdkLRbBsqsiFv3urRPRNrHwUC-i1HwfFfHR_MvC8
>>
>> I'd love any feedback, either directly on the Google doc or via email.
>>
>> Thanks,
>> Neil
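For readers skimming the thread, Maxime's recursive spreading idea can be sketched roughly as follows. This is an illustrative sketch only, not Mesos code: the `spread` function, the `(name, domain_path)` agent representation, and the round-robin tie-breaking at the bottom level are all assumptions made for the example.

```python
from collections import defaultdict

def spread(agents, num_tasks, depth=0):
    """Spread num_tasks across agents by recursing down the fault
    domain hierarchy: split tasks as evenly as possible across the
    domains at the current level, then recurse within each domain.

    Each agent is a (name, domain_path) pair, where domain_path is a
    full path like ["US-WEST-1", "Building 2", "Cage 3", "POD 12",
    "Rack 3"]; all paths are assumed to have the same depth.
    Returns one agent name per task.
    """
    if num_tasks == 0 or not agents:
        return []
    # Bottom of the hierarchy: place tasks round-robin on the agents.
    if depth >= len(agents[0][1]):
        names = [name for name, _ in agents]
        return [names[i % len(names)] for i in range(num_tasks)]
    # Group agents by their domain at this level of the hierarchy.
    groups = defaultdict(list)
    for name, path in agents:
        groups[path[depth]].append((name, path))
    # Split the tasks as evenly as possible across the groups, then
    # recurse one level down within each group.
    placements = []
    domains = sorted(groups)
    base, extra = divmod(num_tasks, len(domains))
    for i, domain in enumerate(domains):
        share = base + (1 if i < extra else 0)
        placements.extend(spread(groups[domain], share, depth + 1))
    return placements
```

With agents in two regions, tasks are split across the regions first, and only then across racks within each region, which matches the "spread at the highest fault domain, then recurse" behavior described above.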
