[
https://issues.apache.org/jira/browse/MESOS-5545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15319009#comment-15319009
]
Adam B commented on MESOS-5545:
-------------------------------
Thinking bigger picture, rack awareness is just a crude approximation of the
metrics you really care about: 1) latency to the data/nodes your task cares
about, and 2) fault domains. Latency isn't as trivial as a static topology and
can vary from cluster to cluster (and can be even more complicated if the data
is replicated), and fault domains may be hierarchical (or overlapping, in the
case of network fault domains vs. power fault domains). Rather than adding
"rack" awareness (plus AZ/region awareness) it may be better to focus on a
qualitative QoS.
Example: my task may require <Xms latency between node A and node B. I could
assume that being on the same "rack" would give me this guarantee, but what if
the operator installed a second switch in the rack, separating A and B, and the
connection between them fails? Then, even though the "rack_id" attribute on
these agents stays the same, the latency does not. Conversely, what if A and B
are on different racks, and their infrastructure is upgraded so that the
latency between them drops below my Xms threshold? Now my scheduler is avoiding
these racks even though they meet my performance criteria.
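To make the example concrete, here is a minimal Python sketch that compares a
rack-based placement check with a latency-based one. Nothing here is Mesos API:
the Agent class, both helper functions, the latency table, and the 5 ms
threshold are all hypothetical and exist only for illustration. How the
measured latencies would actually be obtained is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    name: str
    rack_id: str  # static attribute set by the operator (hypothetical)


def same_rack(a: Agent, b: Agent) -> bool:
    """Rack-based check: trusts the static rack_id attribute."""
    return a.rack_id == b.rack_id


def meets_latency(a: Agent, b: Agent, latency_ms: dict, threshold_ms: float) -> bool:
    """Latency-based check: uses a measured value; unknown pairs fail the check."""
    measured = latency_ms.get(frozenset((a.name, b.name)))
    return measured is not None and measured < threshold_ms


a = Agent("A", "rack-1")
b = Agent("B", "rack-1")
c = Agent("C", "rack-2")

# Made-up measurements, keyed by unordered agent pair.
latency_ms = {
    frozenset(("A", "B")): 40.0,  # same rack, but the link between them degraded
    frozenset(("A", "C")): 0.5,   # different racks, upgraded infrastructure
}

THRESHOLD_MS = 5.0  # stands in for the "Xms" requirement

for x, y in [(a, b), (a, c)]:
    print(x.name, y.name,
          "same_rack:", same_rack(x, y),
          "meets_latency:", meets_latency(x, y, latency_ms, THRESHOLD_MS))
```

With these numbers the two checks disagree in both cases: A and B share a
rack_id but miss the latency requirement, while A and C sit on different racks
and satisfy it.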
Unfortunately, I don't have any brilliant solutions for acquiring these latency
and fault domain metrics, as much of that is up to the cloud/infrastructure
provider.
> Add rack awareness support for Mesos resources
> ----------------------------------------------
>
> Key: MESOS-5545
> URL: https://issues.apache.org/jira/browse/MESOS-5545
> Project: Mesos
> Issue Type: Story
> Components: hadoop, master
> Reporter: Fan Du
> Attachments: RackAwarenessforMesos-Lite.pdf
>
>
> Resources managed by the Mesos master carry no topology information about
> the cluster, for example, rack topology, even though many data center
> applications have a rack awareness feature to provide data locality, fault
> tolerance and intelligent task placement. This ticket investigates how to add
> rack awareness to the Mesos resource topology.