[
https://issues.apache.org/jira/browse/MESOS-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588449#comment-14588449
]
Vinod Kone commented on MESOS-2246:
-----------------------------------
I think we solved the first part of the problem: rate limiting slave removals.
We still haven't improved the scalability of health checks or made slave
removal SLA-aware. Since the latter two can be epics in themselves, we can
resolve this one and open new ones.
> Improve slave health-checking
> -----------------------------
>
> Key: MESOS-2246
> URL: https://issues.apache.org/jira/browse/MESOS-2246
> Project: Mesos
> Issue Type: Epic
> Components: master, slave
> Reporter: Dominic Hamon
>
> In the event of a network partition, or other systemic issues, we may see
> widespread slave removal. There are several approaches we can take to
> mitigate this issue including, but not limited to:
> . rate limit the slave removal
> . change how we do health checking to not rely on a single point of view
> . work with frameworks to determine SLA of running services before removing
> the slave
> . manual control to allow operator intervention
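
For illustration only, the first approach in the list above (rate limiting slave removal) could be sketched as a token bucket. This is not the Mesos implementation; the class name, method names, and parameters below are all hypothetical. The idea is that removals queue up and only a bounded number run per unit of time, so a slave that reconnects after a partition heals can be taken off the queue before it is ever removed.

```python
import time
from collections import deque


class RemovalRateLimiter:
    """Hypothetical token-bucket limiter for slave removals (not Mesos code).

    At most `burst` removals may run back to back; after that, removals
    are permitted at `rate_per_sec` on average.
    """

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)   # start with a full bucket
        self.clock = clock           # injectable so tests can fake time
        self.last = clock()
        self.pending = deque()       # slaves waiting to be removed, FIFO

    def _refill(self):
        # Add tokens for the time elapsed since the last refill, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def request_removal(self, slave_id):
        # A failed health check only queues the slave; it is not removed yet.
        self.pending.append(slave_id)

    def cancel(self, slave_id):
        # The slave became reachable again before its removal was permitted.
        try:
            self.pending.remove(slave_id)
            return True
        except ValueError:
            return False

    def drain(self):
        # Return the slaves whose removal is permitted right now.
        self._refill()
        removed = []
        while self.pending and self.tokens >= 1.0:
            self.tokens -= 1.0
            removed.append(self.pending.popleft())
        return removed
```

With a burst of 2 and a rate of 0.5 removals per second, five simultaneous health-check failures would lead to two immediate removals, with the rest spread out over the following seconds. That delay is the window in which a healed partition or an operator can cancel the remaining removals.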
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)