[
https://issues.apache.org/jira/browse/MESOS-6828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949915#comment-15949915
]
Joris Van Remoortere commented on MESOS-6828:
---------------------------------------------
Based on some offline discussion I want to suggest that the least dangerous
solution (in my opinion) is to have frameworks prefer offers with the longest
availability by default.
Aurora is a good example of a framework that collects offers and has the
ability to express a preference while iterating the offers to match a task to
launch.
Preferring offers with no unavailability (or with the unavailability furthest
in the future) will naturally steer new tasks away from machines that will be
entering maintenance.
A benefit of this approach is that the agents in the schedule will still be
used if there is demand pressure for resources by the framework.
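To make the preference concrete, here is a minimal sketch of how a framework
might order its collected offers before matching tasks. The Offer class and its
unavailability_start_ns field are hypothetical stand-ins for illustration, not
the Mesos protobuf or Aurora's actual code; the only assumption carried over
from Mesos is that an offer may or may not carry an unavailability start time.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Offer:
    """Hypothetical stand-in for a resource offer (not the real Mesos type)."""
    offer_id: str
    # Assumed field: start of the agent's unavailability in nanoseconds since
    # the epoch, or None when no maintenance is scheduled for the agent.
    unavailability_start_ns: Optional[int] = None


def availability_key(offer: Offer) -> float:
    """Sort key: a larger value means the agent stays available longer."""
    if offer.unavailability_start_ns is None:
        # No scheduled maintenance: treat as available indefinitely.
        return float("inf")
    return float(offer.unavailability_start_ns)


def order_by_preference(offers: List[Offer]) -> List[Offer]:
    """Return offers with the longest availability first.

    Nothing is filtered out, so agents in the maintenance schedule are still
    used when the preferred offers cannot satisfy demand.
    """
    return sorted(offers, key=availability_key, reverse=True)


if __name__ == "__main__":
    collected = [Offer("a", 100), Offer("b"), Offer("c", 500)]
    for offer in order_by_preference(collected):
        print(offer.offer_id)
```

Because this only reorders offers rather than dropping them, it avoids the
stalling scenario: a framework under resource pressure simply works its way
down the list to agents that are closer to maintenance.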
> Consider ways for frameworks to ignore offers with an Unavailability
> --------------------------------------------------------------------
>
> Key: MESOS-6828
> URL: https://issues.apache.org/jira/browse/MESOS-6828
> Project: Mesos
> Issue Type: Improvement
> Reporter: Joris Van Remoortere
> Assignee: Artem Harutyunyan
> Labels: maintenance
>
> Due to the opt-in nature of maintenance primitives in Mesos, there is a
> deficiency for cluster administrators when frameworks have not opted in.
> An example case:
> - Cluster with reasonable churn (tasks terminate naturally)
> - Operator specifies maintenance schedule
> Ideally *even* in a world where none of the frameworks had opted in to
> maintenance primitives the operator would have some way of preventing
> frameworks from scheduling further work on agents in the schedule. The
> natural termination of the tasks in the cluster would allow the nodes to
> drain gracefully and the operator to then perform maintenance.
> 2 options that have been discussed so far:
> # Provide a capability for frameworks to automatically filter offers with an
> {{Unavailability}} set.
> #* Pro: Finer grained control. Allows other frameworks to keep scheduling
> short lived tasks that can complete before the Unavailability.
> #* Con: All frameworks have to be updated. Consider making this an
> environment variable to the scheduler driver for legacy frameworks.
> # Provide a flag on the master to filter all offers with an
> {{Unavailability}} set.
> #* Pro: Immediately actionable / usable.
> #* Con: Coarse grained. Some frameworks may lose efficiency.
> #* Con: *Dangerous*: planning out a multi-day maintenance schedule for an
> entire cluster will prevent any frameworks from scheduling further work,
> potentially stalling the cluster.
> Action Items: Provide further context for each option and consider others. We
> need to ensure we have something immediately consumable by users to fill the
> gap until maintenance primitives are the norm. We also need to ensure we
> prevent dangerous scenarios like the Con listed for option #2.
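As a rough illustration of the finer-grained control described under option 1,
a framework-side check could accept an offer only when the task is expected to
finish before the agent becomes unavailable. This is a sketch under stated
assumptions: the function name, the nanosecond timestamps, and the notion of an
"expected runtime" supplied by the framework are all hypothetical and not part
of any existing Mesos API.

```python
from typing import Optional


def fits_before_unavailability(
    now_ns: int,
    expected_runtime_ns: int,
    unavailability_start_ns: Optional[int],
) -> bool:
    """Return True if a task started now should finish before maintenance.

    All arguments are illustrative: times are nanoseconds since the epoch,
    and unavailability_start_ns is None when the offer has no unavailability.
    """
    if unavailability_start_ns is None:
        # Nothing scheduled on this agent, so any task fits.
        return True
    return now_ns + expected_runtime_ns <= unavailability_start_ns
```

A framework with no runtime estimate for its tasks could not use a check like
this and would have to fall back to skipping or deprioritizing such offers.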