[
https://issues.apache.org/jira/browse/MESOS-6828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15801457#comment-15801457
]
Alexander Rukletsov commented on MESOS-6828:
--------------------------------------------
Another option to consider: we can have the allocator skip agents that are
"unavailable" in the allocation loop.
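
To make the idea concrete, here is a minimal sketch of an allocation loop that
skips agents with an unavailability set. It is a toy model in Python, not the
Mesos allocator: the Agent class, the unavailability_start field, and the
round-robin policy are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Agent:
    # Hypothetical, simplified agent record (not the Mesos data model).
    agent_id: str
    # Start of the scheduled maintenance window (seconds since epoch),
    # or None if the agent is not part of any maintenance schedule.
    unavailability_start: Optional[float] = None


def allocatable_agents(agents: List[Agent]) -> List[Agent]:
    """Keep only agents that have no unavailability set."""
    return [a for a in agents if a.unavailability_start is None]


def allocate(agents: List[Agent], frameworks: List[str]) -> Dict[str, List[str]]:
    """Toy round-robin allocation over the agents that were not skipped.

    Assumes `frameworks` is non-empty.
    """
    offers: Dict[str, List[str]] = {f: [] for f in frameworks}
    for i, agent in enumerate(allocatable_agents(agents)):
        offers[frameworks[i % len(frameworks)]].append(agent.agent_id)
    return offers


agents = [
    Agent("a1"),
    Agent("a2", unavailability_start=1483920000.0),  # in the schedule
    Agent("a3"),
]
offers = allocate(agents, ["f1", "f2"])
```

In this sketch the skip happens before any framework sees an offer, so no
framework has to opt in, but every framework loses access to scheduled agents.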
> Consider ways for frameworks to ignore offers with an Unavailability
> --------------------------------------------------------------------
>
> Key: MESOS-6828
> URL: https://issues.apache.org/jira/browse/MESOS-6828
> Project: Mesos
> Issue Type: Improvement
> Reporter: Joris Van Remoortere
> Assignee: Artem Harutyunyan
> Labels: maintenance
>
> Due to the opt-in nature of maintenance primitives in Mesos, there is a
> deficiency for cluster administrators when frameworks have not opted in.
> An example case:
> - Cluster with reasonable churn (tasks terminate naturally)
> - Operator specifies maintenance schedule
> Ideally *even* in a world where none of the frameworks had opted in to
> maintenance primitives the operator would have some way of preventing
> frameworks from scheduling further work on agents in the schedule. The
> natural termination of the tasks in the cluster would allow the nodes to
> drain gracefully and the operator to then perform maintenance.
> Two options have been discussed so far:
> # Provide a capability for frameworks to automatically filter offers with an
> {{Unavailability}} set.
> #* Pro: Finer grained control. Allows other frameworks to keep scheduling
> short lived tasks that can complete before the Unavailability.
> #* Con: All frameworks have to be updated. Consider making this an
> environment variable to the scheduler driver for legacy frameworks.
> # Provide a flag on the master to filter all offers with an
> {{Unavailability}} set.
> #* Pro: Immediately actionable / usable.
> #* Con: Coarse grained. Some frameworks may lose efficiency.
> #* Con: *Dangerous*: planning out a multi-day maintenance schedule for an
> entire cluster will prevent any frameworks from scheduling further work,
> potentially stalling the cluster.
> Action Items: Provide further context for each option and consider others. We
> need to ensure we have something immediately consumable by users to fill the
> gap until maintenance primitives are the norm. We also need to ensure we
> prevent dangerous scenarios like the Con listed for option #2.
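
The two options in the description can be compared with a small sketch. It is
only a model under stated assumptions: the FILTER_UNAVAILABLE capability name
and the master_filter_all parameter are hypothetical stand-ins for option #1
and option #2, not existing Mesos capabilities or flags.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional

# Hypothetical capability name standing in for option #1.
FILTER_UNAVAILABLE = "FILTER_UNAVAILABLE"


@dataclass
class Offer:
    agent_id: str
    # None means the agent has no Unavailability set.
    unavailability_start: Optional[float] = None


@dataclass
class Framework:
    name: str
    capabilities: FrozenSet[str] = frozenset()


def offers_for(framework: Framework, offers: List[Offer],
               master_filter_all: bool = False) -> List[Offer]:
    """Return the offers a framework would see.

    master_filter_all models option #2 (a master-wide switch);
    the capability check models option #1 (per-framework opt-in).
    """
    drop = master_filter_all or FILTER_UNAVAILABLE in framework.capabilities
    if not drop:
        return list(offers)
    return [o for o in offers if o.unavailability_start is None]


# A cluster-wide, multi-day schedule: every agent has an Unavailability.
scheduled = [Offer("a1", 1483920000.0), Offer("a2", 1484006400.0)]
legacy = Framework("legacy")
opted_in = Framework("aware", frozenset({FILTER_UNAVAILABLE}))

seen_with_master_switch = offers_for(legacy, scheduled, master_filter_all=True)
seen_by_legacy = offers_for(legacy, scheduled)
seen_by_opted_in = offers_for(opted_in, scheduled)
```

With the master-wide switch on and the whole cluster in the schedule, the
legacy framework receives no offers at all, which is the stall described in
the Con for option #2. With per-framework opt-in, only the framework that
asked for filtering stops receiving those offers.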