[ 
https://issues.apache.org/jira/browse/MESOS-6828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-6828:
--------------------------
    Labels: maintenance mesosphere  (was: maintenance)

> Consider ways for frameworks to ignore offers with an Unavailability
> --------------------------------------------------------------------
>
>                 Key: MESOS-6828
>                 URL: https://issues.apache.org/jira/browse/MESOS-6828
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Joris Van Remoortere
>            Assignee: Artem Harutyunyan
>              Labels: maintenance, mesosphere
>
> Because maintenance primitives in Mesos are opt-in, cluster administrators 
> are left with a gap when frameworks have not opted in.
> An example case:
> - Cluster with reasonable churn (tasks terminate naturally)
> - Operator specifies maintenance schedule
> Ideally, *even* in a world where none of the frameworks had opted in to 
> maintenance primitives, the operator would have some way of preventing 
> frameworks from scheduling further work on agents in the schedule. The 
> natural termination of the tasks in the cluster would then allow the nodes 
> to drain gracefully, after which the operator could perform maintenance.
> Two options have been discussed so far:
> # Provide a capability for frameworks to automatically filter offers with an 
> {{Unavailability}} set.
> #* Pro: Finer-grained control. Allows other frameworks to keep scheduling 
> short-lived tasks that can complete before the Unavailability.
> #* Con: All frameworks have to be updated. Consider making this an 
> environment variable to the scheduler driver for legacy frameworks.
> # Provide a flag on the master to filter all offers with an 
> {{Unavailability}} set.
> #* Pro: Immediately actionable / usable.
> #* Con: Coarse-grained. Some frameworks may lose efficiency.
> #* Con: *Dangerous*: planning out a multi-day maintenance schedule for an 
> entire cluster will prevent any frameworks from scheduling further work, 
> potentially stalling the cluster.
> Action Items: Provide further context for each option and consider others. We 
> need to ensure we have something immediately consumable by users to fill the 
> gap until maintenance primitives are the norm. We also need to ensure we 
> prevent dangerous scenarios like the Con listed for option #2.
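
The difference between the two options can be sketched roughly as follows. This is only an illustration, not Mesos code: the Offer and Unavailability classes below are simplified stand-ins (a bare start time in seconds), and filter_per_framework, filter_on_master and the task_duration parameter are hypothetical names made up for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Unavailability:
    # Simplified stand-in: start of the maintenance window, in seconds.
    start: float


@dataclass
class Offer:
    agent_id: str
    # None means the agent is not part of any maintenance schedule.
    unavailability: Optional[Unavailability] = None


def filter_per_framework(offers: List[Offer], now: float,
                         task_duration: Optional[float] = None) -> List[Offer]:
    """Option 1 (hypothetical): a framework drops offers with an
    Unavailability, unless its task is known to finish before the window."""
    kept = []
    for offer in offers:
        window = offer.unavailability
        if window is None:
            kept.append(offer)
        elif task_duration is not None and now + task_duration <= window.start:
            # Short-lived task that completes before maintenance begins.
            kept.append(offer)
    return kept


def filter_on_master(offers: List[Offer]) -> List[Offer]:
    """Option 2 (hypothetical): the master drops every offer with an
    Unavailability, for all frameworks alike."""
    return [o for o in offers if o.unavailability is None]


offers = [
    Offer("agent-1"),
    Offer("agent-2", Unavailability(start=3600.0)),
    Offer("agent-3", Unavailability(start=86400.0)),
]
```

With option 1, a framework that knows its tasks run for two hours can still use agent-3, whose window starts a day out. With option 2, every scheduled agent disappears for every framework, so a schedule covering the whole cluster leaves nothing to offer, which is the stalling scenario described in the last Con.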



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
