[ https://issues.apache.org/jira/browse/MESOS-6828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15771179#comment-15771179 ]

Joris Van Remoortere commented on MESOS-6828:
---------------------------------------------

An updated proposal to improve flexibility while still being easily consumable:
# Allow operators to specify a separate time, prior to the actual maintenance 
window, at which offers for the affected agents stop being sent.
# Add an opt-in capability for frameworks to continue receiving offers during 
the period described in point #1.

By controlling the period during which offers are withheld, we can stagger 
these periods according to the maintenance schedule and prevent the stalling 
scenario described in the ticket description.
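A minimal sketch of how the proposal might behave on the master side, assuming each scheduled agent carries an operator-specified stop time and each framework declares whether it opted in. `Agent`, `Framework`, `offers_stop_at`, `receives_pre_maintenance_offers`, `should_offer` and `offerable_agents` are hypothetical names for illustration, not existing Mesos APIs; times are plain seconds.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Agent:
    agent_id: str
    # Start of the agent's scheduled maintenance window (seconds), if any.
    unavailability_start: Optional[float] = None
    # Hypothetical operator-specified time at which offers for this agent
    # stop being sent; expected to be no later than unavailability_start.
    offers_stop_at: Optional[float] = None


@dataclass
class Framework:
    name: str
    # Hypothetical opt-in capability: keep receiving offers for an agent
    # after its offers_stop_at has passed.
    receives_pre_maintenance_offers: bool = False


def should_offer(agent: Agent, framework: Framework, now: float) -> bool:
    """Decide, in this sketch, whether the agent's resources are offered."""
    if agent.offers_stop_at is None or now < agent.offers_stop_at:
        # No stop time configured, or not reached yet: offers flow as
        # before (possibly still carrying an Unavailability).
        return True
    # Past the stop time: only frameworks that opted in still see offers.
    return framework.receives_pre_maintenance_offers


def offerable_agents(agents: List[Agent], framework: Framework, now: float) -> List[str]:
    """Ids of the agents whose resources would be offered to the framework."""
    return [a.agent_id for a in agents if should_offer(a, framework, now)]


if __name__ == "__main__":
    HOUR, DAY = 3600, 86400
    # A three-day schedule; each agent's offers stop 6 hours before its own window.
    agents = [
        Agent(f"agent-{d}", unavailability_start=d * DAY, offers_stop_at=d * DAY - 6 * HOUR)
        for d in (1, 2, 3)
    ]
    now = DAY - 2 * HOUR
    print(offerable_agents(agents, Framework("legacy"), now))       # prints ['agent-2', 'agent-3']
    print(offerable_agents(agents, Framework("aware", True), now))  # prints ['agent-1', 'agent-2', 'agent-3']
```

Because each stop time is set relative to that agent's own window, only agent-1 is withheld from the legacy framework; agents scheduled for later days keep accepting work, so a multi-day schedule does not stall the whole cluster at once.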

> Consider ways for frameworks to ignore offers with an Unavailability
> --------------------------------------------------------------------
>
>                 Key: MESOS-6828
>                 URL: https://issues.apache.org/jira/browse/MESOS-6828
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Joris Van Remoortere
>            Assignee: Artem Harutyunyan
>              Labels: maintenance
>
> Due to the opt-in nature of maintenance primitives in Mesos, there is a gap 
> for cluster administrators when frameworks have not opted in.
> An example case:
> - Cluster with reasonable churn (tasks terminate naturally)
> - Operator specifies maintenance schedule
> Ideally, *even* in a world where none of the frameworks had opted in to 
> maintenance primitives, the operator would have some way of preventing 
> frameworks from scheduling further work on agents in the schedule. The 
> natural termination of the tasks in the cluster would allow the nodes to 
> drain gracefully and the operator to then perform maintenance.
> Two options have been discussed so far:
> # Provide a capability for frameworks to automatically filter offers with an 
> {{Unavailability}} set.
> #* Pro: Finer-grained control. Allows other frameworks to keep scheduling 
> short-lived tasks that can complete before the Unavailability.
> #* Con: All frameworks have to be updated. Consider exposing this as an 
> environment variable read by the scheduler driver so legacy frameworks can 
> enable it.
> # Provide a flag on the master to filter all offers with an 
> {{Unavailability}} set.
> #* Pro: Immediately actionable / usable.
> #* Con: Coarse-grained. Some frameworks may lose efficiency.
> #* Con: *Dangerous*: planning out a multi-day maintenance schedule for an 
> entire cluster will prevent any frameworks from scheduling further work, 
> potentially stalling the cluster.
> Action Items: Provide further context for each option and consider others. We 
> need to ensure we have something immediately consumable by users to fill the 
> gap until maintenance primitives are the norm. We also need to ensure we 
> prevent dangerous scenarios like the Con listed for option #2.
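To contrast the two options from the ticket description, here is a small sketch assuming an offer exposes only its agent id and the start of its Unavailability. `Offer`, `framework_filter` and `master_flag_filter` are hypothetical helpers, not Mesos or scheduler-driver APIs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Offer:
    agent_id: str
    # Start of the agent's Unavailability (seconds), or None if none is set.
    unavailability_start: Optional[float] = None


def framework_filter(offers: List[Offer], now: float, task_duration: float) -> List[str]:
    """Option #1 style: a framework skips only offers whose agent becomes
    unavailable before a task of the given duration would finish."""
    return [
        o.agent_id
        for o in offers
        if o.unavailability_start is None or now + task_duration <= o.unavailability_start
    ]


def master_flag_filter(offers: List[Offer]) -> List[str]:
    """Option #2 style: drop every offer that carries an Unavailability."""
    return [o.agent_id for o in offers if o.unavailability_start is None]


if __name__ == "__main__":
    HOUR, DAY = 3600, 86400
    # A multi-day schedule that covers every agent in the cluster.
    offers = [Offer(f"agent-{d}", d * DAY) for d in (1, 2, 3)]
    print(master_flag_filter(offers))              # prints []
    print(framework_filter(offers, 0, 1 * HOUR))   # prints ['agent-1', 'agent-2', 'agent-3']
    print(framework_filter(offers, 0, 36 * HOUR))  # prints ['agent-2', 'agent-3']
```

With a schedule that touches every agent, the master-wide filter leaves nothing to schedule on, while the per-framework filter still places short tasks everywhere and long tasks only where they can finish in time.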



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
