Joris Van Remoortere created MESOS-6828:
-------------------------------------------

             Summary: Consider ways for frameworks to ignore offers with an Unavailability
                 Key: MESOS-6828
                 URL: https://issues.apache.org/jira/browse/MESOS-6828
             Project: Mesos
          Issue Type: Improvement
            Reporter: Joris Van Remoortere
            Assignee: Artem Harutyunyan


Because maintenance primitives in Mesos are opt-in, cluster administrators 
have a gap when frameworks have not opted in: those frameworks keep scheduling 
work on agents that are due for maintenance.

An example case:
- Cluster with reasonable churn (tasks terminate naturally)
- Operator specifies maintenance schedule

Ideally, *even* in a world where none of the frameworks had opted in to 
maintenance primitives, the operator would have some way of preventing 
frameworks from scheduling further work on agents in the schedule. The natural 
termination of the tasks in the cluster would then let the nodes drain 
gracefully, after which the operator could perform maintenance.

Two options have been discussed so far:
# Provide a capability for frameworks to automatically filter offers with an 
{{Unavailability}} set.
#* Pro: Finer-grained control. Allows other frameworks to keep scheduling 
short-lived tasks that can complete before the {{Unavailability}}.
#* Con: All frameworks have to be updated. For legacy frameworks, consider 
exposing this as an environment variable read by the scheduler driver.
# Provide a flag on the master to filter all offers with an {{Unavailability}} 
set.
#* Pro: Immediately actionable / usable.
#* Con: Coarse-grained. Some frameworks may lose scheduling efficiency.
#* Con: *Dangerous*: planning out a multi-day maintenance schedule for an 
entire cluster would prevent every framework from scheduling further work, 
potentially stalling the cluster.
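
To make the difference between the two options concrete, here is a minimal Python sketch, not Mesos code. It assumes offers arrive as plain dicts shaped like the v1 API JSON, with an optional {{unavailability}} field whose {{start}} carries integer {{nanoseconds}} since the epoch. The helper names ({{filter_all}}, {{filter_for_task}}) and the sample offers are hypothetical and exist only for illustration.

```python
from typing import List, Optional

HOUR_NS = 3600 * 10**9  # one hour in nanoseconds


def unavailability_start_ns(offer: dict) -> Optional[int]:
    """Return the start of the offer's unavailability (ns since epoch), or None."""
    unavailability = offer.get("unavailability")
    if unavailability is None:
        return None
    return unavailability["start"]["nanoseconds"]


def filter_all(offers: List[dict]) -> List[dict]:
    """Coarse filtering (option 2 style): drop every offer with an unavailability."""
    return [o for o in offers if unavailability_start_ns(o) is None]


def filter_for_task(offers: List[dict], now_ns: int, task_duration_ns: int) -> List[dict]:
    """Finer filtering (option 1 style): keep an offer when the task is
    expected to finish before the unavailability starts."""
    kept = []
    for offer in offers:
        start_ns = unavailability_start_ns(offer)
        if start_ns is None or now_ns + task_duration_ns <= start_ns:
            kept.append(offer)
    return kept


# Hypothetical sample data: one agent with no maintenance scheduled, one
# going down in 2 hours, one going down in 48 hours.
now_ns = 0
offers = [
    {"id": {"value": "o1"}},
    {"id": {"value": "o2"}, "unavailability": {"start": {"nanoseconds": 2 * HOUR_NS}}},
    {"id": {"value": "o3"}, "unavailability": {"start": {"nanoseconds": 48 * HOUR_NS}}},
]

coarse = filter_all(offers)
short_task = filter_for_task(offers, now_ns, 1 * HOUR_NS)
long_task = filter_for_task(offers, now_ns, 24 * HOUR_NS)
```

With this sample data the coarse filter keeps only the offer without an {{Unavailability}}, while the finer filter still lets a one-hour task use all three offers. If every agent in the cluster were in the schedule, the coarse filter would return nothing at all, which is the stall described in the last Con.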

Action Items: Provide further context for each option and consider others. We 
need something immediately consumable by users to fill the gap until 
maintenance primitives are the norm. We also need to prevent dangerous 
scenarios like the last Con listed for option #2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
