[
https://issues.apache.org/jira/browse/MESOS-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031854#comment-14031854
]
Nikita Vetoshkin commented on MESOS-1474:
-----------------------------------------
Just a quick note about:
{quote}
Provide a way for frameworks to be notified when resources are requested to be
relinquished
{quote}
If I'm not mistaken, John Wilkes mentioned in one of his Omega talks that
everything in Omega is "a scheduling event", i.e. an event like "this datacenter
will go down in a month for two days" is a scheduling event, and frameworks can
act on it if they want to. Maybe something similar would do the trick in
Mesos too.
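To make the idea concrete, here is a rough sketch of what "maintenance as a scheduling event" could look like. It is neither Mesos nor Omega code: `MaintenanceEvent`, `EventBus`, and `Framework` are hypothetical names invented for illustration, and the only assumption is that frameworks can register a callback and decide for themselves how to react.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Dict, FrozenSet, List


@dataclass(frozen=True)
class MaintenanceEvent:
    """Hypothetical scheduling event: these hosts go away for a time window."""
    hosts: FrozenSet[str]
    start: datetime
    duration: timedelta


class EventBus:
    """Hypothetical fan-out of scheduling events to subscribed frameworks."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[MaintenanceEvent], None]] = []

    def subscribe(self, callback: Callable[[MaintenanceEvent], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, event: MaintenanceEvent) -> None:
        # Every subscriber sees the event; nothing is killed at this point.
        for callback in self._subscribers:
            callback(event)


class Framework:
    """Hypothetical framework that tracks which host each task runs on."""

    def __init__(self, name: str, tasks: Dict[str, str]) -> None:
        self.name = name
        self.tasks = dict(tasks)  # task id -> host
        self.to_migrate: List[str] = []

    def on_event(self, event: MaintenanceEvent) -> None:
        # A framework is free to ignore the event; this one queues every
        # task on an affected host so it can be moved before the window.
        self.to_migrate = sorted(
            task for task, host in self.tasks.items() if host in event.hosts
        )


bus = EventBus()
fw = Framework("batch", {"t1": "host-a", "t2": "host-b", "t3": "host-a"})
bus.subscribe(fw.on_event)
bus.publish(
    MaintenanceEvent(
        hosts=frozenset({"host-a"}),
        start=datetime(2014, 7, 15),
        duration=timedelta(days=2),
    )
)
print(fw.to_migrate)  # ['t1', 't3']
```

The point of the sketch is only that the operator publishes intent ahead of time and each framework chooses its own response, rather than the master deciding what to kill.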
> Provide cluster maintenance primitives for operators.
> -----------------------------------------------------
>
> Key: MESOS-1474
> URL: https://issues.apache.org/jira/browse/MESOS-1474
> Project: Mesos
> Issue Type: Epic
> Components: framework, master, slave
> Reporter: Benjamin Mahler
>
> Normally cluster upgrades can be done seamlessly using the built-in slave
> recovery feature. However, there are situations where operators want to be
> able to perform destructive maintenance operations on machines:
> * Non-recoverable slave upgrades.
> * Machine reboots.
> * Kernel upgrades.
> * etc.
> In these situations, best practice is to perform rolling maintenance in large
> batches of machines. This can be problematic for frameworks when many related
> tasks are located within a batch of machines going down for maintenance.
> There are a few primitives of interest here:
> * Provide a way for operators to fully shut down a slave (killing all tasks
> underneath it).
> * Provide a way for operators to mark specific slaves as undergoing
> maintenance. This means that no more offers are being sent for these slaves,
> and no new tasks will launch on them.
> * Provide a way for frameworks to be notified when resources are requested to
> be relinquished. This gives the framework a chance to proactively move a task
> before it is forcibly killed. It also allows the automation of operations like:
> "please drain these slaves within 1 hour."
--
This message was sent by Atlassian JIRA
(v6.2#6252)