Sagar Sadashiv Patwardhan created MESOS-7882:
Summary: Mesos master rescinds all the in-flight offers from all
the registered agents when a new maintenance schedule is posted for a subset of
Issue Type: Bug
Affects Versions: 1.3.0
Environment: Ubuntu 14.04 (trusty)
Mesos master branch.
Reporter: Sagar Sadashiv Patwardhan
We are running Mesos 1.1.0 in production. We use a custom autoscaler to scale
our Mesos cluster up and down. While scaling down the cluster, the autoscaler
makes a POST request to the Mesos master's /maintenance/schedule endpoint with
the set of slaves to move to maintenance mode. This forces the Mesos master to
rescind all the in-flight offers from *all the slaves* in the cluster. If our
scheduler accepts one of these offers, we get a TASK_LOST status update
back for that task. We also see log lines like these
(https://gist.github.com/sagar8192/8858e7cb59a23e8e1762a27571824118) in the
Mesos master logs.
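For reference, here is a rough sketch of the kind of request body our autoscaler sends to /maintenance/schedule. It follows the documented maintenance schedule JSON layout (windows, machine_ids, unavailability); the hostname, IP, timestamps and the helper name `build_maintenance_schedule` are made up for illustration and are not taken from our actual autoscaler.

```python
import json


def build_maintenance_schedule(machines, start_epoch_secs, duration_secs):
    """Build a maintenance schedule with a single window.

    machines: list of (hostname, ip) tuples for the slaves to drain.
    start_epoch_secs / duration_secs: whole seconds, converted to the
    nanosecond fields the schedule format expects.
    """
    return {
        "windows": [
            {
                "machine_ids": [
                    {"hostname": hostname, "ip": ip} for hostname, ip in machines
                ],
                "unavailability": {
                    "start": {"nanoseconds": start_epoch_secs * 10**9},
                    "duration": {"nanoseconds": duration_secs * 10**9},
                },
            }
        ]
    }


# Hypothetical slave and times, for illustration only.
schedule = build_maintenance_schedule(
    machines=[("agent-1.example.com", "10.0.0.11")],
    start_epoch_secs=1500000000,
    duration_secs=3600,
)

# This string is what would be sent as the POST body to /maintenance/schedule.
body = json.dumps(schedule)
```

Only the machines listed in `machine_ids` are part of the schedule, which is why rescinding offers for every slave in the cluster is surprising.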
After reading the code(refs:
appears that offers are getting rescinded for all the slaves. I am not sure
what the expected behavior is here, but it would make more sense to reclaim
resources only from the slaves marked for maintenance.
To verify that this is actually happening, I checked out the master branch (sha:
a31dd52ab71d2a529b55cd9111ec54acf7550ded) and added some log
Built the binary and started a mesos master and 2 agent processes. Used a basic
python framework that launches docker containers on these slaves. Verified that
there is no existing schedule for any slaves using `curl
10.40.19.239:5050/maintenance/status`. Posted maintenance schedule for one of
after starting the mesos framework.
I think Mesos should rescind offers and inverse offers only for those slaves
that are marked for maintenance (draining mode).
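To make the proposal concrete, here is a toy model of the selection I have in mind. This is not Mesos master code (the master is C++); the function name, the offer IDs and the agent names are all hypothetical, and it only illustrates picking offers by whether their slave is in the posted schedule.

```python
def offers_to_rescind(outstanding_offers, draining_agents):
    """Return the IDs of offers that belong to slaves being drained.

    outstanding_offers: dict mapping offer ID -> slave the offer came from.
    draining_agents: set of slaves named in the maintenance schedule.
    """
    return [
        offer_id
        for offer_id, agent in outstanding_offers.items()
        if agent in draining_agents
    ]


# Hypothetical in-flight offers across three slaves.
outstanding = {
    "offer-1": "agent-1",
    "offer-2": "agent-2",
    "offer-3": "agent-1",
    "offer-4": "agent-3",
}

# Only agent-1 is in the posted schedule, so only its offers are selected.
rescinded = offers_to_rescind(outstanding, {"agent-1"})
untouched = [o for o in outstanding if o not in rescinded]
```

With this selection, offers from agent-2 and agent-3 stay valid, so a scheduler accepting them would not get TASK_LOST.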