[ 
https://issues.apache.org/jira/browse/MESOS-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sadashiv Patwardhan updated MESOS-7882:
---------------------------------------------
    Description: 
We are running Mesos 1.1.0 in production. We use a custom autoscaler to scale 
our Mesos cluster up and down. While scaling down the cluster, the autoscaler 
makes a POST request to the Mesos master /maintenance/schedule endpoint with a 
set of slaves to move into maintenance mode. This causes the Mesos master to 
rescind all in-flight offers from *all the slaves* in the cluster, not just the 
ones in the posted schedule. If our scheduler accepts one of these offers, we 
get a TASK_LOST status update back for that task. We also see log lines like 
these (https://gist.github.com/sagar8192/8858e7cb59a23e8e1762a27571824118) in 
the Mesos master logs.

After reading the code (ref: 
https://github.com/apache/mesos/blob/master/src/master/master.cpp#L6772), it 
appears that offers are rescinded for all slaves whenever a new schedule is 
posted. I am not sure what the expected behavior is here, but it would make 
more sense to reclaim only the resources of the slaves marked for maintenance.

*Experiment:*
To verify that this is actually happening, I checked out the master branch 
(SHA: a31dd52ab71d2a529b55cd9111ec54acf7550ded) and added some log lines 
(https://gist.github.com/sagar8192/42ca055720549c5ff3067b1e6c7c68b3). I built 
the binary and started a Mesos master and two agent processes, and used a basic 
Python framework that launches Docker containers on these slaves. I verified 
that no maintenance schedule existed for any slave using `curl 
10.40.19.239:5050/maintenance/status`, then posted a maintenance schedule for 
one of the slaves 
(https://gist.github.com/sagar8192/fb65170240dd32a53f27e6985c549df0) after 
starting the framework.
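For reference, the two HTTP calls used in the experiment can be sketched as 
below. This is a minimal standard-library sketch, not the actual framework 
code: the master address is the one from the experiment, the hostname/IP values 
are placeholders, and the payload follows the documented Mesos 
maintenance-primitives JSON shape (windows of machine_ids, each with an 
unavailability start/duration in nanoseconds).

```python
import json
import urllib.request

MASTER = "http://10.40.19.239:5050"  # master address used in the experiment


def build_schedule(hostname, ip, start_ns, duration_ns):
    """Build a /maintenance/schedule payload for a single agent."""
    return {
        "windows": [{
            "machine_ids": [{"hostname": hostname, "ip": ip}],
            "unavailability": {
                "start": {"nanoseconds": start_ns},
                "duration": {"nanoseconds": duration_ns},
            },
        }]
    }


def post_schedule(schedule, master=MASTER):
    # Equivalent to: curl -X POST <master>/maintenance/schedule -d @schedule.json
    req = urllib.request.Request(
        master + "/maintenance/schedule",
        data=json.dumps(schedule).encode(),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)


def get_status(master=MASTER):
    # Equivalent to: curl <master>/maintenance/status
    with urllib.request.urlopen(master + "/maintenance/status") as resp:
        return json.loads(resp.read())
```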

*Logs:*
mesos-master: https://gist.github.com/sagar8192/91888419fdf8284e33ebd58351131203
mesos-slave1: https://gist.github.com/sagar8192/3a83364b1f5ffc63902a80c728647f31
mesos-slave2: https://gist.github.com/sagar8192/1b341ef2271dde11d276974a27109426
Mesos framework: 
https://gist.github.com/sagar8192/bcd4b37dba03bde0a942b5b972004e8a

I think Mesos should rescind offers and inverse offers only for the slaves that 
are marked for maintenance (draining mode).
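The suggested behavior, rescinding only offers that belong to machines in the 
newly posted schedule, could be sketched like this. This is illustrative 
Python, not the master's C++; the offer tuples and machine-id pairs are 
hypothetical shapes standing in for the master's internal structures.

```python
def offers_to_rescind(offers, scheduled_machines):
    """Select only the offers whose agent is in the posted schedule.

    offers: list of (offer_id, hostname, ip) tuples (hypothetical shape).
    scheduled_machines: set of (hostname, ip) pairs from the schedule.
    """
    return [offer_id
            for (offer_id, hostname, ip) in offers
            if (hostname, ip) in scheduled_machines]
```

Offers from agents outside the schedule would be left intact, avoiding the 
spurious TASK_LOST updates described above.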


> Mesos master rescinds all the in-flight offers from all the registered agents 
> when a new maintenance schedule is posted for a subset of slaves
> ----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-7882
>                 URL: https://issues.apache.org/jira/browse/MESOS-7882
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.3.0
>         Environment: Ubuntu 14.04 (trusty)
> Mesos master branch.
> SHA: a31dd52ab71d2a529b55cd9111ec54acf7550ded
>            Reporter: Sagar Sadashiv Patwardhan
>            Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
