[
https://issues.apache.org/jira/browse/MESOS-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126305#comment-16126305
]
Huadong Liu edited comment on MESOS-7882 at 8/14/17 7:55 PM:
-------------------------------------------------------------
I was able to reproduce the problem. The test setup has two Mesos agents:
{noformat}
af584a07-7b1c-4955-861e-63585af8bb5d-S0: 10.255.55.153
af584a07-7b1c-4955-861e-63585af8bb5d-S1: 10.255.52.14
{noformat}
The modified example framework holds received offers for 30 seconds
and launches tasks only on S0.
{noformat}
diff --git a/src/examples/python/test_framework.py b/src/examples/python/test_framework.py
     def resourceOffers(self, driver, offers):
+        time.sleep(30)
         for offer in offers:
+            if 'af584a07-7b1c-4955-861e-63585af8bb5d-S1' == offer.slave_id.value:
+                print("ignore offers from af584a07-7b1c-4955-861e-63585af8bb5d-S1")
+                continue
             tasks = []
{noformat}
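For anyone who wants to exercise the same hold-and-filter logic without a running cluster, here is a minimal standalone sketch. `SlaveID`, `Offer` and `usable_offers` are hypothetical stand-ins written for this illustration; they are not part of the Mesos Python bindings or of test_framework.py.

```python
import time
from collections import namedtuple

# Hypothetical stand-ins for the Mesos protobuf messages (illustration only).
SlaveID = namedtuple("SlaveID", ["value"])
Offer = namedtuple("Offer", ["id", "slave_id"])

IGNORED_AGENT = "af584a07-7b1c-4955-861e-63585af8bb5d-S1"


def usable_offers(offers, ignored_agent=IGNORED_AGENT, hold_seconds=0,
                  sleep=time.sleep):
    """Hold the offers for hold_seconds, then drop those from ignored_agent."""
    sleep(hold_seconds)
    return [o for o in offers if o.slave_id.value != ignored_agent]


offers = [
    Offer("O153", SlaveID("af584a07-7b1c-4955-861e-63585af8bb5d-S0")),
    Offer("O154", SlaveID(IGNORED_AGENT)),  # hypothetical offer ID for S1
]
kept = usable_offers(offers)
print([o.id for o in kept])
```

Setting `hold_seconds=30` mirrors the `time.sleep(30)` in the diff; the default of 0 just keeps the sketch quick to run.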
Start test-framework and, while it is sleeping, post a maintenance schedule
for S1 from another terminal.
{noformat}
~/mesos/build$ ./src/examples/python/test-framework 10.255.52.14:5050
I0814 11:48:21.296404 4182 sched.cpp:232] Version: 1.3.0
I0814 11:48:21.301652 4222 sched.cpp:336] New master detected at
[email protected]:5050
I0814 11:48:21.302145 4222 sched.cpp:352] No credentials provided. Attempting
to register without authentication
I0814 11:48:21.306299 4224 sched.cpp:759] Framework registered with
af584a07-7b1c-4955-861e-63585af8bb5d-0014
Registered with framework ID af584a07-7b1c-4955-861e-63585af8bb5d-0014
---------------------
$ cat schedule.json
{
"windows" : [
{
"machine_ids" : [
{ "ip" : "10.255.52.14" }
],
"unavailability" : {
"start" : { "nanoseconds" : 1502734375000000000 },
"duration" : { "nanoseconds" : 3600000000000 }
}
}
]
}
$ curl http://10.255.52.14:5050/maintenance/schedule -H "Content-type: application/json" -X POST -d @schedule.json
----------------
Received offer af584a07-7b1c-4955-861e-63585af8bb5d-O153 with cpus: 3.0 and
mem: 2927.0
Launching task 0 using offer af584a07-7b1c-4955-861e-63585af8bb5d-O153
Launching task 1 using offer af584a07-7b1c-4955-861e-63585af8bb5d-O153
Launching task 2 using offer af584a07-7b1c-4955-861e-63585af8bb5d-O153
ignore offers from af584a07-7b1c-4955-861e-63585af8bb5d-S1
ignore offers from af584a07-7b1c-4955-861e-63585af8bb5d-S1
Received offer af584a07-7b1c-4955-861e-63585af8bb5d-O156 with cpus: 3.0 and
mem: 2927.0
Launching task 3 using offer af584a07-7b1c-4955-861e-63585af8bb5d-O156
Launching task 4 using offer af584a07-7b1c-4955-861e-63585af8bb5d-O156
W0814 11:49:51.406801 4218 sched.cpp:1371] Attempting to accept an unknown
offer af584a07-7b1c-4955-861e-63585af8bb5d-O153
Task 0 is in state TASK_LOST
{noformat}
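The nanosecond values in schedule.json are easy to get wrong by hand, so here is a small sketch that builds the same payload from a start time and a duration. `build_schedule` and `to_nanoseconds` are hypothetical helper names for this illustration, and the UTC start time below is simply the value in the schedule above converted back from nanoseconds.

```python
import json
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_nanoseconds(delta):
    """Convert a timedelta to whole nanoseconds using integer arithmetic."""
    return (delta // timedelta(microseconds=1)) * 1000


def build_schedule(ips, start, duration):
    """Build a one-window maintenance schedule for the given machine IPs."""
    return {
        "windows": [
            {
                "machine_ids": [{"ip": ip} for ip in ips],
                "unavailability": {
                    "start": {"nanoseconds": to_nanoseconds(start - EPOCH)},
                    "duration": {"nanoseconds": to_nanoseconds(duration)},
                },
            }
        ]
    }


schedule = build_schedule(
    ["10.255.52.14"],
    datetime(2017, 8, 14, 18, 12, 55, tzinfo=timezone.utc),
    timedelta(hours=1),
)
print(json.dumps(schedule, indent=2))
```

The printed JSON can be saved as schedule.json and posted with the curl command shown above.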
Mesos master log while this is happening is captured below:
{noformat}
I0814 11:48:21.302987 1530 master.cpp:2596] Received SUBSCRIBE call for
framework 'Test Framework (Python)' at
[email protected]:45893
I0814 11:48:21.303450 1530 master.cpp:2672] Subscribing framework Test
Framework (Python) with checkpointing enabled and capabilities [ ]
I0814 11:48:21.304566 1529 hierarchical.cpp:275] Added framework
af584a07-7b1c-4955-861e-63585af8bb5d-0014
I0814 11:48:21.306139 1530 master.cpp:6517] Sending 2 offers to framework
af584a07-7b1c-4955-861e-63585af8bb5d-0014 (Test Framework (Python)) at
[email protected]:45893
I0814 11:48:25.076035 1533 http.cpp:391] HTTP POST for
/master/maintenance/schedule from 10.255.55.153:37186 with
User-Agent='curl/7.47.0'
I0814 11:48:25.077271 1533 registrar.cpp:461] Applied 1 operations in
272915ns; attempting to update the registry
I0814 11:48:25.078277 1533 coordinator.cpp:348] Coordinator attempting to
write APPEND action at position 39
I0814 11:48:25.079033 1533 replica.cpp:537] Replica received write request for
position 39 from __req_res__(44)@10.255.52.14:5050
I0814 11:48:25.082299 1531 replica.cpp:691] Replica received learned notice
for position 39 from @0.0.0.0:0
I0814 11:48:25.085546 1531 registrar.cpp:506] Successfully updated the
registry in 8.176128ms
I0814 11:48:25.085726 1535 coordinator.cpp:348] Coordinator attempting to
write TRUNCATE action at position 40
I0814 11:48:25.086496 1528 master.cpp:5645] Removing unavailability of agent
af584a07-7b1c-4955-861e-63585af8bb5d-S1 at slave(1)@10.255.52.14:5051
(10.255.52.14)
I0814 11:48:25.086550 1530 replica.cpp:537] Replica received write request for
position 40 from __req_res__(45)@10.255.52.14:5050
I0814 11:48:25.087936 1530 replica.cpp:691] Replica received learned notice
for position 40 from @0.0.0.0:0
I0814 11:48:25.088673 1528 master.cpp:5645] Removing unavailability of agent
af584a07-7b1c-4955-861e-63585af8bb5d-S0 at slave(1)@10.255.55.153:5051
(10.255.55.153)
I0814 11:48:25.089725 1528 master.cpp:6517] Sending 1 offers to framework
af584a07-7b1c-4955-861e-63585af8bb5d-0014 (Test Framework (Python)) at
[email protected]:45893
I0814 11:48:25.090461 1529 master.cpp:6517] Sending 1 offers to framework
af584a07-7b1c-4955-861e-63585af8bb5d-0014 (Test Framework (Python)) at
[email protected]:45893
W0814 11:49:51.408465 1534 master.cpp:3494] Ignoring accept of offer
af584a07-7b1c-4955-861e-63585af8bb5d-O153 since it is no longer valid
W0814 11:49:51.408888 1534 master.cpp:3505] ACCEPT call used invalid offers '[
af584a07-7b1c-4955-861e-63585af8bb5d-O153 ]': Offer
af584a07-7b1c-4955-861e-63585af8bb5d-O153 is no longer valid
I0814 11:49:51.409276 1534 master.cpp:5772] Sending status update TASK_LOST
for task 0 of framework af584a07-7b1c-4955-861e-63585af8bb5d-0014 'Task
launched with invalid offers: Offer af584a07-7b1c-4955-861e-63585af8bb5d-O153
is no longer valid'
I0814 11:49:51.409920 1534 master.cpp:5772] Sending status update TASK_LOST
for task 1 of framework af584a07-7b1c-4955-861e-63585af8bb5d-0014 'Task
launched with invalid offers: Offer af584a07-7b1c-4955-861e-63585af8bb5d-O153
is no longer valid'
I0814 11:49:51.410332 1534 master.cpp:5772] Sending status update TASK_LOST
for task 2 of framework af584a07-7b1c-4955-861e-63585af8bb5d-0014 'Task
launched with invalid offers: Offer af584a07-7b1c-4955-861e-63585af8bb5d-O153
is no longer valid'
{noformat}
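In the log above, the master updates the unavailability of both S0 and S1, and offer O153 (from S0) is no longer valid even though only 10.255.52.14 (S1) is in the schedule. The narrower behaviour requested in the issue can be sketched as a toy model. This is not Mesos code: `offers_to_rescind` is a hypothetical function, and the offer ID used for S1 is made up for the example.

```python
def offers_to_rescind(outstanding, scheduled_ips):
    """Return the IDs of outstanding offers whose agent IP is in the schedule.

    outstanding maps offer ID -> agent IP. Only offers from machines listed
    in the maintenance schedule are selected; all others stay valid.
    """
    scheduled = set(scheduled_ips)
    return sorted(oid for oid, ip in outstanding.items() if ip in scheduled)


outstanding = {
    "O153": "10.255.55.153",  # offer from S0, as in the log above
    "O154": "10.255.52.14",   # hypothetical offer ID for S1
}
rescinded = offers_to_rescind(outstanding, ["10.255.52.14"])
print(rescinded)
```

Under this model O153 would remain valid and tasks 0 to 2 would not have been lost.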
> Mesos master rescinds all the in-flight offers from all the registered agents
> when a new maintenance schedule is posted for a subset of slaves
> ----------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: MESOS-7882
> URL: https://issues.apache.org/jira/browse/MESOS-7882
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 1.3.0
> Environment: Ubuntu 14.04 (trusty)
> Mesos master branch.
> SHA: a31dd52ab71d2a529b55cd9111ec54acf7550ded
> Reporter: Sagar Sadashiv Patwardhan
> Priority: Minor
>
> We are running Mesos 1.1.0 in production. We use a custom autoscaler to
> scale our Mesos cluster up and down. While scaling down the cluster, the
> autoscaler makes a POST request to the Mesos master's /maintenance/schedule
> endpoint with a set of slaves to move to maintenance mode. This forces the
> Mesos master to rescind all the in-flight offers from *all the slaves* in the
> cluster. If our scheduler accepts one of these offers, we get a
> TASK_LOST status update back for that task. We also see log lines like these
> (https://gist.github.com/sagar8192/8858e7cb59a23e8e1762a27571824118) in the
> Mesos master logs.
> After reading the code (ref:
> https://github.com/apache/mesos/blob/master/src/master/master.cpp#L6772), it
> appears that offers are rescinded for all the slaves. I am not sure
> what the expected behavior is here, but it would make more sense to reclaim
> resources only from the slaves marked for maintenance.
> *Experiment:*
> To verify that this is actually happening, I checked out the master branch (sha:
> a31dd52ab71d2a529b55cd9111ec54acf7550ded) and added some log
> lines (https://gist.github.com/sagar8192/42ca055720549c5ff3067b1e6c7c68b3).
> I built the binary and started a Mesos master and 2 agent processes, then used a
> basic Python framework that launches Docker containers on these slaves.
> I verified that there was no existing schedule for any slave using `curl
> 10.40.19.239:5050/maintenance/status`, then posted a maintenance schedule for
> one of the
> slaves (https://gist.github.com/sagar8192/fb65170240dd32a53f27e6985c549df0)
> after starting the framework.
> *Logs:*
> mesos-master:
> https://gist.github.com/sagar8192/91888419fdf8284e33ebd58351131203
> mesos-slave1:
> https://gist.github.com/sagar8192/3a83364b1f5ffc63902a80c728647f31
> mesos-slave2:
> https://gist.github.com/sagar8192/1b341ef2271dde11d276974a27109426
> Mesos framework:
> https://gist.github.com/sagar8192/bcd4b37dba03bde0a942b5b972004e8a
> I think Mesos should rescind offers and inverse offers only for those slaves
> that are marked for maintenance (draining mode).