[ https://issues.apache.org/jira/browse/MESOS-8463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354281#comment-16354281 ]
Till Toenshoff commented on MESOS-8463: --------------------------------------- The first thing that sticks out when reading the logs is the fact that we see two scheduled re-registration attempts. The trigger here is the lack of a lower boundary for agent re-registration delay. The first DoReliableRegistration re-registration delay log; {noformat} I0118 21:44:45.659322 1093 slave.cpp:1673] Will retry registration in 925295ns if necessary {noformat} Just as scheduled we see another log telling us that we hit {{doReliableRegistration}} again about 0.001 seconds later. {noformat} I0118 21:44:45.660614 1105 slave.cpp:1673] Will retry registration in 2.182634ms if necessary {noformat} The first registration now succeeds with all bells and whistles - this is S0. {noformat} I0118 21:44:45.659622 1103 master.cpp:6233] Registering agent at slave(144)@172.16.10.65:40262 (ip-172-16-10-65.ec2.internal) with id 1eb6ab2c-293d-4b99-b76b-87bd939a1a19-S0 {noformat} Next we see an {{addSlave}} on the allocator which will satisfy the expectation from our test for the first time (which is entirely). {noformat} I0118 21:44:45.660295 1093 hierarchical.cpp:574] Added agent 1eb6ab2c-293d-4b99-b76b-87bd939a1a19-S0 (ip-172-16-10-65.ec2.internal) with cpus:2; mem:1024; ports:[31000-32000] (allocated: {}) {noformat} The rest of our test now quickly gets processed until it is done - scheduler and the slave are shutting down. {noformat} I0118 21:44:45.661115 1071 sched.cpp:2009] Asked to stop the driver I0118 21:44:45.661146 1097 sched.cpp:1191] Stopping framework 1eb6ab2c-293d-4b99-b76b-87bd939a1a19-0000 I0118 21:44:45.661387 1071 slave.cpp:931] Agent terminating {noformat} We still have that second reregistration in progress and that succeeds eventually as it came without an ID - we assign a new slaveID in the master - this is S1 {noformat} I0118 21:44:45.664952 1107 master.cpp:6233] Registering agent at slave(144)@172.16.10.65:40262 (ip-172-16-10-65.ec2.internal) with id 1eb6ab2c-293d-4b99-b76b-87bd939a1a19-S1 {noformat} The entire chain of events continues until we finally reach {{addSlave}} on the allocator again which will oversaturate our expectation now. The lack of a re-registration delay minimum triggers the problem. The problem should only manifest in tests with expectations that do not match the results of a possibly tiny delay between re-registrations. > Test MasterAllocatorTest/1.SingleFramework is flaky > --------------------------------------------------- > > Key: MESOS-8463 > URL: https://issues.apache.org/jira/browse/MESOS-8463 > Project: Mesos > Issue Type: Bug > Components: allocation, test > Affects Versions: 1.5.0 > Reporter: Benjamin Bannier > Assignee: Till Toenshoff > Priority: Major > Labels: flaky-test > Attachments: consoleText.txt > > > Observed in our internal CI on a ubuntu-16 setup in a plain autotools build, > {noformat} > ../../src/tests/master_allocator_tests.cpp:175 > Mock function called more times than expected - taking default action > specified at: > ../../src/tests/allocator.hpp:273: > Function call: addSlave(@0x7fe8dc03d0e8 > 1eb6ab2c-293d-4b99-b76b-87bd939a1a19-S1, @0x7fe8dc03d108 hostname: > "ip-172-16-10-65.ec2.internal" > resources { > name: "cpus" > type: SCALAR > scalar { > value: 2 > } > } > resources { > name: "mem" > type: SCALAR > scalar { > value: 1024 > } > } > resources { > name: "ports" > type: RANGES > ranges { > range { > begin: 31000 > end: 32000 > } > } > } > id { > value: "1eb6ab2c-293d-4b99-b76b-87bd939a1a19-S1" > } > checkpoint: true > port: 40262 > , @0x7fe8ffa276c0 { 32-byte object <48-94 7D-0E E9-7F 00-00 00-00 00-00 00-00 > 00-00 01-00 00-00 00-00 00-00 01-00 00-00 00-00 00-00>, 32-byte object <48-94 > 7D-0E E9-7F 00-00 00-00 00-00 00-00 00-00 01-00 00-00 00-00 00-00 02-00 00-00 > 00-00 00-00>, 32-byte object <48-94 7D-0E E9-7F 00-00 00-00 00-00 00-00 00-00 > 01-00 00-00 00-00 00-00 03-00 00-00 73-79 73-74> }, @0x7fe8ffa27720 48-byte > object <01-00 00-00 E8-7F 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 > 00-00 08-7A A2-FF E8-7F 00-00 A0-32 24-7D 62-55 00-00 DE-3C 11-0A E9-7F > 00-00>, @0x7fe8dc03d4c8 { cpus:2, mem:1024, ports:[31000-32000] }, > @0x7fe8dc03d460 {}) > Expected: to be called once > Actual: called twice - over-saturated and active > Stacktrace > ../../src/tests/master_allocator_tests.cpp:175 > Mock function called more times than expected - taking default action > specified at: > ../../src/tests/allocator.hpp:273: > Function call: addSlave(@0x7fe8dc03d0e8 > 1eb6ab2c-293d-4b99-b76b-87bd939a1a19-S1, @0x7fe8dc03d108 hostname: > "ip-172-16-10-65.ec2.internal" > resources { > name: "cpus" > type: SCALAR > scalar { > value: 2 > } > } > resources { > name: "mem" > type: SCALAR > scalar { > value: 1024 > } > } > resources { > name: "ports" > type: RANGES > ranges { > range { > begin: 31000 > end: 32000 > } > } > } > id { > value: "1eb6ab2c-293d-4b99-b76b-87bd939a1a19-S1" > } > checkpoint: true > port: 40262 > , @0x7fe8ffa276c0 { 32-byte object <48-94 7D-0E E9-7F 00-00 00-00 00-00 00-00 > 00-00 01-00 00-00 00-00 00-00 01-00 00-00 00-00 00-00>, 32-byte object <48-94 > 7D-0E E9-7F 00-00 00-00 00-00 00-00 00-00 01-00 00-00 00-00 00-00 02-00 00-00 > 00-00 00-00>, 32-byte object <48-94 7D-0E E9-7F 00-00 00-00 00-00 00-00 00-00 > 01-00 00-00 00-00 00-00 03-00 00-00 73-79 73-74> }, @0x7fe8ffa27720 48-byte > object <01-00 00-00 E8-7F 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 > 00-00 08-7A A2-FF E8-7F 00-00 A0-32 24-7D 62-55 00-00 DE-3C 11-0A E9-7F > 00-00>, @0x7fe8dc03d4c8 { cpus:2, mem:1024, ports:[31000-32000] }, > @0x7fe8dc03d460 {}) > Expected: to be called once > Actual: called twice - over-saturated and active > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)