[
https://issues.apache.org/jira/browse/MESOS-8463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354281#comment-16354281
]
Till Toenshoff edited comment on MESOS-8463 at 2/7/18 12:11 AM:
----------------------------------------------------------------
The first thing that sticks out when reading the logs is the fact that we see
two scheduled re-registration attempts. The trigger here is the lack of a lower
boundary for agent re-registration delay.
The first DoReliableRegistration re-registration delay log entry tells us that
we should just have sent a registration message to the master - our first one.
{noformat}
I0118 21:44:45.659322 1093 slave.cpp:1673] Will retry registration in 925295ns
if necessary
{noformat}
Just as scheduled we see another log telling us that we hit
{{doReliableRegistration}} again about 0.001 seconds later. We now sent the
second registration request.
{noformat}
I0118 21:44:45.660614 1105 slave.cpp:1673] Will retry registration in
2.182634ms if necessary
{noformat}
The first registration now succeeds with all bells and whistles - this is S0.
{noformat}
I0118 21:44:45.659622 1103 master.cpp:6233] Registering agent at
slave(144)@172.16.10.65:40262 (ip-172-16-10-65.ec2.internal) with id
1eb6ab2c-293d-4b99-b76b-87bd939a1a19-S0
{noformat}
Next we see an {{addSlave}} on the allocator which will satisfy the expectation
from our test for the first time (which is entirely).
{noformat}
I0118 21:44:45.660295 1093 hierarchical.cpp:574] Added agent
1eb6ab2c-293d-4b99-b76b-87bd939a1a19-S0 (ip-172-16-10-65.ec2.internal) with
cpus:2; mem:1024; ports:[31000-32000] (allocated: {})
{noformat}
The rest of our test now quickly gets processed until it is done - scheduler
and the slave are shutting down.
{noformat}
I0118 21:44:45.661115 1071 sched.cpp:2009] Asked to stop the driver
I0118 21:44:45.661146 1097 sched.cpp:1191] Stopping framework
1eb6ab2c-293d-4b99-b76b-87bd939a1a19-0000
I0118 21:44:45.661387 1071 slave.cpp:931] Agent terminating
{noformat}
We still have that second re-registration in progress and that succeeds
eventually. It came without a slaveID so we assign a new slaveID in the master
- this now is a new slave, S1.
{noformat}
I0118 21:44:45.664952 1107 master.cpp:6233] Registering agent at
slave(144)@172.16.10.65:40262 (ip-172-16-10-65.ec2.internal) with id
1eb6ab2c-293d-4b99-b76b-87bd939a1a19-S1
{noformat}
The entire chain of events continues until we finally reach {{addSlave}} on the
allocator again which will oversaturate our expectation now.
The lack of a re-registration delay minimum triggers the problem in combination
with a strict expectation on registration-chained events of this test (and
likely others as well).
was (Author: tillt):
The first thing that sticks out when reading the logs is the fact that we see
two scheduled re-registration attempts. The trigger here is the lack of a lower
boundary for agent re-registration delay.
The first DoReliableRegistration re-registration delay log;
{noformat}
I0118 21:44:45.659322 1093 slave.cpp:1673] Will retry registration in 925295ns
if necessary
{noformat}
Just as scheduled we see another log telling us that we hit
{{doReliableRegistration}} again about 0.001 seconds later.
{noformat}
I0118 21:44:45.660614 1105 slave.cpp:1673] Will retry registration in
2.182634ms if necessary
{noformat}
The first registration now succeeds with all bells and whistles - this is S0.
{noformat}
I0118 21:44:45.659622 1103 master.cpp:6233] Registering agent at
slave(144)@172.16.10.65:40262 (ip-172-16-10-65.ec2.internal) with id
1eb6ab2c-293d-4b99-b76b-87bd939a1a19-S0
{noformat}
Next we see an {{addSlave}} on the allocator which will satisfy the expectation
from our test for the first time (which is entirely).
{noformat}
I0118 21:44:45.660295 1093 hierarchical.cpp:574] Added agent
1eb6ab2c-293d-4b99-b76b-87bd939a1a19-S0 (ip-172-16-10-65.ec2.internal) with
cpus:2; mem:1024; ports:[31000-32000] (allocated: {})
{noformat}
The rest of our test now quickly gets processed until it is done - scheduler
and the slave are shutting down.
{noformat}
I0118 21:44:45.661115 1071 sched.cpp:2009] Asked to stop the driver
I0118 21:44:45.661146 1097 sched.cpp:1191] Stopping framework
1eb6ab2c-293d-4b99-b76b-87bd939a1a19-0000
I0118 21:44:45.661387 1071 slave.cpp:931] Agent terminating
{noformat}
We still have that second reregistration in progress and that succeeds
eventually as it came without an ID - we assign a new slaveID in the master -
this is S1
{noformat}
I0118 21:44:45.664952 1107 master.cpp:6233] Registering agent at
slave(144)@172.16.10.65:40262 (ip-172-16-10-65.ec2.internal) with id
1eb6ab2c-293d-4b99-b76b-87bd939a1a19-S1
{noformat}
The entire chain of events continues until we finally reach {{addSlave}} on the
allocator again which will oversaturate our expectation now.
The lack of a re-registration delay minimum triggers the problem. The problem
should only manifest in tests with expectations that do not match the results
of a possibly tiny delay between re-registrations.
> Test MasterAllocatorTest/1.SingleFramework is flaky
> ---------------------------------------------------
>
> Key: MESOS-8463
> URL: https://issues.apache.org/jira/browse/MESOS-8463
> Project: Mesos
> Issue Type: Bug
> Components: allocation, test
> Affects Versions: 1.5.0
> Reporter: Benjamin Bannier
> Assignee: Till Toenshoff
> Priority: Major
> Labels: flaky-test
> Attachments: consoleText.txt
>
>
> Observed in our internal CI on a ubuntu-16 setup in a plain autotools build,
> {noformat}
> ../../src/tests/master_allocator_tests.cpp:175
> Mock function called more times than expected - taking default action
> specified at:
> ../../src/tests/allocator.hpp:273:
> Function call: addSlave(@0x7fe8dc03d0e8
> 1eb6ab2c-293d-4b99-b76b-87bd939a1a19-S1, @0x7fe8dc03d108 hostname:
> "ip-172-16-10-65.ec2.internal"
> resources {
> name: "cpus"
> type: SCALAR
> scalar {
> value: 2
> }
> }
> resources {
> name: "mem"
> type: SCALAR
> scalar {
> value: 1024
> }
> }
> resources {
> name: "ports"
> type: RANGES
> ranges {
> range {
> begin: 31000
> end: 32000
> }
> }
> }
> id {
> value: "1eb6ab2c-293d-4b99-b76b-87bd939a1a19-S1"
> }
> checkpoint: true
> port: 40262
> , @0x7fe8ffa276c0 { 32-byte object <48-94 7D-0E E9-7F 00-00 00-00 00-00 00-00
> 00-00 01-00 00-00 00-00 00-00 01-00 00-00 00-00 00-00>, 32-byte object <48-94
> 7D-0E E9-7F 00-00 00-00 00-00 00-00 00-00 01-00 00-00 00-00 00-00 02-00 00-00
> 00-00 00-00>, 32-byte object <48-94 7D-0E E9-7F 00-00 00-00 00-00 00-00 00-00
> 01-00 00-00 00-00 00-00 03-00 00-00 73-79 73-74> }, @0x7fe8ffa27720 48-byte
> object <01-00 00-00 E8-7F 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00
> 00-00 08-7A A2-FF E8-7F 00-00 A0-32 24-7D 62-55 00-00 DE-3C 11-0A E9-7F
> 00-00>, @0x7fe8dc03d4c8 { cpus:2, mem:1024, ports:[31000-32000] },
> @0x7fe8dc03d460 {})
> Expected: to be called once
> Actual: called twice - over-saturated and active
> Stacktrace
> ../../src/tests/master_allocator_tests.cpp:175
> Mock function called more times than expected - taking default action
> specified at:
> ../../src/tests/allocator.hpp:273:
> Function call: addSlave(@0x7fe8dc03d0e8
> 1eb6ab2c-293d-4b99-b76b-87bd939a1a19-S1, @0x7fe8dc03d108 hostname:
> "ip-172-16-10-65.ec2.internal"
> resources {
> name: "cpus"
> type: SCALAR
> scalar {
> value: 2
> }
> }
> resources {
> name: "mem"
> type: SCALAR
> scalar {
> value: 1024
> }
> }
> resources {
> name: "ports"
> type: RANGES
> ranges {
> range {
> begin: 31000
> end: 32000
> }
> }
> }
> id {
> value: "1eb6ab2c-293d-4b99-b76b-87bd939a1a19-S1"
> }
> checkpoint: true
> port: 40262
> , @0x7fe8ffa276c0 { 32-byte object <48-94 7D-0E E9-7F 00-00 00-00 00-00 00-00
> 00-00 01-00 00-00 00-00 00-00 01-00 00-00 00-00 00-00>, 32-byte object <48-94
> 7D-0E E9-7F 00-00 00-00 00-00 00-00 00-00 01-00 00-00 00-00 00-00 02-00 00-00
> 00-00 00-00>, 32-byte object <48-94 7D-0E E9-7F 00-00 00-00 00-00 00-00 00-00
> 01-00 00-00 00-00 00-00 03-00 00-00 73-79 73-74> }, @0x7fe8ffa27720 48-byte
> object <01-00 00-00 E8-7F 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00
> 00-00 08-7A A2-FF E8-7F 00-00 A0-32 24-7D 62-55 00-00 DE-3C 11-0A E9-7F
> 00-00>, @0x7fe8dc03d4c8 { cpus:2, mem:1024, ports:[31000-32000] },
> @0x7fe8dc03d460 {})
> Expected: to be called once
> Actual: called twice - over-saturated and active
> {noformat}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)