[ 
https://issues.apache.org/jira/browse/MESOS-8463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354281#comment-16354281
 ] 

Till Toenshoff commented on MESOS-8463:
---------------------------------------

The first thing that sticks out when reading the logs is the fact that we see 
two scheduled re-registration attempts. The trigger here is the lack of a lower 
boundary for agent re-registration delay. 

The first DoReliableRegistration re-registration delay log;
{noformat}
I0118 21:44:45.659322  1093 slave.cpp:1673] Will retry registration in 925295ns 
if necessary
{noformat}

Just as scheduled we see another log telling us that we hit 
{{doReliableRegistration}} again about 0.001 seconds later.
{noformat}
I0118 21:44:45.660614  1105 slave.cpp:1673] Will retry registration in 
2.182634ms if necessary
{noformat}

The first registration now succeeds with all bells and whistles - this is S0. 
{noformat}
I0118 21:44:45.659622  1103 master.cpp:6233] Registering agent at 
slave(144)@172.16.10.65:40262 (ip-172-16-10-65.ec2.internal) with id 
1eb6ab2c-293d-4b99-b76b-87bd939a1a19-S0
{noformat}

Next we see an {{addSlave}} on the allocator which will satisfy the expectation 
from our test for the first time (which is entirely).
{noformat}
I0118 21:44:45.660295  1093 hierarchical.cpp:574] Added agent 
1eb6ab2c-293d-4b99-b76b-87bd939a1a19-S0 (ip-172-16-10-65.ec2.internal) with 
cpus:2; mem:1024; ports:[31000-32000] (allocated: {})
{noformat}

The rest of our test now quickly gets processed until it is done - scheduler 
and the slave are shutting down.
{noformat}
I0118 21:44:45.661115  1071 sched.cpp:2009] Asked to stop the driver
I0118 21:44:45.661146  1097 sched.cpp:1191] Stopping framework 
1eb6ab2c-293d-4b99-b76b-87bd939a1a19-0000
I0118 21:44:45.661387  1071 slave.cpp:931] Agent terminating
{noformat} 

We still have that second reregistration in progress and that succeeds 
eventually as it came without an ID - we assign a new slaveID in the master - 
this is S1
{noformat}
I0118 21:44:45.664952  1107 master.cpp:6233] Registering agent at 
slave(144)@172.16.10.65:40262 (ip-172-16-10-65.ec2.internal) with id 
1eb6ab2c-293d-4b99-b76b-87bd939a1a19-S1
{noformat}

The entire chain of events continues until we finally reach {{addSlave}} on the 
allocator again which will oversaturate our expectation now.

The lack of a re-registration delay minimum triggers the problem. The problem 
should only manifest in tests with expectations that do not match the results 
of a possibly tiny delay between re-registrations.

> Test MasterAllocatorTest/1.SingleFramework is flaky
> ---------------------------------------------------
>
>                 Key: MESOS-8463
>                 URL: https://issues.apache.org/jira/browse/MESOS-8463
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation, test
>    Affects Versions: 1.5.0
>            Reporter: Benjamin Bannier
>            Assignee: Till Toenshoff
>            Priority: Major
>              Labels: flaky-test
>         Attachments: consoleText.txt
>
>
> Observed in our internal CI on a ubuntu-16 setup in a plain autotools build,
> {noformat}
> ../../src/tests/master_allocator_tests.cpp:175
> Mock function called more times than expected - taking default action 
> specified at:
> ../../src/tests/allocator.hpp:273:
>     Function call: addSlave(@0x7fe8dc03d0e8 
> 1eb6ab2c-293d-4b99-b76b-87bd939a1a19-S1, @0x7fe8dc03d108 hostname: 
> "ip-172-16-10-65.ec2.internal"
> resources {
>   name: "cpus"
>   type: SCALAR
>   scalar {
>     value: 2
>   }
> }
> resources {
>   name: "mem"
>   type: SCALAR
>   scalar {
>     value: 1024
>   }
> }
> resources {
>   name: "ports"
>   type: RANGES
>   ranges {
>     range {
>       begin: 31000
>       end: 32000
>     }
>   }
> }
> id {
>   value: "1eb6ab2c-293d-4b99-b76b-87bd939a1a19-S1"
> }
> checkpoint: true
> port: 40262
> , @0x7fe8ffa276c0 { 32-byte object <48-94 7D-0E E9-7F 00-00 00-00 00-00 00-00 
> 00-00 01-00 00-00 00-00 00-00 01-00 00-00 00-00 00-00>, 32-byte object <48-94 
> 7D-0E E9-7F 00-00 00-00 00-00 00-00 00-00 01-00 00-00 00-00 00-00 02-00 00-00 
> 00-00 00-00>, 32-byte object <48-94 7D-0E E9-7F 00-00 00-00 00-00 00-00 00-00 
> 01-00 00-00 00-00 00-00 03-00 00-00 73-79 73-74> }, @0x7fe8ffa27720 48-byte 
> object <01-00 00-00 E8-7F 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00 08-7A A2-FF E8-7F 00-00 A0-32 24-7D 62-55 00-00 DE-3C 11-0A E9-7F 
> 00-00>, @0x7fe8dc03d4c8 { cpus:2, mem:1024, ports:[31000-32000] }, 
> @0x7fe8dc03d460 {})
>          Expected: to be called once
>            Actual: called twice - over-saturated and active
> Stacktrace
> ../../src/tests/master_allocator_tests.cpp:175
> Mock function called more times than expected - taking default action 
> specified at:
> ../../src/tests/allocator.hpp:273:
>     Function call: addSlave(@0x7fe8dc03d0e8 
> 1eb6ab2c-293d-4b99-b76b-87bd939a1a19-S1, @0x7fe8dc03d108 hostname: 
> "ip-172-16-10-65.ec2.internal"
> resources {
>   name: "cpus"
>   type: SCALAR
>   scalar {
>     value: 2
>   }
> }
> resources {
>   name: "mem"
>   type: SCALAR
>   scalar {
>     value: 1024
>   }
> }
> resources {
>   name: "ports"
>   type: RANGES
>   ranges {
>     range {
>       begin: 31000
>       end: 32000
>     }
>   }
> }
> id {
>   value: "1eb6ab2c-293d-4b99-b76b-87bd939a1a19-S1"
> }
> checkpoint: true
> port: 40262
> , @0x7fe8ffa276c0 { 32-byte object <48-94 7D-0E E9-7F 00-00 00-00 00-00 00-00 
> 00-00 01-00 00-00 00-00 00-00 01-00 00-00 00-00 00-00>, 32-byte object <48-94 
> 7D-0E E9-7F 00-00 00-00 00-00 00-00 00-00 01-00 00-00 00-00 00-00 02-00 00-00 
> 00-00 00-00>, 32-byte object <48-94 7D-0E E9-7F 00-00 00-00 00-00 00-00 00-00 
> 01-00 00-00 00-00 00-00 03-00 00-00 73-79 73-74> }, @0x7fe8ffa27720 48-byte 
> object <01-00 00-00 E8-7F 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00 08-7A A2-FF E8-7F 00-00 A0-32 24-7D 62-55 00-00 DE-3C 11-0A E9-7F 
> 00-00>, @0x7fe8dc03d4c8 { cpus:2, mem:1024, ports:[31000-32000] }, 
> @0x7fe8dc03d460 {})
>          Expected: to be called once
>            Actual: called twice - over-saturated and active
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to