[ https://issues.apache.org/jira/browse/MESOS-7997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16194534#comment-16194534 ]

Alexander Rukletsov commented on MESOS-7997:
--------------------------------------------

I was able to reproduce an instance of this issue locally with some extra logs 
added. As we can see below, sometimes the {{updateAvailable}} call can sneak 
in between and lead to an offer without the persistent volume, even though we 
know the volume has been created. Compare the offer from a "bad" run:
{noformat}
[{"allocation_info":{"role":"role1"},"name":"disk","reservations":[{"role":"role1","type":"STATIC"}],"scalar":{"value":1024.0},"type":"SCALAR"},{"allocation_info":{"role":"role1"},"name":"cpus","scalar":{"value":8.0},"type":"SCALAR"},{"allocation_info":{"role":"role1"},"name":"mem","scalar":{"value":15360.0},"type":"SCALAR"},{"allocation_info":{"role":"role1"},"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"type":"RANGES"}]
Allocated: disk(allocated: role1)(reservations: [(STATIC,role1)])[id1:path1]:64
{noformat}
and from a "good" run:
{noformat}
[{"allocation_info":{"role":"role1"},"name":"disk","reservations":[{"role":"role1","type":"STATIC"}],"scalar":{"value":960.0},"type":"SCALAR"},{"allocation_info":{"role":"role1"},"name":"cpus","scalar":{"value":8.0},"type":"SCALAR"},{"allocation_info":{"role":"role1"},"name":"mem","scalar":{"value":15360.0},"type":"SCALAR"},{"allocation_info":{"role":"role1"},"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"type":"RANGES"},{"allocation_info":{"role":"role1"},"disk":{"persistence":{"id":"id1","principal":"test-principal"},"volume":{"container_path":"path1","mode":"RW"}},"name":"disk","reservations":[{"role":"role1","type":"STATIC"}],"scalar":{"value":64.0},"type":"SCALAR"}]
Allocated: disk(allocated: role1)(reservations: [(STATIC,role1)])[id1:path1]:64
{noformat}

A closer look at a "bad" run reveals a race. The agent [sends an 
update|https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L1260-L1265]
 with its resources after it has registered with the master, and this update 
arrives at the master _after_ the persistent volume has been created. We can 
fix the test by waiting for the {{UpdateSlaveMessage}} to arrive before 
requesting a volume (see the sketch after the log below), but I think we 
should fix the race in the agent-master communication itself. The race was 
likely introduced in 
[2af9a5b07dc80151154264e974d03f56a1c25838|https://github.com/apache/mesos/commit/2af9a5b07dc80151154264e974d03f56a1c25838]:
{noformat}
I1005 17:30:05.343058 2138112 hierarchical.cpp:593] Added agent 
c13b88fc-9b34-4d5e-9641-e4faec3190f1-S0 (localhost) with disk(reservations: 
[(STATIC,role1)]):1024; cpus:8; mem:15360; ports:[31000-32000] (allocated: {})
I1005 17:30:05.343085 528384 status_update_manager.cpp:184] Resuming sending 
status updates
I1005 17:30:05.342851 4284416 master.cpp:6032] Registered agent 
c13b88fc-9b34-4d5e-9641-e4faec3190f1-S0 at slave(459)@127.0.0.1:60330 
(localhost) with 
[{"name":"disk","reservations":[{"role":"role1","type":"STATIC"}],"scalar":{"value":1024.0},"type":"SCALAR"},{"name":"cpus","scalar":{"value":8.0},"type":"SCALAR"},{"name":"mem","scalar":{"value":15360.0},"type":"SCALAR"},{"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"type":"RANGES"}]
I1005 17:30:05.343230 2138112 hierarchical.cpp:1945] No allocations performed
I1005 17:30:05.343252 2138112 hierarchical.cpp:1488] Performed allocation for 1 
agents in 94us
I1005 17:30:05.344882 3747840 process.cpp:3929] Handling HTTP event for process 
'master' with path: '/master/api/v1'
I1005 17:30:05.345293 2674688 slave.cpp:1213] Checkpointing SlaveInfo to 
'/var/folders/h3/8j18s1cx2bn78ms99d3lz4jh0000gq/T/ContentType_MasterAPITest_CreateAndDestroyVolumes_1_RWSUpI/meta/slaves/c13b88fc-9b34-4d5e-9641-e4faec3190f1-S0/slave.info'
I1005 17:30:05.345422 2138112 http.cpp:1185] HTTP POST for /master/api/v1 from 
127.0.0.1:61576
I1005 17:30:05.345844 2138112 http.cpp:673] Processing call CREATE_VOLUMES
I1005 17:30:05.346055 2138112 master.cpp:3758] Authorizing principal 
'test-principal' to create volumes 
'[{"disk":{"persistence":{"id":"id1","principal":"test-principal"},"volume":{"container_path":"path1","mode":"RW"}},"name":"disk","reservations":[{"role":"role1","type":"STATIC"}],"scalar":{"value":64.0},"type":"SCALAR"}]'
I1005 17:30:05.347273 4284416 master.cpp:9335] Sending updated checkpointed 
resources disk(reservations: [(STATIC,role1)])[id1:path1]:64 to agent 
c13b88fc-9b34-4d5e-9641-e4faec3190f1-S0 at slave(459)@127.0.0.1:60330 
(localhost)
W1005 17:30:05.347931 2025730048 sched.cpp:1711] 
**************************************************
Scheduler driver bound to loopback interface! Cannot communicate with remote 
master(s). You might want to set 'LIBPROCESS_IP' environment variable to use a 
routable IP address.
**************************************************
W1005 17:30:05.347995 2025730048 process.cpp:3194] Attempted to spawn already 
running process [email protected]:60330
I1005 17:30:05.348299 2674688 slave.cpp:1263] Forwarding total resources 
disk(reservations: [(STATIC,role1)]):1024; cpus:8; mem:15360; 
ports:[31000-32000]
I1005 17:30:05.348356 2674688 slave.cpp:1272] Forwarding total oversubscribed 
resources {}
I1005 17:30:05.348551 2138112 master.cpp:6804] Received update of agent 
c13b88fc-9b34-4d5e-9641-e4faec3190f1-S0 at slave(459)@127.0.0.1:60330 
(localhost) with total resources disk(reservations: [(STATIC,role1)]):1024; 
cpus:8; mem:15360; ports:[31000-32000]
I1005 17:30:05.348595 2138112 master.cpp:6817] Received update of agent 
c13b88fc-9b34-4d5e-9641-e4faec3190f1-S0 at slave(459)@127.0.0.1:60330 
(localhost) with total oversubscribed resources {}
I1005 17:30:05.349814 2674688 slave.cpp:3535] Updated checkpointed resources 
from {} to disk(reservations: [(STATIC,role1)])[id1:path1]:64
{noformat}
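
For reference, a minimal sketch of the test-side workaround, using the 
standard test helpers ({{FUTURE_PROTOBUF}}, {{AWAIT_READY}}) from the Mesos 
test suite; the surrounding setup is illustrative, not the exact test code:
{noformat}
// Intercept the UpdateSlaveMessage the agent sends right after
// registration, so the test can wait until the master has processed it.
Future<UpdateSlaveMessage> updateSlaveMessage =
  FUTURE_PROTOBUF(UpdateSlaveMessage(), _, _);

Try<Owned<cluster::Slave>> slave = StartSlave(detector.get(), slaveFlags);
ASSERT_SOME(slave);

// Once this update has arrived, a subsequent CREATE_VOLUMES cannot be
// clobbered by the stale total-resources update, so the next offer is
// guaranteed to contain the persistent volume.
AWAIT_READY(updateSlaveMessage);

// ... issue the v1 CREATE_VOLUMES call and wait for offers as before.
{noformat}
Note this only de-flakes the test; the underlying agent-master race would 
still need a separate fix.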

> ContentType/MasterAPITest.CreateAndDestroyVolumes is flaky.
> -----------------------------------------------------------
>
>                 Key: MESOS-7997
>                 URL: https://issues.apache.org/jira/browse/MESOS-7997
>             Project: Mesos
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 1.5.0
>         Environment: Ubuntu 17.04 with SSL
>            Reporter: Alexander Rukletsov
>              Labels: flaky-test, mesosphere
>         Attachments: CreateAndDestroyVolumes_badrun.txt, 
> CreateAndDestroyVolumes_goodrun.txt
>
>
> Observed a failure on the internal CI:
> {noformat}
> ../../src/tests/api_tests.cpp:3052
> Value of: Resources(offer.resources()).contains( allocatedResources(volume, 
> frameworkInfo.role()))
>   Actual: false
> Expected: true
> {noformat}
> Full log attached.


