[
https://issues.apache.org/jira/browse/MESOS-7997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16194534#comment-16194534
]
Alexander Rukletsov commented on MESOS-7997:
--------------------------------------------
I was able to reproduce an instance of this issue locally with some extra logs
added. As we can see below, sometimes an {{updateAvailable}} call can sneak in
between and lead to an offer without the persistent volume, even though we know
the volume has been created. Compare the offer from a "bad" run:
{noformat}
[{"allocation_info":{"role":"role1"},"name":"disk","reservations":[{"role":"role1","type":"STATIC"}],"scalar":{"value":1024.0},"type":"SCALAR"},{"allocation_info":{"role":"role1"},"name":"cpus","scalar":{"value":8.0},"type":"SCALAR"},{"allocation_info":{"role":"role1"},"name":"mem","scalar":{"value":15360.0},"type":"SCALAR"},{"allocation_info":{"role":"role1"},"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"type":"RANGES"}]
Allocated: disk(allocated: role1)(reservations: [(STATIC,role1)])[id1:path1]:64
{noformat}
and from a "good" run:
{noformat}
[{"allocation_info":{"role":"role1"},"name":"disk","reservations":[{"role":"role1","type":"STATIC"}],"scalar":{"value":960.0},"type":"SCALAR"},{"allocation_info":{"role":"role1"},"name":"cpus","scalar":{"value":8.0},"type":"SCALAR"},{"allocation_info":{"role":"role1"},"name":"mem","scalar":{"value":15360.0},"type":"SCALAR"},{"allocation_info":{"role":"role1"},"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"type":"RANGES"},{"allocation_info":{"role":"role1"},"disk":{"persistence":{"id":"id1","principal":"test-principal"},"volume":{"container_path":"path1","mode":"RW"}},"name":"disk","reservations":[{"role":"role1","type":"STATIC"}],"scalar":{"value":64.0},"type":"SCALAR"}]
Allocated: disk(allocated: role1)(reservations: [(STATIC,role1)])[id1:path1]:64
{noformat}
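The failing assertion (quoted in the issue description at the bottom of this
message) boils down to the following check; {{volume}} is the 64 MB persistent
volume the test creates, and only the "good" offer contains it:
{noformat}
// The offer must contain the persistent volume, allocated to the
// framework's role. In the "bad" run the offer only has the plain
// disk:1024 resource, so this check fails.
EXPECT_TRUE(Resources(offer.resources()).contains(
    allocatedResources(volume, frameworkInfo.role())));
{noformat}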
A closer look at the "bad" run reveals a race: the offer contains the full
{{disk:1024}} resource without the volume. The agent [sends an
update|https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L1260-L1265]
with its total resources after registering with the master, and this update
arrives at the master _after_ the persistent volume has been created. We can
fix the test by waiting for the {{UpdateSlaveMessage}} to arrive before
requesting a volume (see the sketch after the log below), but I think we should
fix the race in the agent-master communication itself. The race was likely
introduced by
[2af9a5b07dc80151154264e974d03f56a1c25838|https://github.com/apache/mesos/commit/2af9a5b07dc80151154264e974d03f56a1c25838]:
{noformat}
I1005 17:30:05.343058 2138112 hierarchical.cpp:593] Added agent
c13b88fc-9b34-4d5e-9641-e4faec3190f1-S0 (localhost) with disk(reservations:
[(STATIC,role1)]):1024; cpus:8; mem:15360; ports:[31000-32000] (allocated: {})
I1005 17:30:05.343085 528384 status_update_manager.cpp:184] Resuming sending
status updates
I1005 17:30:05.342851 4284416 master.cpp:6032] Registered agent
c13b88fc-9b34-4d5e-9641-e4faec3190f1-S0 at slave(459)@127.0.0.1:60330
(localhost) with
[{"name":"disk","reservations":[{"role":"role1","type":"STATIC"}],"scalar":{"value":1024.0},"type":"SCALAR"},{"name":"cpus","scalar":{"value":8.0},"type":"SCALAR"},{"name":"mem","scalar":{"value":15360.0},"type":"SCALAR"},{"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"type":"RANGES"}]
I1005 17:30:05.343230 2138112 hierarchical.cpp:1945] No allocations performed
I1005 17:30:05.343252 2138112 hierarchical.cpp:1488] Performed allocation for 1
agents in 94us
I1005 17:30:05.344882 3747840 process.cpp:3929] Handling HTTP event for process
'master' with path: '/master/api/v1'
I1005 17:30:05.345293 2674688 slave.cpp:1213] Checkpointing SlaveInfo to
'/var/folders/h3/8j18s1cx2bn78ms99d3lz4jh0000gq/T/ContentType_MasterAPITest_CreateAndDestroyVolumes_1_RWSUpI/meta/slaves/c13b88fc-9b34-4d5e-9641-e4faec3190f1-S0/slave.info'
I1005 17:30:05.345422 2138112 http.cpp:1185] HTTP POST for /master/api/v1 from
127.0.0.1:61576
I1005 17:30:05.345844 2138112 http.cpp:673] Processing call CREATE_VOLUMES
I1005 17:30:05.346055 2138112 master.cpp:3758] Authorizing principal
'test-principal' to create volumes
'[{"disk":{"persistence":{"id":"id1","principal":"test-principal"},"volume":{"container_path":"path1","mode":"RW"}},"name":"disk","reservations":[{"role":"role1","type":"STATIC"}],"scalar":{"value":64.0},"type":"SCALAR"}]'
I1005 17:30:05.347273 4284416 master.cpp:9335] Sending updated checkpointed
resources disk(reservations: [(STATIC,role1)])[id1:path1]:64 to agent
c13b88fc-9b34-4d5e-9641-e4faec3190f1-S0 at slave(459)@127.0.0.1:60330
(localhost)
W1005 17:30:05.347931 2025730048 sched.cpp:1711]
**************************************************
Scheduler driver bound to loopback interface! Cannot communicate with remote
master(s). You might want to set 'LIBPROCESS_IP' environment variable to use a
routable IP address.
**************************************************
W1005 17:30:05.347995 2025730048 process.cpp:3194] Attempted to spawn already
running process [email protected]:60330
I1005 17:30:05.348299 2674688 slave.cpp:1263] Forwarding total resources
disk(reservations: [(STATIC,role1)]):1024; cpus:8; mem:15360;
ports:[31000-32000]
I1005 17:30:05.348356 2674688 slave.cpp:1272] Forwarding total oversubscribed
resources {}
I1005 17:30:05.348551 2138112 master.cpp:6804] Received update of agent
c13b88fc-9b34-4d5e-9641-e4faec3190f1-S0 at slave(459)@127.0.0.1:60330
(localhost) with total resources disk(reservations: [(STATIC,role1)]):1024;
cpus:8; mem:15360; ports:[31000-32000]
I1005 17:30:05.348595 2138112 master.cpp:6817] Received update of agent
c13b88fc-9b34-4d5e-9641-e4faec3190f1-S0 at slave(459)@127.0.0.1:60330
(localhost) with total oversubscribed resources {}
I1005 17:30:05.349814 2674688 slave.cpp:3535] Updated checkpointed resources
from {} to disk(reservations: [(STATIC,role1)])[id1:path1]:64
{noformat}
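For reference, here is a minimal sketch of the test-side workaround mentioned
above, assuming the standard libprocess test helpers ({{FUTURE_PROTOBUF}},
{{AWAIT_READY}}) used throughout {{src/tests}}; the rest of the test setup is
elided:
{noformat}
// Intercept the UpdateSlaveMessage the agent sends to the master right
// after (re-)registering.
Future<UpdateSlaveMessage> updateSlaveMessage =
  FUTURE_PROTOBUF(UpdateSlaveMessage(), _, master.get()->pid);

Try<Owned<cluster::Slave>> slave = StartSlave(detector.get(), slaveFlags);
ASSERT_SOME(slave);

// Wait until the master has processed the agent's total resources;
// only then is it safe to issue CREATE_VOLUMES, since a late update
// would overwrite the master's view and drop the volume from offers.
AWAIT_READY(updateSlaveMessage);

// ... proceed with the CREATE_VOLUMES call as before ...
{noformat}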
> ContentType/MasterAPITest.CreateAndDestroyVolumes is flaky.
> -----------------------------------------------------------
>
> Key: MESOS-7997
> URL: https://issues.apache.org/jira/browse/MESOS-7997
> Project: Mesos
> Issue Type: Bug
> Components: test
> Affects Versions: 1.5.0
> Environment: Ubuntu 17.04 with SSL
> Reporter: Alexander Rukletsov
> Labels: flaky-test, mesosphere
> Attachments: CreateAndDestroyVolumes_badrun.txt,
> CreateAndDestroyVolumes_goodrun.txt
>
>
> Observed a failure on the internal CI:
> {noformat}
> ../../src/tests/api_tests.cpp:3052
> Value of: Resources(offer.resources()).contains( allocatedResources(volume,
> frameworkInfo.role()))
> Actual: false
> Expected: true
> {noformat}
> Full log attached.