[
https://issues.apache.org/jira/browse/MESOS-9208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16603662#comment-16603662
]
Chun-Hung Hsiao edited comment on MESOS-9208 at 9/4/18 10:32 PM:
-----------------------------------------------------------------
An interesting observation is that in good runs, the duration between
"{{executor.cpp:693}}" and "{{executor.cpp:994}}" is about 150ms:
{noformat}
I0904 17:07:45.835467 1316 executor.cpp:693] Forked command at 1325
I0904 17:07:45.836033 29817 slave.cpp:5269] Handling status update TASK_RUNNING
(Status UUID: d833aadb-6973-4a97-a118-61d41f36524f) for task
5b716438-87ae-4ea6-b3b9-5d08a462cbc9 of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000 from executor(1)@172.16.10.215:41460
I0904 17:07:45.836513 29817 task_status_update_manager.cpp:328] Received task
status update TASK_RUNNING (Status UUID: d833aadb-6973-4a97-a118-61d41f36524f)
for task 5b716438-87ae-4ea6-b3b9-5d08a462cbc9 of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000
I0904 17:07:45.836566 29817 task_status_update_manager.cpp:383] Forwarding task
status update TASK_RUNNING (Status UUID: d833aadb-6973-4a97-a118-61d41f36524f)
for task 5b716438-87ae-4ea6-b3b9-5d08a462cbc9 of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000 to the agent
I0904 17:07:45.836644 29817 slave.cpp:5761] Forwarding the update TASK_RUNNING
(Status UUID: d833aadb-6973-4a97-a118-61d41f36524f) for task
5b716438-87ae-4ea6-b3b9-5d08a462cbc9 of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000 to [email protected]:34394
I0904 17:07:45.836715 29817 slave.cpp:5654] Task status update manager
successfully handled status update TASK_RUNNING (Status UUID:
d833aadb-6973-4a97-a118-61d41f36524f) for task
5b716438-87ae-4ea6-b3b9-5d08a462cbc9 of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000
I0904 17:07:45.836736 29817 slave.cpp:5670] Sending acknowledgement for status
update TASK_RUNNING (Status UUID: d833aadb-6973-4a97-a118-61d41f36524f) for
task 5b716438-87ae-4ea6-b3b9-5d08a462cbc9 of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000 to executor(1)@172.16.10.215:41460
I0904 17:07:45.836864 29817 master.cpp:8332] Status update TASK_RUNNING (Status
UUID: d833aadb-6973-4a97-a118-61d41f36524f) for task
5b716438-87ae-4ea6-b3b9-5d08a462cbc9 of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000 from agent
5af3c866-4f98-416b-84a2-b63f7aa71dfd-S0 at slave(752)@172.16.10.215:34394
(ip-172-16-10-215.ec2.internal)
I0904 17:07:45.836895 29817 master.cpp:8389] Forwarding status update
TASK_RUNNING (Status UUID: d833aadb-6973-4a97-a118-61d41f36524f) for task
5b716438-87ae-4ea6-b3b9-5d08a462cbc9 of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000
I0904 17:07:45.837005 29817 master.cpp:10879] Updating the state of task
5b716438-87ae-4ea6-b3b9-5d08a462cbc9 of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000 (latest state: TASK_RUNNING, status
update state: TASK_RUNNING)
I0904 17:07:45.837113 29817 sched.cpp:1022] Scheduler::statusUpdate took 46577ns
I0904 17:07:45.837211 29817 master.cpp:6198] Processing ACKNOWLEDGE call for
status d833aadb-6973-4a97-a118-61d41f36524f for task
5b716438-87ae-4ea6-b3b9-5d08a462cbc9 of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000 (default) at
[email protected]:34394 on agent
5af3c866-4f98-416b-84a2-b63f7aa71dfd-S0
I0904 17:07:45.837340 29817 task_status_update_manager.cpp:401] Received task
status update acknowledgement (UUID: d833aadb-6973-4a97-a118-61d41f36524f) for
task 5b716438-87ae-4ea6-b3b9-5d08a462cbc9 of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000
I0904 17:07:45.837407 29817 slave.cpp:4505] Task status update manager
successfully handled status update acknowledgement (UUID:
d833aadb-6973-4a97-a118-61d41f36524f) for task
5b716438-87ae-4ea6-b3b9-5d08a462cbc9 of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000
I0904 17:07:45.838316 29813 hierarchical.cpp:2386] Filtered offer with cpus:2;
mem:1024; disk:1024; ports:[31000-32000] on agent
5af3c866-4f98-416b-84a2-b63f7aa71dfd-S0 for role storage of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000
I0904 17:07:45.838372 29813 hierarchical.cpp:1564] Performed allocation for 1
agents in 165353ns
I0904 17:07:45.888976 29816 hierarchical.cpp:2386] Filtered offer with cpus:2;
mem:1024; disk:1024; ports:[31000-32000] on agent
5af3c866-4f98-416b-84a2-b63f7aa71dfd-S0 for role storage of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000
I0904 17:07:45.889056 29816 hierarchical.cpp:1564] Performed allocation for 1
agents in 224502ns
I0904 17:07:45.939352 29812 hierarchical.cpp:2386] Filtered offer with cpus:2;
mem:1024; disk:1024; ports:[31000-32000] on agent
5af3c866-4f98-416b-84a2-b63f7aa71dfd-S0 for role storage of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000
I0904 17:07:45.939410 29812 hierarchical.cpp:1564] Performed allocation for 1
agents in 163889ns
I0904 17:07:45.985908 1317 executor.cpp:994] Command exited with status 0
(pid: 1325)
{noformat}
But in the bad run it took nearly 2 seconds:
{noformat}
I0904 17:13:05.861194 2022 executor.cpp:693] Forked command at 2027
I0904 17:13:05.858180 30521 task_status_update_manager.cpp:328] Received task
status update TASK_STARTING (Status UUID: 7ed5c225-da67-4aa8-91bb-81078750215d)
for task 8a6c78ae-ffbb-4f17-b977-3c030f2b0353 of framework
c106904d-8084-4d0b-9a51-2cc1f0b55945-0000
I0904 17:13:07.720315 30521 task_status_update_manager.cpp:507] Creating
StatusUpdate stream for task 8a6c78ae-ffbb-4f17-b977-3c030f2b0353 of framework
c106904d-8084-4d0b-9a51-2cc1f0b55945-0000
I0904 17:13:07.720567 2022 executor.cpp:994] Failed to get exit status for
Command (pid: 2027)
{noformat}
And there was no logging activity at all in 17:13:06.
was (Author: chhsia0):
An interesting observation is that in good runs, the duration between the
following log lines is about 150ms:
{noformat}
I0904 17:07:45.835467 1316 executor.cpp:693] Forked command at 1325
I0904 17:07:45.836033 29817 slave.cpp:5269] Handling status update TASK_RUNNING
(Status UUID: d833aadb-6973-4a97-a118-61d41f36524f) for task
5b716438-87ae-4ea6-b3b9-5d08a462cbc9 of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000 from executor(1)@172.16.10.215:41460
I0904 17:07:45.836513 29817 task_status_update_manager.cpp:328] Received task
status update TASK_RUNNING (Status UUID: d833aadb-6973-4a97-a118-61d41f36524f)
for task 5b716438-87ae-4ea6-b3b9-5d08a462cbc9 of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000
I0904 17:07:45.836566 29817 task_status_update_manager.cpp:383] Forwarding task
status update TASK_RUNNING (Status UUID: d833aadb-6973-4a97-a118-61d41f36524f)
for task 5b716438-87ae-4ea6-b3b9-5d08a462cbc9 of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000 to the agent
I0904 17:07:45.836644 29817 slave.cpp:5761] Forwarding the update TASK_RUNNING
(Status UUID: d833aadb-6973-4a97-a118-61d41f36524f) for task
5b716438-87ae-4ea6-b3b9-5d08a462cbc9 of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000 to [email protected]:34394
I0904 17:07:45.836715 29817 slave.cpp:5654] Task status update manager
successfully handled status update TASK_RUNNING (Status UUID:
d833aadb-6973-4a97-a118-61d41f36524f) for task
5b716438-87ae-4ea6-b3b9-5d08a462cbc9 of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000
I0904 17:07:45.836736 29817 slave.cpp:5670] Sending acknowledgement for status
update TASK_RUNNING (Status UUID: d833aadb-6973-4a97-a118-61d41f36524f) for
task 5b716438-87ae-4ea6-b3b9-5d08a462cbc9 of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000 to executor(1)@172.16.10.215:41460
I0904 17:07:45.836864 29817 master.cpp:8332] Status update TASK_RUNNING (Status
UUID: d833aadb-6973-4a97-a118-61d41f36524f) for task
5b716438-87ae-4ea6-b3b9-5d08a462cbc9 of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000 from agent
5af3c866-4f98-416b-84a2-b63f7aa71dfd-S0 at slave(752)@172.16.10.215:34394
(ip-172-16-10-215.ec2.internal)
I0904 17:07:45.836895 29817 master.cpp:8389] Forwarding status update
TASK_RUNNING (Status UUID: d833aadb-6973-4a97-a118-61d41f36524f) for task
5b716438-87ae-4ea6-b3b9-5d08a462cbc9 of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000
I0904 17:07:45.837005 29817 master.cpp:10879] Updating the state of task
5b716438-87ae-4ea6-b3b9-5d08a462cbc9 of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000 (latest state: TASK_RUNNING, status
update state: TASK_RUNNING)
I0904 17:07:45.837113 29817 sched.cpp:1022] Scheduler::statusUpdate took 46577ns
I0904 17:07:45.837211 29817 master.cpp:6198] Processing ACKNOWLEDGE call for
status d833aadb-6973-4a97-a118-61d41f36524f for task
5b716438-87ae-4ea6-b3b9-5d08a462cbc9 of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000 (default) at
[email protected]:34394 on agent
5af3c866-4f98-416b-84a2-b63f7aa71dfd-S0
I0904 17:07:45.837340 29817 task_status_update_manager.cpp:401] Received task
status update acknowledgement (UUID: d833aadb-6973-4a97-a118-61d41f36524f) for
task 5b716438-87ae-4ea6-b3b9-5d08a462cbc9 of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000
I0904 17:07:45.837407 29817 slave.cpp:4505] Task status update manager
successfully handled status update acknowledgement (UUID:
d833aadb-6973-4a97-a118-61d41f36524f) for task
5b716438-87ae-4ea6-b3b9-5d08a462cbc9 of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000
I0904 17:07:45.838316 29813 hierarchical.cpp:2386] Filtered offer with cpus:2;
mem:1024; disk:1024; ports:[31000-32000] on agent
5af3c866-4f98-416b-84a2-b63f7aa71dfd-S0 for role storage of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000
I0904 17:07:45.838372 29813 hierarchical.cpp:1564] Performed allocation for 1
agents in 165353ns
I0904 17:07:45.888976 29816 hierarchical.cpp:2386] Filtered offer with cpus:2;
mem:1024; disk:1024; ports:[31000-32000] on agent
5af3c866-4f98-416b-84a2-b63f7aa71dfd-S0 for role storage of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000
I0904 17:07:45.889056 29816 hierarchical.cpp:1564] Performed allocation for 1
agents in 224502ns
I0904 17:07:45.939352 29812 hierarchical.cpp:2386] Filtered offer with cpus:2;
mem:1024; disk:1024; ports:[31000-32000] on agent
5af3c866-4f98-416b-84a2-b63f7aa71dfd-S0 for role storage of framework
5af3c866-4f98-416b-84a2-b63f7aa71dfd-0000
I0904 17:07:45.939410 29812 hierarchical.cpp:1564] Performed allocation for 1
agents in 163889ns
I0904 17:07:45.985908 1317 executor.cpp:994] Command exited with status 0
(pid: 1325)
{noformat}
But in the bad run it took nearly 2 seconds:
{noformat}
I0904 17:13:05.861194 2022 executor.cpp:693] Forked command at 2027
I0904 17:13:05.858180 30521 task_status_update_manager.cpp:328] Received task
status update TASK_STARTING (Status UUID: 7ed5c225-da67-4aa8-91bb-81078750215d)
for task 8a6c78ae-ffbb-4f17-b977-3c030f2b0353 of framework
c106904d-8084-4d0b-9a51-2cc1f0b55945-0000
I0904 17:13:07.720315 30521 task_status_update_manager.cpp:507] Creating
StatusUpdate stream for task 8a6c78ae-ffbb-4f17-b977-3c030f2b0353 of framework
c106904d-8084-4d0b-9a51-2cc1f0b55945-0000
I0904 17:13:07.720567 2022 executor.cpp:994] Failed to get exit status for
Command (pid: 2027)
{noformat}
And there was no logging activity at all in 17:13:06.
> Test `StorageLocalResourceProviderTest.ROOT_PublishResourcesReboot` is flaky.
> -----------------------------------------------------------------------------
>
> Key: MESOS-9208
> URL: https://issues.apache.org/jira/browse/MESOS-9208
> Project: Mesos
> Issue Type: Bug
> Components: test
> Affects Versions: 1.8.0
> Reporter: Chun-Hung Hsiao
> Priority: Major
> Labels: flaky-test, storage
> Attachments: bad_run.txt
>
>
> Test {{StorageLocalResourceProviderTest.ROOT_PublishResourcesReboot}} is
> observed to be flaky on ubuntu-16.04 with a plain build (i.e., no special
> configuration):
> {noformat}
> ../../src/tests/storage_local_resource_provider_tests.cpp:2393
> Expected: TASK_FINISHED
> To be equal to: taskFinished->state()
> Which is: TASK_FAILED{noformat}
> However, further investigation shows that the task was failed due to the
> following error:
> {noformat}
> executor.cpp:994] Failed to get exit status for Command (pid: 2027){noformat}
> Which indicates that the executor couldn't reap the forked child, possibly
> because it had been mysteriously reaped already.
> This doesn't sound like a flakiness specific to this particular test, so I'll
> leave the test enabled for now.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)