[ https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090969#comment-17090969 ]
Charles Natali commented on MESOS-8038:
---------------------------------------
See log attached ([^mesos_agent.log]).
See my interpretation below, keeping in mind that I'm not familiar with the
code, so I might be completely wrong :).
Before the error occurs, we can see the following warning in the agent log:
{noformat}
W0423 22:46:19.277667 20524 containerizer.cpp:2428] Ignoring update for
currently being destroyed container 6f446173-2bba-4cc4-bc15-c956bc159d4e
{noformat}
Looking at the logs, we can see that, compared to a successful run, the
containerizer's update method is called while the container is being destroyed.
Here is an example of a successful task: the slave receives the task status
update, forwards it, and then sends back the acknowledgement, which causes the
executor to exit and the container to be destroyed - i.e. the destruction
happens only after the task status update has been processed:
{noformat}
I0423 22:43:55.771867 20519 slave.cpp:5950] Handling status update
TASK_FINISHED (Status UUID: e135ba0c-6dde-4cce-a1cd-42ea7ba86df5) for task
task-0032d63f-91be-4522-bd03-ae0656e79d40 of framework
0142aec2-d0c1-4011-8340-d81107d40fce-0000 from executor(1)@127.0.1.1:35471
I0423 22:43:55.787933 20518 memory.cpp:287] Updated
'memory.soft_limit_in_bytes' to 32MB for container
5a7984a6-cecb-40ea-843c-8ed28cd92330
I0423 22:43:55.788524 20523 cpu.cpp:94] Updated 'cpu.shares' to 102 (cpus 0.1)
for container 5a7984a6-cecb-40ea-843c-8ed28cd92330
I0423 22:43:55.794132 20522 task_status_update_manager.cpp:328] Received task
status update TASK_FINISHED (Status UUID: e135ba0c-6dde-4cce-a1cd-42ea7ba86df5)
for task task-0032d63f-91be-4522-bd03-ae0656e79d40 of framework
0142aec2-d0c1-4011-8340-d81107d40fce-0000
I0423 22:43:55.794495 20522 task_status_update_manager.cpp:383] Forwarding task
status update TASK_FINISHED (Status UUID: e135ba0c-6dde-4cce-a1cd-42ea7ba86df5)
for task task-0032d63f-91be-4522-bd03-ae0656e79d40 of framework
0142aec2-d0c1-4011-8340-d81107d40fce-0000 to the agent
I0423 22:43:55.795053 20522 slave.cpp:6496] Forwarding the update TASK_FINISHED
(Status UUID: e135ba0c-6dde-4cce-a1cd-42ea7ba86df5) for task
task-0032d63f-91be-4522-bd03-ae0656e79d40 of framework
0142aec2-d0c1-4011-8340-d81107d40fce-0000 to [email protected]:5050
I0423 22:43:55.812129 20522 slave.cpp:6380] Task status update manager
successfully handled status update TASK_FINISHED (Status UUID:
e135ba0c-6dde-4cce-a1cd-42ea7ba86df5) for task
task-0032d63f-91be-4522-bd03-ae0656e79d40 of framework
0142aec2-d0c1-4011-8340-d81107d40fce-0000
I0423 22:43:55.813238 20522 slave.cpp:6407] Sending acknowledgement for status
update TASK_FINISHED (Status UUID: e135ba0c-6dde-4cce-a1cd-42ea7ba86df5) for
task task-0032d63f-91be-4522-bd03-ae0656e79d40 of framework
0142aec2-d0c1-4011-8340-d81107d40fce-0000 to executor(1)@127.0.1.1:35471
[...]
I0423 22:43:57.005844 20521 slave.cpp:6676] Got exited event for
executor(1)@127.0.1.1:35471
I0423 22:43:57.205157 20522 containerizer.cpp:3159] Container
5a7984a6-cecb-40ea-843c-8ed28cd92330 has exited
I0423 22:43:57.205278 20522 containerizer.cpp:2623] Destroying container
5a7984a6-cecb-40ea-843c-8ed28cd92330 in RUNNING state
I0423 22:43:57.205379 20522 containerizer.cpp:3321] Transitioning the state of
container 5a7984a6-cecb-40ea-843c-8ed28cd92330 from RUNNING to DESTROYING after
4.612041984secs
I0423 22:43:57.206100 20523 linux_launcher.cpp:564] Asked to destroy container
5a7984a6-cecb-40ea-843c-8ed28cd92330
{noformat}
Now let's look at what happens when the task right before the task which fails
with "Requested 1 gpus but only 0 available" finishes:
{noformat}
I0423 22:46:16.506460 20519 slave.cpp:5950] Handling status update
TASK_FINISHED (Status UUID: 4b7a01c5-15af-47a3-b06b-5ed8f7d65405) for task
task-650af3bd-3f5b-4e17-9d34-4642480b4da0 of framework
0142aec2-d0c1-4011-8340-d81107d40fce-0000 from executor(1)@127.0.1.1:36541
I0423 22:46:17.560580 20521 slave.cpp:6676] Got exited event for
executor(1)@127.0.1.1:36541
I0423 22:46:18.701063 20523 linux_launcher.cpp:638] Destroying cgroup
'/sys/fs/cgroup/systemd/mesos/8a4e52e5-eab6-43e7-8bd1-9b9248614e69'
I0423 22:46:19.236407 20525 slave.cpp:7076] Executor
'task-376cdda6-760b-4d3b-ad7f-2d86916695a3' of framework
0142aec2-d0c1-4011-8340-d81107d40fce-0000 exited with status 0
I0423 22:46:19.237376 20525 slave.cpp:7187] Cleaning up executor
'task-376cdda6-760b-4d3b-ad7f-2d86916695a3' of framework
0142aec2-d0c1-4011-8340-d81107d40fce-0000 at executor(1)@127.0.1.1:41227
I0423 22:46:19.241185 20522 gc.cpp:95] Scheduling
'/tmp/mesos_agent/work/slaves/0142aec2-d0c1-4011-8340-d81107d40fce-S0/frameworks/0142aec2-d0c1-4011-8340-d81107d40fce-0000/executors/task-376cdda6-760b-4d3b-ad7f-2d86916695a3/runs/0c718138-d6f3-42d4-9de7-4dac7d518dc5'
for gc 9.9959984512mins in the future
I0423 22:46:19.249259 20522 gc.cpp:95] Scheduling
'/tmp/mesos_agent/work/slaves/0142aec2-d0c1-4011-8340-d81107d40fce-S0/frameworks/0142aec2-d0c1-4011-8340-d81107d40fce-0000/executors/task-376cdda6-760b-4d3b-ad7f-2d86916695a3'
for gc 9.9959792mins in the future
I0423 22:46:19.254784 20524 containerizer.cpp:3159] Container
6f446173-2bba-4cc4-bc15-c956bc159d4e has exited
I0423 22:46:19.254947 20524 containerizer.cpp:2623] Destroying container
6f446173-2bba-4cc4-bc15-c956bc159d4e in RUNNING state
I0423 22:46:19.255359 20524 containerizer.cpp:3321] Transitioning the state of
container 6f446173-2bba-4cc4-bc15-c956bc159d4e from RUNNING to DESTROYING after
5.056560128secs
W0423 22:46:19.277667 20524 containerizer.cpp:2428] Ignoring update for
currently being destroyed container 6f446173-2bba-4cc4-bc15-c956bc159d4e
I0423 22:46:19.279716 20519 task_status_update_manager.cpp:328] Received task
status update TASK_FINISHED (Status UUID: 4b7a01c5-15af-47a3-b06b-5ed8f7d65405)
for task task-650af3bd-3f5b-4e17-9d34-4642480b4da0 of framework
0142aec2-d0c1-4011-8340-d81107d40fce-0000
I0423 22:46:19.280694 20519 task_status_update_manager.cpp:383] Forwarding task
status update TASK_FINISHED (Status UUID: 4b7a01c5-15af-47a3-b06b-5ed8f7d65405)
for task task-650af3bd-3f5b-4e17-9d34-4642480b4da0 of framework
0142aec2-d0c1-4011-8340-d81107d40fce-0000 to the agent
I0423 22:46:19.280125 20524 linux_launcher.cpp:564] Asked to destroy container
6f446173-2bba-4cc4-bc15-c956bc159d4e
{noformat}
We can see "Ignoring update for currently being destroyed container...", and I
think that's what causing the problem.
Looking at the slave code, we can see this
([https://github.com/apache/mesos/blob/ad612599df9e866b2a622baa4cdb659ece5c4574/src/slave/slave.cpp#L6285]):
{noformat}
  if (protobuf::isTerminalState(status.state())) {
    // If the task terminated, wait until the container's resources
    // have been updated before sending the status update. Note that
    // duplicate terminal updates are not possible here because they
    // lead to an error from `Executor::updateTaskState`.
    containerizer->update(
        executor->containerId,
        executor->allocatedResources(),
        computeExecutorLimits(
            executor->info.resources(),
            executor->queuedTasks.values(),
            executor->launchedTasks.values()))
      .onAny(defer(self(),
                   &Slave::__statusUpdate,
                   lambda::_1,
                   update,
                   pid,
                   executor->id,
                   executor->containerId,
                   executor->checkpoint));
{noformat}
So when the task reaches a terminal state, the status update is only sent once
the containerizer has been updated. But the update method contains this
([https://github.com/apache/mesos/blob/ad612599df9e866b2a622baa4cdb659ece5c4574/src/slave/containerizer/mesos/containerizer.cpp#L2427]):
{noformat}
  if (container->state == DESTROYING) {
    LOG(WARNING) << "Ignoring update for currently being destroyed "
                 << "container " << containerId;
    return Nothing();
  }
{noformat}
So IIUC, if the container is being destroyed the update is a no-op: it still
returns Nothing() and completes successfully, so the terminal status update is
forwarded as usual, but the container's resources are never actually updated.
I guess that from that point the agent assumes that the corresponding
resources - among which the gpu - are available, even though they haven't
really been released yet?
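To make my guess a bit more concrete, here is a toy model (plain C++, nothing
to do with the real Mesos allocator - all names are made up) of the accounting
gap I have in mind; it ends up printing the same kind of error as in the
original issue description:
{noformat}
// Toy model only - NOT Mesos code. It just illustrates the hypothesised gap:
// the framework is told the task finished (the status update is still
// forwarded), but the container's GPU is never handed back because the
// containerizer update was skipped while the container was DESTROYING.
#include <iostream>
#include <stdexcept>
#include <string>

class ToyGpuPool
{
public:
  explicit ToyGpuPool(unsigned int total) : available(total) {}

  void allocate(unsigned int requested)
  {
    if (requested > available) {
      throw std::runtime_error(
          "Requested " + std::to_string(requested) + " but only " +
          std::to_string(available) + " available");
    }
    available -= requested;
  }

  void release(unsigned int count) { available += count; }

private:
  unsigned int available;
};

int main()
{
  ToyGpuPool pool(1);

  pool.allocate(1); // Container A runs a task that uses the single GPU.

  // The task finishes and TASK_FINISHED is forwarded, so the GPU looks free
  // again to the scheduler... but container A is DESTROYING, the update was
  // a no-op, and pool.release(1) never happens at this point.

  try {
    pool.allocate(1); // Container B is launched expecting a free GPU.
  } catch (const std::exception& e) {
    std::cout << "Collect failed: " << e.what() << std::endl;
  }

  return 0;
}
{noformat}
Again, this is only meant to illustrate how skipping the update could leave
the gpu counted as allocated while the scheduler already considers it free.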
I did a test removing that early return from the containerizer's update
method, and with that change I couldn't reproduce the problem using my test
harness. Obviously I don't know enough about the container lifecycle etc. to
know whether that's a reasonable thing to do.
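For reference, the change I tested was roughly the following, i.e. just
deleting the early return quoted above (only to check the theory - I'm not
proposing it as the actual fix):
{noformat}
--- a/src/slave/containerizer/mesos/containerizer.cpp
+++ b/src/slave/containerizer/mesos/containerizer.cpp
-  if (container->state == DESTROYING) {
-    LOG(WARNING) << "Ignoring update for currently being destroyed "
-                 << "container " << containerId;
-    return Nothing();
-  }
{noformat}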
> Launching GPU task sporadically fails.
> --------------------------------------
>
> Key: MESOS-8038
> URL: https://issues.apache.org/jira/browse/MESOS-8038
> Project: Mesos
> Issue Type: Bug
> Components: containerization, gpu
> Affects Versions: 1.4.0
> Reporter: Sai Teja Ranuva
> Assignee: Zhitao Li
> Priority: Critical
> Attachments: mesos-master.log, mesos-slave-with-issue-uber.txt,
> mesos-slave.INFO.log, mesos_agent.log, start_short_tasks_gpu.py
>
>
> I was running a job which uses GPUs. It runs fine most of the time.
> But occasionally I see the following message in the mesos log.
> "Collect failed: Requested 1 but only 0 available"
> Followed by the executor getting killed and the tasks getting lost. This happens
> even before the job starts. A little search in the code base points me to
> something related to the GPU resource being the probable cause.
> There is no deterministic way that this can be reproduced. It happens
> occasionally.
> I have attached the slave log for the issue.
> Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave.