[ 
https://issues.apache.org/jira/browse/MESOS-10119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17088568#comment-17088568
 ] 

Andrei Budnik commented on MESOS-10119:
---------------------------------------

Could you reproduce the cgroups destruction problem consistently?
What are the kernel and systemd versions installed on your agents?

> failure to destroy container can cause the agent to "leak" a GPU
> ----------------------------------------------------------------
>
>                 Key: MESOS-10119
>                 URL: https://issues.apache.org/jira/browse/MESOS-10119
>             Project: Mesos
>          Issue Type: Task
>          Components: agent, containerization
>            Reporter: Charles Natali
>            Priority: Major
>
> At work we hit the following problem:
>  # cgroup for a task using the GPU isolation failed to be destroyed on OOM
>  # the agent continued advertising the GPU as available
>  # all subsequent attempts to start tasks using a GPU fail with "Requested 1 
> gpus but only 0 available"
> Problem 1 looks like https://issues.apache.org/jira/browse/MESOS-9950 so it 
> can be tackled separately; however, the fact that the agent leaks the GPU is 
> pretty bad, because the machine effectively turns into a /dev/null for GPU 
> workloads, failing all subsequent tasks that request a GPU.
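>  
> To make the suspected bookkeeping mismatch concrete, here is a minimal 
> sketch (hypothetical types and names, not the actual Mesos code): when 
> cleanup fails, the isolator never gets the GPU back, while the agent-level 
> accounting frees the task's resources anyway:
>  
> {noformat}
> // Minimal sketch of the suspected mismatch; the names here are
> // hypothetical, this is not the actual Mesos implementation.
> #include <set>
> 
> bool destroyCgroup() {
>   return false;  // Stand-in: pretend cgroup destruction failed (EBUSY).
> }
> 
> struct GpuIsolator {
>   std::set<int> available;  // GPUs free for new containers.
>   std::set<int> allocated;  // GPUs held by running containers.
> 
>   bool cleanup(int gpu) {
>     if (!destroyCgroup()) {
>       return false;  // `gpu` is stuck in `allocated` forever.
>     }
>     allocated.erase(gpu);
>     available.insert(gpu);
>     return true;
>   }
> };
> 
> // The agent-level accounting releases the task's resources regardless,
> // so the GPU keeps being offered, but every allocation then fails with
> // "Requested 1 gpus but only 0 available".
> {noformat}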
>  
> See the logs:
>  
>  
> {noformat}
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874277 2138 
> memory.cpp:665] Failed to read 'memory.limit_in_bytes': No such file or 
> directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874305 2138 
> memory.cpp:674] Failed to read 'memory.max_usage_in_bytes': No such file or 
> directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874315 2138 
> memory.cpp:686] Failed to read 'memory.stat': No such file or directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874701 2136 
> memory.cpp:665] Failed to read 'memory.limit_in_bytes': No such file or 
> directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874734 2136 
> memory.cpp:674] Failed to read 'memory.max_usage_in_bytes': No such file or 
> directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874747 2136 
> memory.cpp:686] Failed to read 'memory.stat': No such file or directory
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: E0417 17:00:05.062358 2152 
> slave.cpp:6994] Termination of executor 
> 'task_0:067b0963-134f-a917-4503-89b6a2a630ac' of framework 
> c0c4ce82-5cff-4116-aacb-c3fd6a93d61b-0000 failed: Failed to clean up an 
> isolator when destroying container: Failed to destroy cgroups: Failed to get 
> nested cgroups: Failed to determine canonical path of 
> '/sys/fs/cgroup/memory/mesos/8ef00748-b640-4620-97dc-f719e9775e88': No such 
> file or directory
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.063295 2150 
> containerizer.cpp:2567] Skipping status for container 
> 8ef00748-b640-4620-97dc-f719e9775e88 because: Container does not exist
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.063429 2137 
> containerizer.cpp:2428] Ignoring update for currently being destroyed 
> container 8ef00748-b640-4620-97dc-f719e9775e88
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: E0417 17:00:05.079169 2150 
> slave.cpp:6994] Termination of executor 
> 'task_1:a00165a1-123a-db09-6b1a-b6c4054b0acd' of framework 
> c0c4ce82-5cff-4116-aacb-c3fd6a93d61b-0000 failed: Failed to kill all 
> processes in the container: Failed to remove cgroup 
> 'mesos/5c1418f0-1d4d-47cd-a188-0f4b87e394f2': Failed to remove cgroup 
> '/sys/fs/cgroup/freezer/mesos/5c1418f0-1d4d-47cd-a188-0f4b87e394f2': Device 
> or resource busy
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.079537 2140 
> containerizer.cpp:2567] Skipping status for container 
> 5c1418f0-1d4d-47cd-a188-0f4b87e394f2 because: Container does not exist
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.079670 2136 
> containerizer.cpp:2428] Ignoring update for currently being destroyed 
> container 5c1418f0-1d4d-47cd-a188-0f4b87e394f2
> Apr 17 17:00:07 engpuc006 mesos-slave[2068]: E0417 17:00:07.956969 2136 
> slave.cpp:6889] Container '87253521-8d39-47ea-b4d1-febe527d230c' for executor 
> 'task_2:8b129d24-70d2-2cab-b2df-c73911954ec3' of framework 
> c0c4ce82-5cff-4116-aacb-c3fd6a93d61b-0000 failed to start: Requested 1 gpus 
> but only 0 available
> Apr 17 17:00:07 engpuc006 mesos-slave[2068]: E0417 17:00:07.957670 2149 
> memory.cpp:637] Listening on OOM events failed for container 
> 87253521-8d39-47ea-b4d1-febe527d230c: Event listener is terminating
> Apr 17 17:00:07 engpuc006 mesos-slave[2068]: W0417 17:00:07.966552 2150 
> containerizer.cpp:2421] Ignoring update for unknown container 
> 87253521-8d39-47ea-b4d1-febe527d230c
> Apr 17 17:00:08 engpuc006 mesos-slave[2068]: W0417 17:00:08.109067 2154 
> process.cpp:1480] Failed to link to '172.16.22.201:34059', connect: Failed 
> connect: connection closed
> Apr 17 17:00:10 engpuc006 mesos-slave[2068]: E0417 17:00:10.310817 2141 
> slave.cpp:6889] Container '257b45f1-8582-4cb5-8138-454e9697bfe4' for executor 
> 'task_3:6bdd99ca-7a2b-f19c-bbb3-d9478fe8f81e' of framework 
> c0c4ce82-5cff-4116-aacb-c3fd6a93d61b-0000 failed to start: Requested 1 gpus 
> but only 0 available
> Apr 17 17:00:10 engpuc006 mesos-slave[2068]: E0417 17:00:10.311614 2141 
> memory.cpp:637] Listening on OOM events failed for container 
> 257b45f1-8582-4cb5-8138-454e9697bfe4: Event listener is terminating
> Apr 17 17:00:10 engpuc006 mesos-slave[2068]: W0417 17:00:10.316352 2149 
> containerizer.cpp:2421] Ignoring update for unknown container 
> 257b45f1-8582-4cb5-8138-454e9697bfe4
> Apr 17 17:00:12 engpuc006 mesos-slave[2068]: E0417 17:00:12.675910 2146 
> slave.cpp:6889] Container '47f61e7b-8022-4731-93fe-36a356920a4e' for executor 
> 'task_2:9de8db83-39fa-b24a-b5b6-2ef6f1e1713c' of framework 
> c0c4ce82-5cff-4116-aacb-c3fd6a93d61b-0000 failed to start: Requested 1 gpus 
> but only 0 available
> Apr 17 17:00:12 engpuc006 mesos-slave[2068]: E0417 17:00:12.676764 2139 
> memory.cpp:637] Listening on OOM events failed for container 
> 47f61e7b-8022-4731-93fe-36a356920a4e: Event listener is terminating
> Apr 17 17:00:12 engpuc006 mesos-slave[2068]: W0417 17:00:12.681100 2150 
> containerizer.cpp:2421] Ignoring update for unknown container 
> 47f61e7b-8022-4731-93fe-36a356920a4e
>  
> {noformat}
>  
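> For what it's worth, the "Device or resource busy" above is just EBUSY from 
> rmdir(2) on a cgroup directory that still contains attached tasks or live 
> sub-cgroups; a trivial standalone check (hypothetical path, adjust to your 
> hierarchy):
>  
> {noformat}
> // Standalone illustration of the EBUSY in the log above: rmdir(2) on a
> // non-empty cgroup directory fails with "Device or resource busy".
> // The path below is hypothetical.
> #include <cerrno>
> #include <cstdio>
> #include <cstring>
> #include <unistd.h>
> 
> int main() {
>   const char* cgroup = "/sys/fs/cgroup/freezer/mesos/test";
> 
>   if (rmdir(cgroup) == -1) {
>     // Expect EBUSY while tasks or sub-cgroups remain inside.
>     std::fprintf(stderr, "rmdir(%s): %s\n", cgroup, std::strerror(errno));
>     return 1;
>   }
>   return 0;
> }
> {noformat}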
>  
> I'll see if I can reproduce and debug the cgroup destruction problem and 
> report my findings in https://issues.apache.org/jira/browse/MESOS-9950; 
> however, I think the bigger issue is the fact that the agent leaks the 
> GPU.
>  
> I'm wondering which of the following makes sense:
>  # Treat the failure to destroy a container as a fatal error: this turns a 
> Byzantine fault into a clean crash failure, which forces the operator to 
> detect it and deal with it (see the sketch after this list).
>  # Fix the leak by at least no longer advertising the GPU - the obvious 
> downside is that the operator might not notice the failure, and we silently 
> lose a GPU.
>  
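> To sketch what option 1 could look like (hypothetical hook, not the actual 
> agent code path):
>  
> {noformat}
> // Sketch of option 1; onDestroyFailed() is a hypothetical hook, not the
> // actual Mesos agent API. The point is to escalate a failed container
> // destruction into a crash instead of silently leaking resources.
> #include <cstdlib>
> #include <iostream>
> #include <string>
> 
> void onDestroyFailed(const std::string& containerId,
>                      const std::string& error) {
>   std::cerr << "Failed to destroy container " << containerId << ": "
>             << error << "; treating as fatal" << std::endl;
>   // A clean crash makes the fault visible to the operator (or to a
>   // supervisor like systemd, which can restart the agent).
>   std::abort();
> }
> {noformat}
>  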
> [~bmahler] [~abudnik]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
