[
https://issues.apache.org/jira/browse/MESOS-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221645#comment-14221645
]
Charles Baker commented on MESOS-1837:
--------------------------------------
I hit this exact same issue on CentOS 6.5 and was able (through much trial and
error!) to figure out that I had not requested enough memory for my app from
Marathon. Mesos didn't know anything was wrong until the app suddenly died and
the /proc/<pid> directory went away. Troubleshooting was complicated by the fact
that I did not get any Java OOM exceptions on the stdout or stderr streams, but
a tell-tale sign was that my app would only partially start up and never
finished initializing. I wonder whether Marathon knew the state of the app at
this point? I did not see anything in the syslog indicating that it did.
As an aside, I will say that setting isolation, cgroups_root and
cgroups_hierarchy correctly is very important, and the right values differ
radically between CentOS and Ubuntu. However, when those were wrong (for
example, isolation set to cgroups with an incorrect cgroups_root or
cgroups_hierarchy), the main effect was that the slave did not start up at
all. The real problem with this particular error is that /proc/<pid> gets
pulled out from under Mesos.
The error message itself is actually pretty specific, but I imagine this can
happen not just with insufficient memory but with any kind of problem that
causes the app to terminate abruptly.
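
To make the failure mode concrete, here is a rough Python sketch of the kind of
lookup that fails in the log below. It is not the Mesos implementation; the
function names parse_cgroup and cgroup_for_pid are made up for illustration,
and it assumes the cgroup v1 line format "hierarchy-id:subsystems:path" in
/proc/<pid>/cgroup. If the process has already exited, the file is simply gone
and the lookup has nothing to read.

```python
def parse_cgroup(text, subsystem):
    """Return the cgroup path for `subsystem` from /proc/<pid>/cgroup text, or None."""
    for line in text.splitlines():
        # cgroup v1 lines look like "hierarchy-id:subsys1,subsys2:/path".
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        if subsystem in parts[1].split(","):
            return parts[2]
    return None


def cgroup_for_pid(pid, subsystem="cpu"):
    """Look up the cgroup of a live process; return None if the process is gone."""
    try:
        with open("/proc/%d/cgroup" % pid) as f:
            return parse_cgroup(f.read(), subsystem)
    except (FileNotFoundError, ProcessLookupError):
        # The process exited, so /proc/<pid> no longer exists (or vanished
        # while we were reading it).
        return None
```

A caller that gets None back from cgroup_for_pid cannot tell from this alone
why the process died, only that it is no longer there, which matches what I
saw: the error is specific about the missing file but not about the cause.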
> failed to determine cgroup for the 'cpu' subsystem
> --------------------------------------------------
>
> Key: MESOS-1837
> URL: https://issues.apache.org/jira/browse/MESOS-1837
> Project: Mesos
> Issue Type: Bug
> Components: general
> Affects Versions: 0.20.1
> Environment: Ubuntu 14.04
> Reporter: Chris Fortier
>
> Attempting to launch Docker container with Marathon. Container is launched
> then fails.
> A search of /var/log/syslog reveals:
> Sep 27 03:01:43 vagrant-ubuntu-trusty-64 mesos-slave[1409]: E0927
> 03:01:43.546957 1463 slave.cpp:2205] Failed to update resources for
> container 8c2429d9-f090-4443-8108-0206ca37f3fd of executor
> hello-world.970dbe74-45f2-11e4-8b1d-56847afe9799 running task
> hello-world.970dbe74-45f2-11e4-8b1d-56847afe9799 on status update for
> terminal task, destroying container: Failed to determine cgroup for the 'cpu'
> subsystem: Failed to read /proc/9792/cgroup: Failed to open file
> '/proc/9792/cgroup': No such file or directory