Geoffroy Jabouley created MESOS-2508:
----------------------------------------
Summary: Slave recovering a docker container results in "Unknow
container" error
Key: MESOS-2508
URL: https://issues.apache.org/jira/browse/MESOS-2508
Project: Mesos
Issue Type: Bug
Components: containerization, docker, slave
Affects Versions: 0.21.1
Environment: Ubuntu 14.04.2 LTS
Docker 1.5.0 (same error with 1.4.1)
Mesos 0.21.1 installed from mesosphere ubuntu repo
Marathon 0.8.0 installed from mesosphere ubuntu repo
Reporter: Geoffroy Jabouley
I'm seeing some error logs occuring during a slave recovery of a Mesos task
running into a docker container.
It does not impede slave recovery process, as the mesos task is still active
and running on the slave afeter the recovery.
But there is something not working properly when the slave is recovering my
docker container. The slave detects my container as an "Unknown container"
Cluster status:
- 1 mesos-master, 1 mesos-slave, 1 marathon framework running on the host.
- checkpointing is activated on both slave and framework
- use native docker containerizer
- 1 mesos task, started using marathon, is running inside a docker container
and is monitored by the mesos-slave
Action:
- restart the mesos-slave process (sudo restart mesos-slave)
Expected:
- docker container still running
- mesos task still running
- no error in the mesos slave log regarding recovery process
Seen:
- docker container still running
- mesos task still running
- {color:red}Several errors *Unknown container* in the mesos slave log during
recovery process{color}
-----------------------------------
For what it forth, here are my investigations:
1) The mesos task starts fine in the docker container
*e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330*. Docker
container name is set to *mesos-adb71dc4-c07d-42a9-8fed-264c241668ad* by Mesos
docker containerizer _i guess_...
{code}
I0317 09:56:14.300439 2784 slave.cpp:1083] Got assigned task
test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 for framework
20150311-150951-3982541578-5050-50860-0000
I0317 09:56:14.380702 2784 slave.cpp:1193] Launching task
test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 for framework
20150311-150951-3982541578-5050-50860-0000
I0317 09:56:14.384466 2784 slave.cpp:3997] Launching executor
test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework
20150311-150951-3982541578-5050-50860-0000 in work directory
'/tmp/mesos/slaves/20150312-145235-3982541578-5050-1421-S0/frameworks/20150311-150951-3982541578-5050-50860-0000/executors/test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799/runs/adb71dc4-c07d-42a9-8fed-264c241668ad'
I0317 09:56:14.390207 2784 slave.cpp:1316] Queuing task
'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' for executor
test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework
'20150311-150951-3982541578-5050-50860-0000
I0317 09:56:14.421787 2782 docker.cpp:927] Starting container
'adb71dc4-c07d-42a9-8fed-264c241668ad' for task
'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' (and executor
'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799') of framework
'20150311-150951-3982541578-5050-50860-0000'
I0317 09:56:15.784143 2781 docker.cpp:633] Checkpointing pid 27080 to
'/tmp/mesos/meta/slaves/20150312-145235-3982541578-5050-1421-S0/frameworks/20150311-150951-3982541578-5050-50860-0000/executors/test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799/runs/adb71dc4-c07d-42a9-8fed-264c241668ad/pids/forked.pid'
I0317 09:56:15.789443 2784 slave.cpp:2840] Monitoring executor
'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework
'20150311-150951-3982541578-5050-50860-0000' in container
'adb71dc4-c07d-42a9-8fed-264c241668ad'
I0317 09:56:15.862642 2784 slave.cpp:1860] Got registration for executor
'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework
20150311-150951-3982541578-5050-50860-0000 from executor(1)@10.195.96.237:36021
I0317 09:56:15.865319 2784 slave.cpp:1979] Flushing queued task
test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 for executor
'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework
20150311-150951-3982541578-5050-50860-0000
I0317 09:56:15.885414 2787 slave.cpp:2215] Handling status update TASK_RUNNING
(UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for task
test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework
20150311-150951-3982541578-5050-50860-0000 from executor(1)@10.195.96.237:36021
I0317 09:56:15.885902 2787 status_update_manager.cpp:317] Received status
update TASK_RUNNING (UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for task
test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework
20150311-150951-3982541578-5050-50860-0000
I0317 09:56:15.886327 2787 status_update_manager.hpp:346] Checkpointing UPDATE
for status update TASK_RUNNING (UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for
task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework
20150311-150951-3982541578-5050-50860-0000
I0317 09:56:15.904325 2785 docker.cpp:1298] Updated 'cpu.shares' to 204 at
/sys/fs/cgroup/cpu/docker/e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330
for container adb71dc4-c07d-42a9-8fed-264c241668ad
I0317 09:56:15.904831 2785 docker.cpp:1333] Updated
'memory.soft_limit_in_bytes' to 96MB for container
adb71dc4-c07d-42a9-8fed-264c241668ad
I0317 09:56:15.905591 2785 docker.cpp:1359] Updated 'memory.limit_in_bytes' to
96MB at
/sys/fs/cgroup/memory/docker/e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330
for container adb71dc4-c07d-42a9-8fed-264c241668ad
I0317 09:56:16.073200 2787 slave.cpp:2458] Forwarding the update TASK_RUNNING
(UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for task
test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework
20150311-150951-3982541578-5050-50860-0000 to [email protected]:5050
I0317 09:56:16.073386 2787 slave.cpp:2391] Sending acknowledgement for status
update TASK_RUNNING (UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for task
test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework
20150311-150951-3982541578-5050-50860-0000 to executor(1)@10.195.96.237:36021
I0317 09:56:16.226022 2788 status_update_manager.cpp:389] Received status
update acknowledgement (UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for task
test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework
20150311-150951-3982541578-5050-50860-0000
I0317 09:56:16.226093 2788 status_update_manager.hpp:346] Checkpointing ACK
for status update TASK_RUNNING (UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for
task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework
20150311-150951-3982541578-5050-50860-0000
{code}
{code}
$ docker ps --no-trunc
CONTAINER ID IMAGE
COMMAND CREATED
STATUS PORTS NAMES
e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330
ecpff/ecp_base:latest "/bin/sh -c 'env && ls -Ral /mnt && sleep 3600s'" 11
minutes ago Up 11 minutes
mesos-adb71dc4-c07d-42a9-8fed-264c241668ad
{code}
2) I restart the mesos slave process
{code}
$ sudo restart mesos-slave
{code}
{code}
W0317 09:56:21.672792 2754 logging.cpp:81] RAW: Received signal SIGTERM from
process 1 of user 0; exiting
[DEBUG] /usr/sbin/mesos-slave --master=zk://10.195.96.237:2181/mesos
--checkpoint --containerizers=docker,mesos
--executor_registration_timeout=5mins --hostname=10.195.96.237
--ip=10.195.96.237 --isolation=cgroups/cpu,cgroups/mem --recover=reconnect
--recovery_timeout=120mins --strict --resources=ports:[31000-32000,80,443]
I0317 09:56:48.411228 27147 main.cpp:142] Build: 2015-01-09 02:25:56 by root
I0317 09:56:48.411418 27147 main.cpp:144] Version: 0.21.1
I0317 09:56:48.411427 27147 main.cpp:147] Git tag: 0.21.1
I0317 09:56:48.411434 27147 main.cpp:151] Git SHA:
2ae1ba91e64f92ec71d327e10e6ba9e8ad5477e8
I0317 09:56:48.521886 27147 containerizer.cpp:100] Using isolation:
cgroups/cpu,cgroups/mem
I0317 09:56:48.570104 27147 linux_launcher.cpp:94] Using /sys/fs/cgroup/freezer
as the freezer hierarchy for the Linux launcher
I0317 09:56:48.577716 27147 main.cpp:165] Starting Mesos slave
2015-03-17 09:56:48,578:27147(0x7f45031d5700):ZOO_INFO@log_env@712: Client
environment:zookeeper.version=zookeeper C client 3.4.5
2015-03-17 09:56:48,578:27147(0x7f45031d5700):ZOO_INFO@log_env@716: Client
environment:host.name=server-ubuntu-updatecluster
2015-03-17 09:56:48,578:27147(0x7f45031d5700):ZOO_INFO@log_env@723: Client
environment:os.name=Linux
2015-03-17 09:56:48,578:27147(0x7f45031d5700):ZOO_INFO@log_env@724: Client
environment:os.arch=3.13.0-46-generic
2015-03-17 09:56:48,578:27147(0x7f45031d5700):ZOO_INFO@log_env@725: Client
environment:os.version=#75-Ubuntu SMP Tue Feb 10 15:24:04 UTC 2015
I0317 09:56:48.580657 27147 slave.cpp:169] Slave started on
1)@10.195.96.237:5051
2015-03-17 09:56:48,581:27147(0x7f45031d5700):ZOO_INFO@log_env@733: Client
environment:user.name=(null)
2015-03-17 09:56:48,581:27147(0x7f45031d5700):ZOO_INFO@log_env@741: Client
environment:user.home=/root
2015-03-17 09:56:48,581:27147(0x7f45031d5700):ZOO_INFO@log_env@753: Client
environment:user.dir=/
2015-03-17 09:56:48,581:27147(0x7f45031d5700):ZOO_INFO@zookeeper_init@786:
Initiating client connection, host=10.195.96.237:2181 sessionTimeout=10000
watcher=0x7f450939a6a0 sessionId=0 sessionPasswd=<null> context=0x7f44f0000b70
flags=0
I0317 09:56:48.581478 27147 slave.cpp:289] Slave resources:
ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):1979; disk(*):22974
I0317 09:56:48.581533 27147 slave.cpp:318] Slave hostname: 10.195.96.237
I0317 09:56:48.581548 27147 slave.cpp:319] Slave checkpoint: true
I0317 09:56:48.583925 27175 state.cpp:33] Recovering state from
'/tmp/mesos/meta'
2015-03-17 09:56:48,586:27147(0x7f4501196700):ZOO_INFO@check_events@1703:
initiated connection to server [10.195.96.237:2181]
2015-03-17 09:56:48,589:27147(0x7f4501196700):ZOO_INFO@check_events@1750:
session establishment complete on server [10.195.96.237:2181],
sessionId=0x14c0e422f7f0009, negotiated timeout=10000
I0317 09:56:48.589994 27181 group.cpp:313] Group process
(group(1)@10.195.96.237:5051) connected to ZooKeeper
I0317 09:56:48.590104 27181 group.cpp:790] Syncing group operations: queue size
(joins, cancels, datas) = (0, 0, 0)
I0317 09:56:48.590137 27181 group.cpp:385] Trying to create path '/mesos' in
ZooKeeper
I0317 09:56:48.592377 27181 detector.cpp:138] Detected a new leader: (id='2')
I0317 09:56:48.592711 27181 group.cpp:659] Trying to get
'/mesos/info_0000000002' in ZooKeeper
I0317 09:56:48.593904 27181 detector.cpp:433] A new leading master
([email protected]:5050) is detected
{code}
3) Mesos slave tries to recover the docker container
*adb71dc4-c07d-42a9-8fed-264c241668ad*. Shouldn't be container
*e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330* ?
{code}
I0317 09:56:48.592711 27181 group.cpp:659] Trying to get
'/mesos/info_0000000002' in ZooKeeper
I0317 09:56:48.593904 27181 detector.cpp:433] A new leading master
([email protected]:5050) is detected
I0317 09:56:48.613559 27175 slave.cpp:3548] Recovering framework
20150311-150951-3982541578-5050-50860-0000
I0317 09:56:48.613596 27175 slave.cpp:4116] Recovering executor
'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework
20150311-150951-3982541578-5050-50860-0000
I0317 09:56:48.613919 27181 status_update_manager.cpp:197] Recovering status
update manager
I0317 09:56:48.613945 27181 status_update_manager.cpp:205] Recovering executor
'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework
20150311-150951-3982541578-5050-50860-0000
I0317 09:56:48.614933 27181 docker.cpp:767] Recovering Docker containers
I0317 09:56:48.614982 27181 docker.cpp:814] Recovering container
'adb71dc4-c07d-42a9-8fed-264c241668ad' for executor
'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework
20150311-150951-3982541578-5050-50860-0000
I0317 09:56:48.616395 27174 containerizer.cpp:281] Recovering containerizer
I0317 09:56:48.616457 27174 containerizer.cpp:323] Recovering container
'adb71dc4-c07d-42a9-8fed-264c241668ad' for executor
'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework
20150311-150951-3982541578-5050-50860-0000
{code}
4) It fails because it *does not find any cgroup for container
adb71dc4-c07d-42a9-8fed-264c241668ad*.
Again is the "adb71dc4-c07d-42a9-8fed-264c241668ad" id something internal to
Mesos?
The docker container has an id of
*e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330*, and for
example
_/sys/fs/cgroup/memory/docker/e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330_
and
_/sys/fs/cgroup/freezer/docker/e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330_
are presents on the file system.
{code}
I0317 09:56:48.616981 27174 linux_launcher.cpp:169] Couldn't find freezer
cgroup for container adb71dc4-c07d-42a9-8fed-264c241668ad, assuming already
destroyed
W0317 09:56:48.634775 27175 cpushare.cpp:203] Couldn't find cgroup for
container adb71dc4-c07d-42a9-8fed-264c241668ad
E0317 09:56:48.636764 27180 containerizer.cpp:1140] Error in a resource
limitation for container adb71dc4-c07d-42a9-8fed-264c241668ad: Unknown container
I0317 09:56:48.636808 27180 containerizer.cpp:946] Destroying container
'adb71dc4-c07d-42a9-8fed-264c241668ad'
E0317 09:56:48.640025 27175 containerizer.cpp:1140] Error in a resource
limitation for container adb71dc4-c07d-42a9-8fed-264c241668ad: Unknown container
I0317 09:56:48.823640 27177 slave.cpp:3407] Sending reconnect request to
executor test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework
20150311-150951-3982541578-5050-50860-0000 at executor(1)@10.195.96.237:36021
I0317 09:56:48.826930 27178 slave.cpp:2039] Re-registering executor
test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework
20150311-150951-3982541578-5050-50860-0000
W0317 09:56:49.829905 27181 containerizer.cpp:888] Skipping resource statistic
for container adb71dc4-c07d-42a9-8fed-264c241668ad because: Unknown container
W0317 09:56:49.830080 27181 containerizer.cpp:888] Skipping resource statistic
for container adb71dc4-c07d-42a9-8fed-264c241668ad because: Unknown container
I0317 09:56:50.824409 27178 slave.cpp:2164] Cleaning up un-reregistered
executors
I0317 09:56:50.824524 27178 slave.cpp:3466] Finished recovery
{code}
5) Loop error message on the slave log, spamming the log file
{code}
W0317 09:56:50.830857 27175 containerizer.cpp:888] Skipping resource statistic
for container adb71dc4-c07d-42a9-8fed-264c241668ad because: Unknown container
W0317 09:56:50.830904 27175 containerizer.cpp:888] Skipping resource statistic
for container adb71dc4-c07d-42a9-8fed-264c241668ad because: Unknown container
W0317 09:56:51.832145 27177 containerizer.cpp:888] Skipping resource statistic
for container adb71dc4-c07d-42a9-8fed-264c241668ad because: Unknown container
{code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)