[
https://issues.apache.org/jira/browse/MESOS-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357071#comment-14357071
]
Geoffroy Jabouley edited comment on MESOS-2419 at 3/11/15 4:35 PM:
-------------------------------------------------------------------
Hello,
I'm seeing the same behavior on my test cluster, running Mesos 0.21.1 / Marathon
0.8.0 / Docker 1.5.0 on Ubuntu 14.04.2 LTS.
To narrow the issue down, I have a single master and a single slave running on the
same machine, with only one Mesos task running in Docker. Checkpointing is
activated.
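For reference, the single test task is launched through Marathon with an app
definition along these lines (values are illustrative, not my exact definition;
the app simply has to keep a Docker container running long enough to restart the
slave underneath it):
{code}
# Hypothetical Marathon app used for the test; id/resources are examples only.
curl -X POST http://10.195.96.237:8080/v2/apps \
  -H 'Content-Type: application/json' \
  -d '{
        "id": "test-app",
        "instances": 1,
        "cpus": 0.2,
        "mem": 512,
        "cmd": "sleep 3600",
        "container": {
          "type": "DOCKER",
          "docker": { "image": "ubuntu:14.04" }
        }
      }'
{code}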
When I restart the mesos-slave process (sudo restart mesos-slave), recovery of the
running task appears to work: the Docker container is still running, and the Mesos
task sandbox log says:
{code}
I0311 16:33:59.507975 59814 exec.cpp:450] Slave exited, but framework has
checkpointing enabled. Waiting 2hrs to reconnect with slave
20150311-150951-3982541578-5050-50860-S0
I0311 16:33:59.988651 59816 exec.cpp:252] Received reconnect request from slave
20150311-150951-3982541578-5050-50860-S0
I0311 16:34:00.000725 59814 exec.cpp:229] Executor re-registered on slave
20150311-150951-3982541578-5050-50860-S0
{code}
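For completeness, this is roughly how I trigger the restart and check that the
container survived (Ubuntu 14.04 with upstart; the sandbox path assumes the
slave's default work dir /tmp/mesos, visible in the recovery logs below):
{code}
# Restart the slave and verify that the Docker task keeps running.
sudo restart mesos-slave   # upstart restart of the mesos-slave process
docker ps                  # the task's container is still listed as running
tail /tmp/mesos/slaves/*/frameworks/*/executors/*/runs/latest/stdout
{code}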
However, in the mesos-slave logs I get several warnings/errors related to
recovery:
{quote}
I0311 16:06:21.167474 54771 docker.cpp:927] Starting container
'41e12894-611f-4f19-a2e4-721e4ef984c7' for task
'test-app.2d2d5c64-c800-11e4-946b-56847afe9799' (and executor
'test-app.2d2d5c64-c800-11e4-946b-56847afe9799') of framework
'20150311-150951-3982541578-5050-50860-0000'
I0311 16:06:21.561771 54769 docker.cpp:633] Checkpointing pid 54855 to
'/tmp/mesos/meta/slaves/20150311-150951-3982541578-5050-50860-S0/frameworks/20150311-150951-3982541578-5050-50860-0000/executors/test-app.2d2d5c64-c800-11e4-946b-56847afe9799/runs/41e12894-611f-4f19-a2e4-721e4ef984c7/pids/forked.pid'
I0311 16:06:21.567205 54773 slave.cpp:2840] Monitoring executor
'test-app.2d2d5c64-c800-11e4-946b-56847afe9799' of framework
'20150311-150951-3982541578-5050-50860-0000' in container
'41e12894-611f-4f19-a2e4-721e4ef984c7'
I0311 16:06:21.589617 54773 slave.cpp:1860] Got registration for executor
'test-app.2d2d5c64-c800-11e4-946b-56847afe9799' of framework
20150311-150951-3982541578-5050-50860-0000 from executor(1)@10.195.96.237:44967
I0311 16:06:21.592205 54773 slave.cpp:1979] Flushing queued task
test-app.2d2d5c64-c800-11e4-946b-56847afe9799 for executor
'test-app.2d2d5c64-c800-11e4-946b-56847afe9799' of framework
20150311-150951-3982541578-5050-50860-0000
I0311 16:06:21.600999 54768 slave.cpp:2215] Handling status update TASK_RUNNING
(UUID: 4a7375f5-007a-46b2-998c-f584d8cfbf28) for task
test-app.2d2d5c64-c800-11e4-946b-56847afe9799 of framework
20150311-150951-3982541578-5050-50860-0000 from executor(1)@10.195.96.237:44967
I0311 16:06:21.601176 54768 status_update_manager.cpp:317] Received status
update TASK_RUNNING (UUID: 4a7375f5-007a-46b2-998c-f584d8cfbf28) for task
test-app.2d2d5c64-c800-11e4-946b-56847afe9799 of framework
20150311-150951-3982541578-5050-50860-0000
{color:red}I0311 16:06:21.601415 54768 status_update_manager.hpp:346]
Checkpointing UPDATE for status update TASK_RUNNING (UUID:
4a7375f5-007a-46b2-998c-f584d8cfbf28) for task
test-app.2d2d5c64-c800-11e4-946b-56847afe9799 of framework
20150311-150951-3982541578-5050-50860-0000{color}
I0311 16:06:21.606981 54768 slave.cpp:2458] Forwarding the update TASK_RUNNING
(UUID: 4a7375f5-007a-46b2-998c-f584d8cfbf28) for task
test-app.2d2d5c64-c800-11e4-946b-56847afe9799 of framework
20150311-150951-3982541578-5050-50860-0000 to [email protected]:5050
I0311 16:06:21.607081 54768 slave.cpp:2391] Sending acknowledgement for status
update TASK_RUNNING (UUID: 4a7375f5-007a-46b2-998c-f584d8cfbf28) for task
test-app.2d2d5c64-c800-11e4-946b-56847afe9799 of framework
20150311-150951-3982541578-5050-50860-0000 to executor(1)@10.195.96.237:44967
I0311 16:06:21.620201 54770 status_update_manager.cpp:389] Received status
update acknowledgement (UUID: 4a7375f5-007a-46b2-998c-f584d8cfbf28) for task
test-app.2d2d5c64-c800-11e4-946b-56847afe9799 of framework
20150311-150951-3982541578-5050-50860-0000
{color:red}I0311 16:06:21.620247 54770 status_update_manager.hpp:346]
Checkpointing ACK for status update TASK_RUNNING (UUID:
4a7375f5-007a-46b2-998c-f584d8cfbf28) for task
test-app.2d2d5c64-c800-11e4-946b-56847afe9799 of framework
20150311-150951-3982541578-5050-50860-0000{color}
I0311 16:06:21.670255 54770 docker.cpp:1298] Updated 'cpu.shares' to 204 at
/sys/fs/cgroup/cpu/docker/1fea9dcd09193841cdad8c6aa3ff6937d439dade02371defdafdf9c24ae00fa3
for container 41e12894-611f-4f19-a2e4-721e4ef984c7
I0311 16:06:21.671269 54770 docker.cpp:1333] Updated
'memory.soft_limit_in_bytes' to 544MB for container
41e12894-611f-4f19-a2e4-721e4ef984c7
I0311 16:06:21.673033 54770 docker.cpp:1359] Updated 'memory.limit_in_bytes' to
544MB at
/sys/fs/cgroup/memory/docker/1fea9dcd09193841cdad8c6aa3ff6937d439dade02371defdafdf9c24ae00fa3
for container 41e12894-611f-4f19-a2e4-721e4ef984c7
I0311 16:07:06.194358 54773 slave.cpp:3321] Current usage 20.33%. Max allowed
age: 4.876765601677952days
I0311 16:07:12.411705 54770 http.cpp:330] HTTP request for
'/slave(1)/state.json'
I0311 16:08:06.195019 54768 slave.cpp:3321] Current usage 20.63%. Max allowed
age: 4.856005308606180days
I0311 16:09:06.197270 54771 slave.cpp:3321] Current usage 20.98%. Max allowed
age: 4.831071544522141days
I0311 16:10:06.198051 54770 slave.cpp:3321] Current usage 20.98%. Max allowed
age: 4.831323626843079days
I0311 16:11:06.198801 54773 slave.cpp:3321] Current usage 20.99%. Max allowed
age: 4.831041372506898days
{color:red}W0311 16:11:06.198801 54741 logging.cpp:81] RAW: Received signal
SIGTERM from process 1 of user 0; exiting
[DEBUG] /usr/sbin/mesos-slave --master=zk://10.195.96.237:2181/mesos
--checkpoint --containerizers=docker,mesos
--executor_registration_timeout=5mins --hostname=10.195.96.237
--ip=10.195.96.237 --isolation=cgroups/cpu,cgroups/mem --recover=reconnect
--recovery_timeout=120mins --strict
--resources=ports:[31000-32000,80,443]{color}
{color:blue}(USER NOTE: this is the restart){color}
I0311 16:11:16.007556 58147 main.cpp:142] Build: 2015-01-09 02:25:56 by root
I0311 16:11:16.007716 58147 main.cpp:144] Version: 0.21.1
I0311 16:11:16.007725 58147 main.cpp:147] Git tag: 0.21.1
I0311 16:11:16.007732 58147 main.cpp:151] Git SHA:
2ae1ba91e64f92ec71d327e10e6ba9e8ad5477e8
I0311 16:11:16.112568 58147 containerizer.cpp:100] Using isolation:
cgroups/cpu,cgroups/mem
I0311 16:11:16.141803 58147 linux_launcher.cpp:94] Using /sys/fs/cgroup/freezer
as the freezer hierarchy for the Linux launcher
I0311 16:11:16.142117 58147 main.cpp:165] Starting Mesos slave
2015-03-11 16:11:16,142:58147(0x7fd93e63d700):ZOO_INFO@log_env@712: Client
environment:zookeeper.version=zookeeper C client 3.4.5
2015-03-11 16:11:16,142:58147(0x7fd93e63d700):ZOO_INFO@log_env@716: Client
environment:host.name=server-ubuntu-updatecluster
2015-03-11 16:11:16,142:58147(0x7fd93e63d700):ZOO_INFO@log_env@723: Client
environment:os.name=Linux
2015-03-11 16:11:16,142:58147(0x7fd93e63d700):ZOO_INFO@log_env@724: Client
environment:os.arch=3.13.0-46-generic
2015-03-11 16:11:16,142:58147(0x7fd93e63d700):ZOO_INFO@log_env@725: Client
environment:os.version=#75-Ubuntu SMP Tue Feb 10 15:24:04 UTC 2015
I0311 16:11:16.143803 58147 slave.cpp:169] Slave started on
1)@10.195.96.237:5051
I0311 16:11:16.144503 58147 slave.cpp:289] Slave resources:
ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):1979; disk(*):22974
2015-03-11 16:11:16,144:58147(0x7fd93e63d700):ZOO_INFO@log_env@733: Client
environment:user.name=(null)
I0311 16:11:16.144572 58147 slave.cpp:318] Slave hostname: 10.195.96.237
I0311 16:11:16.144585 58147 slave.cpp:319] Slave checkpoint: true
2015-03-11 16:11:16,144:58147(0x7fd93e63d700):ZOO_INFO@log_env@741: Client
environment:user.home=/root
2015-03-11 16:11:16,144:58147(0x7fd93e63d700):ZOO_INFO@log_env@753: Client
environment:user.dir=/
2015-03-11 16:11:16,144:58147(0x7fd93e63d700):ZOO_INFO@zookeeper_init@786:
Initiating client connection, host=10.195.96.237:2181 sessionTimeout=10000
watcher=0x7fd9427fe6a0 sessionId=0 sessionPasswd=<null> context=0x7fd924001180
flags=0
I0311 16:11:16.148352 58181 state.cpp:33] Recovering state from
'/tmp/mesos/meta'
2015-03-11 16:11:16,153:58147(0x7fd93a5fa700):ZOO_INFO@check_events@1703:
initiated connection to server [10.195.96.237:2181]
2015-03-11 16:11:16,153:58147(0x7fd93a5fa700):ZOO_INFO@check_events@1750:
session establishment complete on server [10.195.96.237:2181],
sessionId=0x14c092b886e000c, negotiated timeout=10000
I0311 16:11:16.153558 58177 group.cpp:313] Group process
(group(1)@10.195.96.237:5051) connected to ZooKeeper
I0311 16:11:16.153606 58177 group.cpp:790] Syncing group operations: queue size
(joins, cancels, datas) = (0, 0, 0)
I0311 16:11:16.153625 58177 group.cpp:385] Trying to create path '/mesos' in
ZooKeeper
I0311 16:11:16.158941 58180 slave.cpp:3548] Recovering framework
20150311-150951-3982541578-5050-50860-0000
I0311 16:11:16.158977 58177 detector.cpp:138] Detected a new leader: (id='1')
I0311 16:11:16.159014 58180 slave.cpp:4116] Recovering executor
'test-app.753099d3-c7fd-11e4-946b-56847afe9799' of framework
20150311-150951-3982541578-5050-50860-0000
I0311 16:11:16.159150 58177 group.cpp:659] Trying to get
'/mesos/info_0000000001' in ZooKeeper
I0311 16:11:16.159304 58178 gc.cpp:56] Scheduling
'/tmp/mesos/slaves/20150311-150951-3982541578-5050-50860-S0/frameworks/20150311-150951-3982541578-5050-50860-0000/executors/test-app.753099d3-c7fd-11e4-946b-56847afe9799/runs/43014fa8-3304-42d9-9a61-036f67f0ab37'
for gc 6.99518334248889days in the future
I0311 16:11:16.159323 58180 slave.cpp:4116] Recovering executor
'test-app.2d2d5c64-c800-11e4-946b-56847afe9799' of framework
20150311-150951-3982541578-5050-50860-0000
I0311 16:11:16.159368 58178 gc.cpp:56] Scheduling
'/tmp/mesos/meta/slaves/20150311-150951-3982541578-5050-50860-S0/frameworks/20150311-150951-3982541578-5050-50860-0000/executors/test-app.753099d3-c7fd-11e4-946b-56847afe9799/runs/43014fa8-3304-42d9-9a61-036f67f0ab37'
for gc 6.99518334201481days in the future
I0311 16:11:16.159409 58178 gc.cpp:56] Scheduling
'/tmp/mesos/slaves/20150311-150951-3982541578-5050-50860-S0/frameworks/20150311-150951-3982541578-5050-50860-0000/executors/test-app.753099d3-c7fd-11e4-946b-56847afe9799'
for gc 6.99518334167704days in the future
I0311 16:11:16.159430 58178 gc.cpp:56] Scheduling
'/tmp/mesos/meta/slaves/20150311-150951-3982541578-5050-50860-S0/frameworks/20150311-150951-3982541578-5050-50860-0000/executors/test-app.753099d3-c7fd-11e4-946b-56847afe9799'
for gc 6.99518334146963days in the future
I0311 16:11:16.159878 58178 status_update_manager.cpp:197] Recovering status
update manager
I0311 16:11:16.159903 58178 status_update_manager.cpp:205] Recovering executor
'test-app.753099d3-c7fd-11e4-946b-56847afe9799' of framework
20150311-150951-3982541578-5050-50860-0000
I0311 16:11:16.159957 58178 status_update_manager.cpp:205] Recovering executor
'test-app.2d2d5c64-c800-11e4-946b-56847afe9799' of framework
20150311-150951-3982541578-5050-50860-0000
I0311 16:11:16.160738 58176 docker.cpp:767] Recovering Docker containers
I0311 16:11:16.160851 58176 docker.cpp:814] Recovering container
'41e12894-611f-4f19-a2e4-721e4ef984c7' for executor
'test-app.2d2d5c64-c800-11e4-946b-56847afe9799' of framework
20150311-150951-3982541578-5050-50860-0000
I0311 16:11:16.161068 58178 containerizer.cpp:281] Recovering containerizer
I0311 16:11:16.161124 58178 containerizer.cpp:323] Recovering container
'41e12894-611f-4f19-a2e4-721e4ef984c7' for executor
'test-app.2d2d5c64-c800-11e4-946b-56847afe9799' of framework
20150311-150951-3982541578-5050-50860-0000
I0311 16:11:16.163920 58177 detector.cpp:433] A new leading master
([email protected]:5050) is detected
{color:red}I0311 16:11:16.174446 58178 linux_launcher.cpp:169] Couldn't find
freezer cgroup for container 41e12894-611f-4f19-a2e4-721e4ef984c7, assuming
already destroyed
W0311 16:11:16.175457 58178 cpushare.cpp:203] Couldn't find cgroup for
container 41e12894-611f-4f19-a2e4-721e4ef984c7
E0311 16:11:16.177146 58178 containerizer.cpp:1140] Error in a resource
limitation for container 41e12894-611f-4f19-a2e4-721e4ef984c7: Unknown container
I0311 16:11:16.177176 58178 containerizer.cpp:946] Destroying container
'41e12894-611f-4f19-a2e4-721e4ef984c7'
E0311 16:11:16.177522 58178 containerizer.cpp:1140] Error in a resource
limitation for container 41e12894-611f-4f19-a2e4-721e4ef984c7: Unknown
container{color}
I0311 16:11:16.319142 58175 slave.cpp:3407] Sending reconnect request to
executor test-app.2d2d5c64-c800-11e4-946b-56847afe9799 of framework
20150311-150951-3982541578-5050-50860-0000 at executor(1)@10.195.96.237:44967
I0311 16:11:16.322082 58179 slave.cpp:2039] Re-registering executor
test-app.2d2d5c64-c800-11e4-946b-56847afe9799 of framework
20150311-150951-3982541578-5050-50860-0000
W0311 16:11:17.323957 58179 containerizer.cpp:888] Skipping resource statistic
for container 41e12894-611f-4f19-a2e4-721e4ef984c7 because: Unknown container
W0311 16:11:17.324007 58179 containerizer.cpp:888] Skipping resource statistic
for container 41e12894-611f-4f19-a2e4-721e4ef984c7 because: Unknown container
I0311 16:11:18.320060 58179 slave.cpp:2164] Cleaning up un-reregistered
executors
I0311 16:11:18.320137 58179 slave.cpp:3466] Finished recovery
I0311 16:11:18.320715 58179 slave.cpp:602] New master detected at
[email protected]:5050
I0311 16:11:18.320878 58179 slave.cpp:627] No credentials provided. Attempting
to register without authentication
I0311 16:11:18.320901 58179 slave.cpp:638] Detecting new master
I0311 16:11:18.320925 58179 status_update_manager.cpp:171] Pausing sending
status updates
W0311 16:11:18.325335 58179 containerizer.cpp:888] Skipping resource statistic
for container 41e12894-611f-4f19-a2e4-721e4ef984c7 because: Unknown container
W0311 16:11:18.325383 58179 containerizer.cpp:888] Skipping resource statistic
for container 41e12894-611f-4f19-a2e4-721e4ef984c7 because: Unknown container
I0311 16:11:19.072113 58174 slave.cpp:824] Re-registered with master
[email protected]:5050
I0311 16:11:19.072285 58174 slave.cpp:1710] Updating framework
20150311-150951-3982541578-5050-50860-0000 pid to
[email protected]:39327
I0311 16:11:19.072485 58177 status_update_manager.cpp:178] Resuming sending
status updates
I0311 16:11:19.073189 58174 status_update_manager.cpp:178] Resuming sending
status updates
{color:red}W0311 16:11:19.326741 58174 containerizer.cpp:888] Skipping resource
statistic for container 41e12894-611f-4f19-a2e4-721e4ef984c7 because: Unknown
container
W0311 16:11:19.326815 58174 containerizer.cpp:888] Skipping resource statistic
for container 41e12894-611f-4f19-a2e4-721e4ef984c7 because: Unknown container
W0311 16:11:20.327874 58179 containerizer.cpp:888] Skipping resource statistic
for container 41e12894-611f-4f19-a2e4-721e4ef984c7 because: Unknown container
W0311 16:11:20.327947 58179 containerizer.cpp:888] Skipping resource statistic
for container 41e12894-611f-4f19-a2e4-721e4ef984c7 because: Unknown container
W0311 16:11:21.329118 58177 containerizer.cpp:888] Skipping resource statistic
for container 41e12894-611f-4f19-a2e4-721e4ef984c7 because: Unknown
container{color}
{color:blue}(USER NOTE: this line is printed every second in the log, quickly
increasing disk usage){color}
{quote}
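To give an idea of the growth rate, a rough count of the repeated warning over one
minute (the log path is an assumption for an upstart-managed slave; adjust to
wherever mesos-slave's stderr actually goes):
{code}
# Count the spammed warning, wait a minute, count again.
grep -c 'Skipping resource statistic' /var/log/upstart/mesos-slave.log
sleep 60
grep -c 'Skipping resource statistic' /var/log/upstart/mesos-slave.log
{code}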
> Slave recovery not recovering tasks
> -----------------------------------
>
> Key: MESOS-2419
> URL: https://issues.apache.org/jira/browse/MESOS-2419
> Project: Mesos
> Issue Type: Bug
> Components: slave
> Affects Versions: 0.22.0, 0.23.0
> Reporter: Brenden Matthews
> Assignee: Joerg Schad
> Attachments: mesos-chronos.log.gz, mesos.log.gz
>
>
> In a recent build from master (updated yesterday), slave recovery appears to
> have broken.
> I'll attach the slave log (with GLOG_v=1) showing a task called
> `long-running-job` which is a Chronos job that just does `sleep 1h`. After
> restarting the slave, the task shows as `TASK_FAILED`.
> Here's another case, which is for a docker task:
> {noformat}
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.247159 10022 docker.cpp:421] Recovering Docker containers
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.247207 10022 docker.cpp:468] Recovering container
> 'f2001064-e076-4978-b764-ed12a5244e78' for executor
> 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework
> 20150226-230228-2931198986-5050-717-0000
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.254791 10022 docker.cpp:1333] Executor for container
> 'f2001064-e076-4978-b764-ed12a5244e78' has exited
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.254812 10022 docker.cpp:1159] Destroying container
> 'f2001064-e076-4978-b764-ed12a5244e78'
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.254844 10022 docker.cpp:1248] Running docker stop on container
> 'f2001064-e076-4978-b764-ed12a5244e78'
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.262481 10027 containerizer.cpp:310] Recovering containerizer
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.262565 10027 containerizer.cpp:353] Recovering container
> 'f2001064-e076-4978-b764-ed12a5244e78' for executor
> 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework
> 20150226-230228-2931198986-5050-717-0000
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.263675 10027 linux_launcher.cpp:162] Couldn't find freezer cgroup
> for container f2001064-e076-4978-b764-ed12a5244e78, assuming already destroyed
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227
> 00:09:49.265467 10020 cpushare.cpp:199] Couldn't find cgroup for container
> f2001064-e076-4978-b764-ed12a5244e78
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.266448 10022 containerizer.cpp:1147] Executor for container
> 'f2001064-e076-4978-b764-ed12a5244e78' has exited
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.266466 10022 containerizer.cpp:938] Destroying container
> 'f2001064-e076-4978-b764-ed12a5244e78'
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:50.593585 10021 slave.cpp:3735] Sending reconnect request to executor
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework
> 20150226-230228-2931198986-5050-717-0000 at executor(1)@10.81.189.232:43130
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227
> 00:09:50.597843 10024 slave.cpp:3175] Termination of executor
> 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework
> '20150226-230228-2931198986-5050-717-0000' failed: Container
> 'f2001064-e076-4978-b764-ed12a5244e78' not found
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227
> 00:09:50.597949 10025 slave.cpp:3429] Failed to unmonitor container for
> executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework
> 20150226-230228-2931198986-5050-717-0000: Not monitored
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:50.598785 10024 slave.cpp:2508] Handling status update TASK_FAILED
> (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework
> 20150226-230228-2931198986-5050-717-0000 from @0.0.0.0:0
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227
> 00:09:50.599093 10023 slave.cpp:2637] Failed to update resources for
> container f2001064-e076-4978-b764-ed12a5244e78 of executor
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 running task
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 on status update for terminal
> task, destroying container: Container 'f2001064-e076-4978-b764-ed12a5244e78'
> not found
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227
> 00:09:50.599148 10024 composing.cpp:513] Container
> 'f2001064-e076-4978-b764-ed12a5244e78' not found
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:50.599220 10024 status_update_manager.cpp:317] Received status update
> TASK_FAILED (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework
> 20150226-230228-2931198986-5050-717-0000
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:50.599256 10024 status_update_manager.hpp:346] Checkpointing UPDATE for
> status update TASK_FAILED (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for
> task chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework
> 20150226-230228-2931198986-5050-717-0000
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227
> 00:09:50.607086 10022 slave.cpp:2706] Dropping status update TASK_FAILED
> (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework
> 20150226-230228-2931198986-5050-717-0000 sent by status update manager
> because the slave is in RECOVERING state
> Feb 27 00:09:52 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:52.594267 10021 slave.cpp:2457] Cleaning up un-reregistered executors
> Feb 27 00:09:52 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:52.594379 10021 slave.cpp:3794] Finished recovery
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)