Jiamin, here it is 5*15 health check timeout and not the reregister use case.
So the question is executor not receiving shutdown message? And also it is getting terminated immediately without any graceful shutdown time? This is mesos 1.1 and in context of custom executor for mesos containerizer. Thx > On Jul 15, 2017, at 9:29 PM, Zhu, Jiamin <jiam...@paypal.com.INVALID> wrote: > > Hi all, > Hope you all doing well! > We got an issue about mesos executor. We shutdown mesos slave for couple of > mins(more than agent_reregister_timeout), mesos slave sent shutdown message > to executor when we brought mesos slave back, but we didn’t see any expected > log print out from executor side. And executor process was killed really > quick after mesos slave became online. > > I would greatly appreciate if you can kindly point us what’s going on. Thx! > > Here are relevant configuration and log > agent_reregister_timeout=75secs > executor_shutdown_grace_period=30secs > checkingpoint is enabled. > > > Log from mesos-slave > I0713 11:24:14.727005 23237 slave.cpp:3421] Re-registering executor > 'compose-paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0' of > framework 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 > I0713 11:24:14.727247 23237 slave.cpp:3634] Handling status update > TASK_RUNNING (UUID: ab4daeff-67f7-11e7-947f-42010ab44890) for task > paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0 of framework > 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 from executor(1)@10.180.72.144:45285 > I0713 11:24:14.728474 23237 memory.cpp:199] Updated > 'memory.soft_limit_in_bytes' to 4352MB for container > c5b88043-6e46-4bde-bb8d-96dbaafe0e99 > I0713 11:24:14.728746 23238 status_update_manager.cpp:323] Received status > update TASK_RUNNING (UUID: ab4daeff-67f7-11e7-947f-42010ab44890) for task > paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0 of framework > 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 > I0713 11:24:14.728801 23238 status_update_manager.cpp:832] Checkpointing > UPDATE for status update TASK_RUNNING (UUID: > ab4daeff-67f7-11e7-947f-42010ab44890) for task > paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0 of framework > 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 > W0713 11:24:14.728997 23238 slave.cpp:3997] Dropping status update > TASK_RUNNING (UUID: ab4daeff-67f7-11e7-947f-42010ab44890) for task > paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0 of framework > 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 sent by status update manager > because the agent is in RECOVERING state > I0713 11:24:14.729033 23238 slave.cpp:3961] Sending acknowledgement for > status update TASK_RUNNING (UUID: ab4daeff-67f7-11e7-947f-42010ab44890) for > task paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0 of > framework 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 to > executor(1)@10.180.72.144:45285 > I0713 11:24:14.729878 23237 cpu.cpp:101] Updated 'cpu.shares' to 1280 (cpus > 1.25) for container c5b88043-6e46-4bde-bb8d-96dbaafe0e99 > I0713 11:24:16.726043 23240 slave.cpp:3574] Cleaning up un-reregistered > executors > I0713 11:24:16.726153 23240 slave.cpp:5281] Finished recovery > I0713 11:24:16.726739 23240 slave.cpp:915] New master detected at > master@10.180.72.104:5050 > I0713 11:24:16.726766 23240 slave.cpp:974] Authenticating with master > master@10.180.72.104:5050 > I0713 11:24:16.726883 23240 slave.cpp:985] Using default CRAM-MD5 > authenticatee > I0713 11:24:16.726934 23240 slave.cpp:947] Detecting new master > I0713 11:24:16.726963 23240 status_update_manager.cpp:177] Pausing sending > status updates > I0713 11:24:16.727108 23240 authenticatee.cpp:97] Initializing client SASL > I0713 11:24:16.728171 23240 authenticatee.cpp:121] Creating new client SASL > connection > I0713 11:24:16.730295 23241 authenticatee.cpp:213] Received SASL > authentication mechanisms: CRAM-MD5 > I0713 11:24:16.730348 23241 authenticatee.cpp:239] Attempting to authenticate > with mechanism 'CRAM-MD5' > I0713 11:24:16.731009 23242 authenticatee.cpp:259] Received SASL > authentication step > I0713 11:24:16.731638 23244 authenticatee.cpp:299] Authentication success > I0713 11:24:16.731791 23243 slave.cpp:1069] Successfully authenticated with > master master@10.180.72.104:5050 > I0713 11:24:16.746140 23242 slave.cpp:1217] Re-registered with master > master@10.180.72.104:5050 > I0713 11:24:16.746206 23242 slave.cpp:1253] Forwarding total oversubscribed > resources {} > I0713 11:24:16.746295 23242 slave.cpp:2536] Shutting down framework > 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 > I0713 11:24:16.746320 23242 slave.cpp:4860] Shutting down executor > 'compose-paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0' of > framework 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 at > executor(1)@10.180.72.144:45285 > W0713 11:24:16.746402 23242 slave.cpp:2676] Ignoring updating pid for > framework 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 because it is terminating > I0713 11:24:16.746435 23242 status_update_manager.cpp:184] Resuming sending > status updates > W0713 11:24:16.746443 23242 status_update_manager.cpp:191] Resending status > update TASK_RUNNING (UUID: ab4daeff-67f7-11e7-947f-42010ab44890) for task > paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0 of framework > 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 > I0713 11:24:16.746513 23242 slave.cpp:4051] Forwarding the update > TASK_RUNNING (UUID: ab4daeff-67f7-11e7-947f-42010ab44890) for task > paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0 of framework > 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 to master@10.180.72.104:5050 > I0713 11:24:16.747684 23244 slave.cpp:4179] Got exited event for > executor(1)@10.180.72.144:45285 > I0713 11:24:16.749622 23240 status_update_manager.cpp:395] Received status > update acknowledgement (UUID: ab4daeff-67f7-11e7-947f-42010ab44890) for task > paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0 of framework > 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 > I0713 11:24:16.749706 23240 status_update_manager.cpp:832] Checkpointing ACK > for status update TASK_RUNNING (UUID: ab4daeff-67f7-11e7-947f-42010ab44890) > for task paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0 of > framework 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 > I0713 11:24:16.831954 23241 containerizer.cpp:2313] Container > c5b88043-6e46-4bde-bb8d-96dbaafe0e99 has exited > I0713 11:24:16.832000 23241 containerizer.cpp:1950] Destroying container > c5b88043-6e46-4bde-bb8d-96dbaafe0e99 in RUNNING state > I0713 11:24:16.832098 23241 linux_launcher.cpp:498] Asked to destroy > container c5b88043-6e46-4bde-bb8d-96dbaafe0e99 > I0713 11:24:16.832731 23241 linux_launcher.cpp:541] Using freezer to destroy > cgroup mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99 > I0713 11:24:16.833940 23241 cgroups.cpp:2705] Freezing cgroup > /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99/5bb4c6ae725bea06daf6faeca2dfec7d058e504505c2bb6b54aba831ce5c3d27 > I0713 11:24:16.833946 23244 cgroups.cpp:2705] Freezing cgroup > /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99/6c0550ac53dc5001953bea3591b1192833591aff59b1cf57e755b926d48cca12 > I0713 11:24:16.834002 23243 cgroups.cpp:2705] Freezing cgroup > /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99/c0a8fa6f2fdaeb2963781ec39b738b665459911f07245c7254dddddbc568251c > I0713 11:24:16.834003 23242 cgroups.cpp:2705] Freezing cgroup > /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99 > I0713 11:24:16.837601 23244 cgroups.cpp:1439] Successfully froze cgroup > /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99/c0a8fa6f2fdaeb2963781ec39b738b665459911f07245c7254dddddbc568251c > after 1.716992ms > I0713 11:24:16.837877 23242 cgroups.cpp:1439] Successfully froze cgroup > /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99 after > 1.916928ms > I0713 11:24:16.839395 23244 cgroups.cpp:2723] Thawing cgroup > /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99/c0a8fa6f2fdaeb2963781ec39b738b665459911f07245c7254dddddbc568251c > I0713 11:24:16.839664 23242 cgroups.cpp:2723] Thawing cgroup > /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99 > I0713 11:24:16.841399 23244 cgroups.cpp:1468] Successfully thawed cgroup > /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99/c0a8fa6f2fdaeb2963781ec39b738b665459911f07245c7254dddddbc568251c > after 1.956864ms > I0713 11:24:16.843209 23242 cgroups.cpp:1468] Successfully thawed cgroup > /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99 after > 3.491072ms > I0713 11:24:16.937427 23240 cgroups.cpp:1439] Successfully froze cgroup > /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99/6c0550ac53dc5001953bea3591b1192833591aff59b1cf57e755b926d48cca12 > after 103.37408ms > I0713 11:24:16.937711 23242 cgroups.cpp:1439] Successfully froze cgroup > /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99/5bb4c6ae725bea06daf6faeca2dfec7d058e504505c2bb6b54aba831ce5c3d27 > after 103.67488ms > I0713 11:24:16.939801 23240 cgroups.cpp:2723] Thawing cgroup > /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99/6c0550ac53dc5001953bea3591b1192833591aff59b1cf57e755b926d48cca12 > I0713 11:24:16.940266 23242 cgroups.cpp:2723] Thawing cgroup > /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99/5bb4c6ae725bea06daf6faeca2dfec7d058e504505c2bb6b54aba831ce5c3d27 > I0713 11:24:16.942387 23240 cgroups.cpp:1468] Successfully thawed cgroup > /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99/6c0550ac53dc5001953bea3591b1192833591aff59b1cf57e755b926d48cca12 > after 2.450176ms > I0713 11:24:16.943276 23242 cgroups.cpp:1468] Successfully thawed cgroup > /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99/5bb4c6ae725bea06daf6faeca2dfec7d058e504505c2bb6b54aba831ce5c3d27 > after 2.972928ms > E0713 11:24:16.986686 23244 slave.cpp:4520] Termination of executor > 'compose-paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0' of > framework 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 failed: Failed to kill > all processes in the container: Failed to kill tasks in nested cgroups: > Collect failed: Failed to kill all processes in cgroup: Failed to read > cgroups control 'cgroup.procs': 'mesos/c5b88043-6e46-4bde-bb8d-9 > 6dbaafe0e99/c0a8fa6f2fdaeb2963781ec39b738b665459911f07245c7254dddddbc568251c' > is not a valid cgroup > I0713 11:24:16.986958 23244 slave.cpp:4646] Cleaning up executor > 'compose-paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0' of > framework 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 at > executor(1)@10.180.72.144:45285 > I0713 11:24:16.987180 23242 gc.cpp:55] Scheduling > '/var/lib/mesos/slaves/3ae40afd-a4c7-4f34-8668-2f513762b2a6-S87/frameworks/3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000/executors/compose-paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0/runs/c5b88043-6e46-4bde-bb8d-96dbaafe0e99' > for gc 6.99998857482667days in the future > I0713 11:24:16.987231 23242 gc.cpp:55] Scheduling > '/var/lib/mesos/slaves/3ae40afd-a4c7-4f34-8668-2f513762b2a6-S87/frameworks/3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000/executors/compose-paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0' > for gc 6.99998857391111days in the future >