Correct some configuration. agent_ping_timeout * max_agent_ping_timeouts = 75secs
Thx, Jiamin From: "Zhu, Jiamin" <jiam...@paypal.com> Date: Saturday, July 15, 2017 at 9:29 PM To: dev mesos <dev@mesos.apache.org> Cc: Meghdoot Bhattacharya <mbhattacha...@paypal.com> Subject: question Hi all, Hope you all doing well! We got an issue about mesos executor. We shutdown mesos slave for couple of mins(more than agent_reregister_timeout), mesos slave sent shutdown message to executor when we brought mesos slave back, but we didn’t see any expected log print out from executor side. And executor process was killed really quick after mesos slave became online. I would greatly appreciate if you can kindly point us what’s going on. Thx! Here are relevant configuration and log agent_reregister_timeout=75secs executor_shutdown_grace_period=30secs checkingpoint is enabled. Log from mesos-slave I0713 11:24:14.727005 23237 slave.cpp:3421] Re-registering executor 'compose-paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0' of framework 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 I0713 11:24:14.727247 23237 slave.cpp:3634] Handling status update TASK_RUNNING (UUID: ab4daeff-67f7-11e7-947f-42010ab44890) for task paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0 of framework 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 from executor(1)@10.180.72.144:45285 I0713 11:24:14.728474 23237 memory.cpp:199] Updated 'memory.soft_limit_in_bytes' to 4352MB for container c5b88043-6e46-4bde-bb8d-96dbaafe0e99 I0713 11:24:14.728746 23238 status_update_manager.cpp:323] Received status update TASK_RUNNING (UUID: ab4daeff-67f7-11e7-947f-42010ab44890) for task paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0 of framework 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 I0713 11:24:14.728801 23238 status_update_manager.cpp:832] Checkpointing UPDATE for status update TASK_RUNNING (UUID: ab4daeff-67f7-11e7-947f-42010ab44890) for task paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0 of framework 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 W0713 11:24:14.728997 23238 slave.cpp:3997] Dropping status update TASK_RUNNING (UUID: ab4daeff-67f7-11e7-947f-42010ab44890) for task paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0 of framework 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 sent by status update manager because the agent is in RECOVERING state I0713 11:24:14.729033 23238 slave.cpp:3961] Sending acknowledgement for status update TASK_RUNNING (UUID: ab4daeff-67f7-11e7-947f-42010ab44890) for task paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0 of framework 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 to executor(1)@10.180.72.144:45285 I0713 11:24:14.729878 23237 cpu.cpp:101] Updated 'cpu.shares' to 1280 (cpus 1.25) for container c5b88043-6e46-4bde-bb8d-96dbaafe0e99 I0713 11:24:16.726043 23240 slave.cpp:3574] Cleaning up un-reregistered executors I0713 11:24:16.726153 23240 slave.cpp:5281] Finished recovery I0713 11:24:16.726739 23240 slave.cpp:915] New master detected at master@10.180.72.104:5050 I0713 11:24:16.726766 23240 slave.cpp:974] Authenticating with master master@10.180.72.104:5050 I0713 11:24:16.726883 23240 slave.cpp:985] Using default CRAM-MD5 authenticatee I0713 11:24:16.726934 23240 slave.cpp:947] Detecting new master I0713 11:24:16.726963 23240 status_update_manager.cpp:177] Pausing sending status updates I0713 11:24:16.727108 23240 authenticatee.cpp:97] Initializing client SASL I0713 11:24:16.728171 23240 authenticatee.cpp:121] Creating new client SASL connection I0713 11:24:16.730295 23241 authenticatee.cpp:213] Received SASL authentication mechanisms: CRAM-MD5 I0713 11:24:16.730348 23241 authenticatee.cpp:239] Attempting to authenticate with mechanism 'CRAM-MD5' I0713 11:24:16.731009 23242 authenticatee.cpp:259] Received SASL authentication step I0713 11:24:16.731638 23244 authenticatee.cpp:299] Authentication success I0713 11:24:16.731791 23243 slave.cpp:1069] Successfully authenticated with master master@10.180.72.104:5050 I0713 11:24:16.746140 23242 slave.cpp:1217] Re-registered with master master@10.180.72.104:5050 I0713 11:24:16.746206 23242 slave.cpp:1253] Forwarding total oversubscribed resources {} I0713 11:24:16.746295 23242 slave.cpp:2536] Shutting down framework 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 I0713 11:24:16.746320 23242 slave.cpp:4860] Shutting down executor 'compose-paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0' of framework 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 at executor(1)@10.180.72.144:45285 W0713 11:24:16.746402 23242 slave.cpp:2676] Ignoring updating pid for framework 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 because it is terminating I0713 11:24:16.746435 23242 status_update_manager.cpp:184] Resuming sending status updates W0713 11:24:16.746443 23242 status_update_manager.cpp:191] Resending status update TASK_RUNNING (UUID: ab4daeff-67f7-11e7-947f-42010ab44890) for task paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0 of framework 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 I0713 11:24:16.746513 23242 slave.cpp:4051] Forwarding the update TASK_RUNNING (UUID: ab4daeff-67f7-11e7-947f-42010ab44890) for task paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0 of framework 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 to master@10.180.72.104:5050 I0713 11:24:16.747684 23244 slave.cpp:4179] Got exited event for executor(1)@10.180.72.144:45285 I0713 11:24:16.749622 23240 status_update_manager.cpp:395] Received status update acknowledgement (UUID: ab4daeff-67f7-11e7-947f-42010ab44890) for task paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0 of framework 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 I0713 11:24:16.749706 23240 status_update_manager.cpp:832] Checkpointing ACK for status update TASK_RUNNING (UUID: ab4daeff-67f7-11e7-947f-42010ab44890) for task paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0 of framework 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 I0713 11:24:16.831954 23241 containerizer.cpp:2313] Container c5b88043-6e46-4bde-bb8d-96dbaafe0e99 has exited I0713 11:24:16.832000 23241 containerizer.cpp:1950] Destroying container c5b88043-6e46-4bde-bb8d-96dbaafe0e99 in RUNNING state I0713 11:24:16.832098 23241 linux_launcher.cpp:498] Asked to destroy container c5b88043-6e46-4bde-bb8d-96dbaafe0e99 I0713 11:24:16.832731 23241 linux_launcher.cpp:541] Using freezer to destroy cgroup mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99 I0713 11:24:16.833940 23241 cgroups.cpp:2705] Freezing cgroup /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99/5bb4c6ae725bea06daf6faeca2dfec7d058e504505c2bb6b54aba831ce5c3d27 I0713 11:24:16.833946 23244 cgroups.cpp:2705] Freezing cgroup /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99/6c0550ac53dc5001953bea3591b1192833591aff59b1cf57e755b926d48cca12 I0713 11:24:16.834002 23243 cgroups.cpp:2705] Freezing cgroup /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99/c0a8fa6f2fdaeb2963781ec39b738b665459911f07245c7254dddddbc568251c I0713 11:24:16.834003 23242 cgroups.cpp:2705] Freezing cgroup /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99 I0713 11:24:16.837601 23244 cgroups.cpp:1439] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99/c0a8fa6f2fdaeb2963781ec39b738b665459911f07245c7254dddddbc568251c after 1.716992ms I0713 11:24:16.837877 23242 cgroups.cpp:1439] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99 after 1.916928ms I0713 11:24:16.839395 23244 cgroups.cpp:2723] Thawing cgroup /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99/c0a8fa6f2fdaeb2963781ec39b738b665459911f07245c7254dddddbc568251c I0713 11:24:16.839664 23242 cgroups.cpp:2723] Thawing cgroup /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99 I0713 11:24:16.841399 23244 cgroups.cpp:1468] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99/c0a8fa6f2fdaeb2963781ec39b738b665459911f07245c7254dddddbc568251c after 1.956864ms I0713 11:24:16.843209 23242 cgroups.cpp:1468] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99 after 3.491072ms I0713 11:24:16.937427 23240 cgroups.cpp:1439] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99/6c0550ac53dc5001953bea3591b1192833591aff59b1cf57e755b926d48cca12 after 103.37408ms I0713 11:24:16.937711 23242 cgroups.cpp:1439] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99/5bb4c6ae725bea06daf6faeca2dfec7d058e504505c2bb6b54aba831ce5c3d27 after 103.67488ms I0713 11:24:16.939801 23240 cgroups.cpp:2723] Thawing cgroup /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99/6c0550ac53dc5001953bea3591b1192833591aff59b1cf57e755b926d48cca12 I0713 11:24:16.940266 23242 cgroups.cpp:2723] Thawing cgroup /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99/5bb4c6ae725bea06daf6faeca2dfec7d058e504505c2bb6b54aba831ce5c3d27 I0713 11:24:16.942387 23240 cgroups.cpp:1468] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99/6c0550ac53dc5001953bea3591b1192833591aff59b1cf57e755b926d48cca12 after 2.450176ms I0713 11:24:16.943276 23242 cgroups.cpp:1468] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/c5b88043-6e46-4bde-bb8d-96dbaafe0e99/5bb4c6ae725bea06daf6faeca2dfec7d058e504505c2bb6b54aba831ce5c3d27 after 2.972928ms E0713 11:24:16.986686 23244 slave.cpp:4520] Termination of executor 'compose-paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0' of framework 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 failed: Failed to kill all processes in the container: Failed to kill tasks in nested cgroups: Collect failed: Failed to kill all processes in cgroup: Failed to read cgroups control 'cgroup.procs': 'mesos/c5b88043-6e46-4bde-bb8d-9 6dbaafe0e99/c0a8fa6f2fdaeb2963781ec39b738b665459911f07245c7254dddddbc568251c' is not a valid cgroup I0713 11:24:16.986958 23244 slave.cpp:4646] Cleaning up executor 'compose-paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0' of framework 3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000 at executor(1)@10.180.72.144:45285 I0713 11:24:16.987180 23242 gc.cpp:55] Scheduling '/var/lib/mesos/slaves/3ae40afd-a4c7-4f34-8668-2f513762b2a6-S87/frameworks/3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000/executors/compose-paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0/runs/c5b88043-6e46-4bde-bb8d-96dbaafe0e99' for gc 6.99998857482667days in the future I0713 11:24:16.987231 23242 gc.cpp:55] Scheduling '/var/lib/mesos/slaves/3ae40afd-a4c7-4f34-8668-2f513762b2a6-S87/frameworks/3ae40afd-a4c7-4f34-8668-2f513762b2a6-0000/executors/compose-paypal-iptest-raptorapp-a-0-c1e7663a-cc2e-4856-9b31-534dd3da22d0' for gc 6.99998857391111days in the future