[ https://issues.apache.org/jira/browse/MESOS-8481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Park updated MESOS-8481: -------------------------------- Priority: Blocker (was: Major) Target Version/s: 1.5.0 > Agent reboot during checkpointing may result in empty checkpoints. > ------------------------------------------------------------------ > > Key: MESOS-8481 > URL: https://issues.apache.org/jira/browse/MESOS-8481 > Project: Mesos > Issue Type: Bug > Reporter: Chun-Hung Hsiao > Assignee: Michael Park > Priority: Blocker > > An empty checkpoint file was created due to the following incident. > At 17:12:25, the master assigned a task to an agent: > {noformat} > I0123 17:12:25.000000 18618 master.cpp:11457] Adding task 5602 with resources > cpus(allocated: *):0.1; mem(allocated: *):128 on agent > aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4 at slave(1)@<agent_ip>:5051 > (<agent_ip>) > I0123 17:12:25.000000 18618 master.cpp:5017] Launching task 5602 of framework > 6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112 (Balloon Framework OOM) at > scheduler-fbba22f7-ebbc-4864-8394-0aa558f8ffaa@<ip>:10015 with resources > [...] on agent aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4 at > slave(1)@<agent_ip>:5051 (<agent_ip>) > {noformat} > Meanwhile, the agent is being rebooted: > {noformat} > $ last reboot > reboot system boot 3.10.0-693.11.6. Tue Jan 23 17:14 - 00:09 (06:55) > {noformat} > The agent log did not show any information about the task, possibly because > there was no fsync before reboot: > {noformat} > I0123 17:12:09.000000 17237 http.cpp:851] Authorizing principal > 'dcos_checks_agent' to GET the endpoint '/metrics/snapshot' > -- Reboot -- > I0123 17:15:40.000000 2689 logsink.cpp:89] Added FileSink for glog logs to: > /var/log/mesos/mesos-agent.log > {noformat} > However, the agent was checkpointing the task before reboot: > {noformat} > $ sudo stat > /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/ > File: > ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/’ > Size: 39 Blocks: 0 IO Block: 4096 directory > Device: ca40h/51776d Inode: 67306254 Links: 3 > Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root) > Context: system_u:object_r:unlabeled_t:s0 > Access: 2018-01-24 00:23:43.237322609 +0000 > Modify: 2018-01-23 17:12:25.751463030 +0000 > Change: 2018-01-23 17:12:25.751463030 +0000 > Birth: - > {noformat} > And since there was no fsync before reboot, all checkpoints resulted in empty > files: > {noformat} > $ sudo stat > /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.info > File: > ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.info’ > Size: 0 Blocks: 0 IO Block: 4096 regular empty file > Device: ca40h/51776d Inode: 33967500 Links: 1 > Access: (0600/-rw-------) Uid: ( 0/ root) Gid: ( 0/ root) > Context: system_u:object_r:unlabeled_t:s0 > Access: 2018-01-23 17:15:41.485506070 +0000 > Modify: 2018-01-23 17:12:25.749463047 +0000 > Change: 2018-01-23 17:12:25.749463047 +0000 > Birth: - > $ sudo stat > /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.pid > File: > ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.pid’ > Size: 0 Blocks: 0 IO Block: 4096 regular empty file > Device: ca40h/51776d Inode: 33967495 Links: 1 > Access: (0600/-rw-------) Uid: ( 0/ root) Gid: ( 0/ root) > Context: system_u:object_r:unlabeled_t:s0 > Access: 2018-01-23 23:00:42.190975780 +0000 > Modify: 2018-01-23 17:12:25.749463047 +0000 > Change: 2018-01-23 17:12:25.749463047 +0000 > Birth: - > $ sudo stat > /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/executor.info > File: > ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/executor.info’ > Size: 0 Blocks: 0 IO Block: 4096 regular empty file > Device: ca40h/51776d Inode: 67306255 Links: 1 > Access: (0600/-rw-------) Uid: ( 0/ root) Gid: ( 0/ root) > Context: system_u:object_r:unlabeled_t:s0 > Access: 2018-01-23 17:12:25.751463030 +0000 > Modify: 2018-01-23 17:12:25.751463030 +0000 > Change: 2018-01-23 17:12:25.751463030 +0000 > Birth: - > {noformat} > So were {{forked.pid}} and {{task.info}}. > As a result, the agent failed to recover after reboot: > {noformat} > E0123 17:15:41.000000 2709 slave.cpp:6800] EXIT with status 1: Failed to > perform recovery: Failed to recover framework > 6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112: Failed to read framework info from > '/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.info': > Found an empty file > {noformat} > The error came from > [https://github.com/apache/mesos/blob/a892a2e80255291e6cd5cb3b0e90b9a029989c99/3rdparty/stout/include/stout/protobuf.hpp#L367]. > This is related to a recent change of the checkpoint recovery logic: we used > to ignore empty checkpoint files before, but now it return errors on empty > checkpoint files. -- This message was sent by Atlassian JIRA (v7.6.3#76005)