Hi folks, I'd like input from individuals who currently use frameworks but do not enable checkpointing.
Background: "checkpointing" is a parameter that can be enabled in FrameworkInfo; if enabled, the agent will write the framework PID, executor PIDs, and status updates to disk for any tasks started by that framework. This checkpointed information means that these tasks can survive an agent crash: if the agent exits (whether due to a crash or as part of an upgrade procedure), a restarted agent can use this information to reconnect to executors started by the previous instance of the agent. The downside is that checkpointing requires some additional disk I/O at the agent. Checkpointing is not currently the default, but in my experience it is often enabled for production frameworks.

As part of the work on supporting partition-aware Mesos frameworks (see MESOS-4049), we are considering:

(a) requiring that partition-aware frameworks must also enable checkpointing, and/or
(b) enabling checkpointing by default.

If you have intentionally decided to disable checkpointing for your Mesos framework, I'd be curious to hear more about your use case and why you haven't enabled it.

Thanks!
Neil
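
P.S. To make the two options concrete, here is a rough Python sketch of how they might interact. It is not Mesos code: FrameworkInfoSketch is a hypothetical stand-in for the few FrameworkInfo fields discussed above, the "PARTITION_AWARE" string is my assumed name for the partition-aware capability, and effective_checkpoint and validate are made-up helpers. Treat it as an illustration of the proposed rules only.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class FrameworkInfoSketch:
    """Hypothetical stand-in for the FrameworkInfo fields discussed here."""
    name: str
    # None means the framework did not set the field explicitly.
    checkpoint: Optional[bool] = None
    # Capability names are assumptions made for illustration.
    capabilities: Set[str] = field(default_factory=set)


def effective_checkpoint(info: FrameworkInfoSketch, default: bool = False) -> bool:
    """Resolve the checkpoint setting; option (b) would change `default` to True."""
    return default if info.checkpoint is None else info.checkpoint


def validate(info: FrameworkInfoSketch, default: bool = False) -> Optional[str]:
    """Option (a): reject partition-aware frameworks that do not checkpoint.

    Returns an error message, or None if the framework is acceptable.
    """
    partition_aware = "PARTITION_AWARE" in info.capabilities
    if partition_aware and not effective_checkpoint(info, default):
        return "partition-aware frameworks must enable checkpointing"
    return None
```

With today's default (checkpointing off), a partition-aware framework that leaves the field unset would be rejected under (a). If (b) were adopted as well, only frameworks that explicitly disable checkpointing would be affected, which is why I am most interested in hearing from those users.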