+1 to A and B. Aurora has enabled checkpointing for years and requires operators to enable checkpointing on the slaves.

--
Zameer Manji

On Sat, Oct 15, 2016 at 11:57 AM, Joris Van Remoortere <jo...@mesosphere.io> wrote:

> I'm in favor of A & B. I find it provides a better "first experience" to
> users.
> From my experience you usually have to have an explicit reason to not want
> to checkpoint. Most people assume the semantics provided by the checkpoint
> behavior are the default, and it can be a frustrating experience for them to
> find out that is not the case.
>
> --
> *Joris Van Remoortere*
> Mesosphere
>
> On Fri, Oct 14, 2016 at 3:11 PM, Neil Conway <neil.con...@gmail.com>
> wrote:
>
>> Hi folks,
>>
>> I'd like input from individuals who currently use frameworks but do
>> not enable checkpointing.
>>
>> Background: "checkpointing" is a parameter that can be enabled in
>> FrameworkInfo; if enabled, the agent will write the framework PID,
>> executor PIDs, and status updates to disk for any tasks started by
>> that framework. This checkpointed information means that these tasks
>> can survive an agent crash: if the agent exits (whether due to
>> crashing or as part of an upgrade procedure), a restarted agent can
>> use this information to reconnect to executors started by the previous
>> instance of the agent. The downside is that checkpointing requires
>> some additional disk I/O at the agent.
>>
>> Checkpointing is not currently the default, but in my experience it is
>> often enabled for production frameworks. As part of the work on
>> supporting partition-aware Mesos frameworks (see MESOS-4049), we are
>> considering:
>>
>> (a) requiring that partition-aware frameworks must also enable
>> checkpointing, and/or
>> (b) enabling checkpointing by default.
>>
>> If you have intentionally decided to disable checkpointing for your
>> Mesos framework, I'd be curious to hear more about your use case and
>> why you haven't enabled it.
>>
>> Thanks!
>>
>> Neil