Setting `failover_timeout` is key. The Apache Aurora framework defaults this value to 21 days so that tasks are not accidentally destroyed in a production environment. FWIW, I think the Mesos default of 0 is terrible. Frameworks should have to opt in to having their tasks killed on disconnect rather than opt out of it: as things stand, a minor ZK or network blip can destroy tasks by default.
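In case it helps, here is a minimal sketch in Python of the two steps from the quoted message: set a non-zero `failover_timeout`, and reuse the framework id from the previous run. It is an illustration under stated assumptions, not Aurora's code and not taken from the gists. A plain dict stands in for the FrameworkInfo PB so the sketch runs without the Mesos bindings; with the real bindings you would set the same two fields (`failover_timeout` and `id.value`) on the protobuf message. The helper names, the JSON state file, and the 7-day timeout are all hypothetical choices.

```python
import json
import os
import tempfile

# Assumption: 7 days is an arbitrary illustrative value. Pick whatever
# matches how long your scheduler may realistically stay down.
FAILOVER_TIMEOUT_SECS = 7 * 24 * 60 * 60.0


def load_framework_id(path):
    """Return the framework id saved by a previous run, or None on first start."""
    try:
        with open(path) as f:
            return json.load(f).get("framework_id")
    except FileNotFoundError:
        return None


def save_framework_id(path, framework_id):
    """Persist the id the master assigned at registration (write, then rename)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"framework_id": framework_id}, f)
    os.replace(tmp_path, path)


def build_framework_info(name, state_path):
    """Build a dict mirroring the FrameworkInfo fields that matter for failover."""
    info = {
        "name": name,
        # Step 1: a non-zero timeout tells the master to keep the tasks
        # running while the framework is disconnected.
        "failover_timeout": FAILOVER_TIMEOUT_SECS,
    }
    # Step 2: reuse the previous id, if any, so the master treats this run
    # as the same framework reconnecting rather than a new one.
    previous_id = load_framework_id(state_path)
    if previous_id is not None:
        info["id"] = {"value": previous_id}
    return info
```

The idea is to call `save_framework_id` once the master has told the scheduler its framework id, and `build_framework_info` on every start. On the very first run there is no saved id, so the `id` field is left unset and the master assigns one.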
On Wed, Feb 10, 2016 at 5:05 PM, Shuai Lin <[email protected]> wrote:

> Hi suppandi,
>
> To make sure your tasks survive framework restarts, you need to:
>
> 1. When registering your framework, set the `failover_timeout` attribute of
> the FrameworkInfo PB. This is how long the master waits for your
> framework to reconnect. By default it's 0, which is why your tasks are
> killed immediately when the framework exits.
>
> 2. When you reregister your framework, use the same framework id as the
> previous run, so that the master can tell it's the same framework
> reconnecting.
>
> Regards,
> Shuai
>
>
> On Thu, Feb 11, 2016 at 6:37 AM, suppandi <[email protected]> wrote:
>
> > Hi,
> >
> > I am trying to write my first framework and I wanted to test task
> > reconciliation. But whenever I kill my framework (with a kill -9), Mesos
> > seems to clean up the tasks by updating their state to TASK_KILLED.
> >
> > Is there a parameter when creating the framework or the task that makes
> > this happen? I want my tasks to remain alive when the framework is
> > disconnected/dead.
> >
> > Here is how I create my framework:
> > https://gist.github.com/anonymous/3357783ce938c4293947
> >
> > and here is how I create my task:
> > https://gist.github.com/anonymous/d35f917ade791127f4c5
> >
> > Thanks
> > suppandi
> >
>

--
Zameer Manji
