Re: Agent reregistration timeout, no TASK_LOST messages

2017-11-20 Thread Meghdoot bhattacharya
> AFAIK the absence of TASK_LOST statuses is expected. Master registry > persists information only about agents. Tasks are recovered from > re-registering agents. Because of that the failed over master can't send > TASK_LOST for tasks that were running on the agent that didn't re-register, > it

Re: Agent reregistration timeout, no TASK_LOST messages

2017-07-18 Thread Vinod Kone
Great! It's probably worthwhile to improve our reconciliation doc to suggest that frameworks do explicit reconciliation on slave lost messages as well. On Tue, Jul 18, 2017 at 12:27 PM, Ilya Pronin wrote: > Vinod, sure, I'd like to. I'll also look into MESOS-6406 that
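A minimal sketch of the suggestion above, assuming the framework keeps its own task-to-agent map. The names `on_agent_lost`, `task_to_agent`, and `reconcile` are hypothetical and not part of any Mesos API; `reconcile` stands in for whatever explicit reconciliation call the framework's scheduler library provides.

```python
from typing import Callable, Dict, List


def on_agent_lost(
    lost_agent_id: str,
    task_to_agent: Dict[str, str],
    reconcile: Callable[[List[str]], None],
) -> List[str]:
    """Request explicit reconciliation for every task last known on the lost agent.

    Hypothetical helper: it does not wait for TASK_LOST updates, which may
    never arrive if the master failed over before the agent re-registered.
    """
    # Collect the tasks the framework believes were running on that agent.
    affected = sorted(
        task_id
        for task_id, agent_id in task_to_agent.items()
        if agent_id == lost_agent_id
    )
    # Only issue a reconciliation request when there is something to ask about.
    if affected:
        reconcile(affected)
    return affected
```

The point of this shape is that the agent-lost message, not a per-task status update, triggers the check, so the framework learns the fate of its tasks even when no TASK_LOST is ever sent.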

Re: Agent reregistration timeout, no TASK_LOST messages

2017-07-18 Thread Ilya Pronin
Vinod, sure, I'd like to. I'll also look into MESOS-6406 that Neil mentioned, when I have time. If no one does that before :) On Tue, Jul 18, 2017 at 1:14 AM, Vinod Kone wrote: > On Mon, Jul 17, 2017 at 2:55 PM, Meghdoot bhattacharya < > meghdoo...@yahoo.com.invalid>

Re: Agent reregistration timeout, no TASK_LOST messages

2017-07-17 Thread Vinod Kone
On Mon, Jul 17, 2017 at 2:55 PM, Meghdoot bhattacharya < meghdoo...@yahoo.com.invalid> wrote: > When there is no master fail over and agents join back after the default > 5*15 timeout, we do see tasks getting killed like it used to. Because in > this case master has sent task lost to framework. >

Re: Agent reregistration timeout, no TASK_LOST messages

2017-07-17 Thread Vinod Kone
On Mon, Jul 17, 2017 at 12:48 PM, Ilya Pronin wrote: > BTW, the doc seems to be a bit outdated. It mentions shutting down agents > that try to re-register after being removed due to failed health checks, > which is no longer true. Plus there's nothing about partition

Re: Agent reregistration timeout, no TASK_LOST messages

2017-07-17 Thread Yan Xu
On Mon, Jul 17, 2017 at 9:34 AM, Neil Conway wrote: > On Mon, Jul 17, 2017 at 9:20 AM, Ilya Pronin > wrote: > > > AFAIK the absence of TASK_LOST statuses is expected. Master registry > > persists information only about agents. Tasks are recovered

Re: Agent reregistration timeout, no TASK_LOST messages

2017-07-17 Thread Meghdoot bhattacharya
Can you clarify this statement a bit: "BTW, the doc seems to be a bit outdated. It mentions shutting down agents that try to re-register after being removed due to failed health checks, which is no longer true." When there is no master fail over and agents join back after the default 5*15

Re: Agent reregistration timeout, no TASK_LOST messages

2017-07-17 Thread David McLaughlin
Sorry, I misread this. Thanks for the explanation, it makes sense now. I guess reconciliation is the only way to handle this. It would be good to update the docs to reflect the new/current behavior. On 2017-07-17 09:20 (-0700), Ilya Pronin wrote: > Hi,> > > AFAIK the

Re: Agent reregistration timeout, no TASK_LOST messages

2017-07-17 Thread Ilya Pronin
The old code doesn't look like it was able to send TASK_LOST updates in such a situation either. The failed over master simply doesn't have enough information to do it, because it never heard from the agent that can tell the master about its tasks. Could it be that the doc refers to TASK_LOST

Re: Agent reregistration timeout, no TASK_LOST messages

2017-07-17 Thread David McLaughlin
Not sending TASK_LOST is a breaking change compared to previous behavior. From the docs here: http://mesos.apache.org/documentation/latest/high-availability-framework-guide/ When it is time to remove an agent, the master removes the agent from the > list of registered agents in the master’s

Re: Agent reregistration timeout, no TASK_LOST messages

2017-07-17 Thread Neil Conway
On Mon, Jul 17, 2017 at 9:20 AM, Ilya Pronin wrote: > AFAIK the absence of TASK_LOST statuses is expected. Master registry > persists information only about agents. Tasks are recovered from > re-registering agents. Because of that the failed over master can't send >

Re: Agent reregistration timeout, no TASK_LOST messages

2017-07-17 Thread Ilya Pronin
Hi, AFAIK the absence of TASK_LOST statuses is expected. Master registry persists information only about agents. Tasks are recovered from re-registering agents. Because of that the failed over master can't send TASK_LOST for tasks that were running on the agent that didn't re-register, it simply
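The explanation above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyMaster` and all of its fields and methods are hypothetical and do not mirror the Mesos master code. It assumes, as the message says, that only agent information survives a failover and that task lists come back only from agents that re-register.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMaster:
    """Hypothetical stand-in for a master; not the Mesos implementation."""

    # Persisted across failover (plays the role of the registry): agent IDs only.
    registry_agents: set = field(default_factory=set)
    # In-memory only: tasks per agent, rebuilt as agents (re-)register.
    tasks_by_agent: dict = field(default_factory=dict)

    def register_agent(self, agent_id, tasks):
        """An agent (re-)registers and reports the tasks it is running."""
        self.registry_agents.add(agent_id)
        self.tasks_by_agent[agent_id] = list(tasks)

    def failover(self):
        """Return a new master that recovers only what the registry holds."""
        return ToyMaster(registry_agents=set(self.registry_agents))

    def remove_unreregistered(self):
        """Drop agents that never re-registered after the failover.

        Returns the (agent_id, task_id) pairs this master could report as lost.
        """
        reportable = []
        missing = self.registry_agents - set(self.tasks_by_agent)
        for agent_id in sorted(missing):
            self.registry_agents.discard(agent_id)
            # No task list was ever recovered for this agent, so this adds nothing.
            reportable.extend(
                (agent_id, task_id)
                for task_id in self.tasks_by_agent.get(agent_id, [])
            )
        return reportable
```

In this model, if an old master knew agents `a1` and `a2` and only `a2` re-registers with the new master, `remove_unreregistered()` removes `a1` but returns an empty list: the new master has no record of which tasks ran on `a1`.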

Re: Agent reregistration timeout, no TASK_LOST messages

2017-07-15 Thread Meghdoot bhattacharya
This looks like a serious bug unless we are missing something. Hoping for clarifications. Thx > On Jul 14, 2017, at 3:52 PM, Renan DelValle wrote: > > Hi all, > > We're using Mesos 1.1.0 and have observed some unexpected behavior with > regards to Agent

Agent reregistration timeout, no TASK_LOST messages

2017-07-14 Thread Renan DelValle
Hi all, We're using Mesos 1.1.0 and have observed some unexpected behavior with regards to Agent reregistration on our cluster. When a health check failure happens, our framework (in this case Apache Aurora) receives an Agent Lost message along with TASK_LOST messages for each of the tasks that