> AFAIK the absence of TASK_LOST statuses is expected. Master registry
> persists information only about agents. Tasks are recovered from
> re-registering agents. Because of that the failed over master can't send
> TASK_LOST for tasks that were running on the agent that didn't re-register,
> it simply doesn't know about them. The only thing the master can do in this
> situation is send LostSlaveMessage that will tell the scheduler that tasks
> on this agent are LOST/UNREACHABLE.
> 

+1.
----------------

This probably never got tracked in a JIRA: on a master failover, when an agent's 
re-registration times out, at least a slave lost message should be sent so that 
frameworks can start reconciliation (see the sketch below). Without that, this can 
lead to potential pool depletion in production if agents do not actually come back 
and the lost tasks are not detected quickly.
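
For illustration, here is roughly what I have in mind on the framework side.
This is only a sketch against the v0 C++ scheduler driver; tasksOnAgent() is
placeholder bookkeeping for this example, not a real Mesos API.

#include <vector>

#include <mesos/mesos.hpp>
#include <mesos/scheduler.hpp>

class MyScheduler : public mesos::Scheduler
{
public:
  // Invoked when the master reports an agent as lost. Kick off explicit
  // reconciliation for the tasks we believe were running on that agent so
  // that their lost/unreachable state is detected quickly.
  void slaveLost(
      mesos::SchedulerDriver* driver,
      const mesos::SlaveID& slaveId) override
  {
    std::vector<mesos::TaskStatus> statuses;

    for (const mesos::TaskID& taskId : tasksOnAgent(slaveId)) {
      mesos::TaskStatus status;
      status.mutable_task_id()->CopyFrom(taskId);
      status.mutable_slave_id()->CopyFrom(slaveId);
      // Required protobuf field; reconciliation itself only looks at the
      // task and agent IDs.
      status.set_state(mesos::TASK_STAGING);
      statuses.push_back(status);
    }

    // The master answers through statusUpdate() with the latest known state
    // of each task, e.g. TASK_LOST (or TASK_UNREACHABLE for partition-aware
    // frameworks).
    driver->reconcileTasks(statuses);
  }

  // Other Scheduler callbacks (registered, resourceOffers, statusUpdate,
  // ...) omitted here.

private:
  // Placeholder for the framework's own task -> agent bookkeeping.
  std::vector<mesos::TaskID> tasksOnAgent(const mesos::SlaveID& slaveId);
};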

If this has not been addressed yet, let me file a JIRA. Or can it be bundled with 
MESOS-6406, although this applies mostly to non-partition-aware tasks (second 
sketch below)?
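
For completeness, opting into partition awareness is just a capability on
FrameworkInfo, something along these lines (Mesos 1.1+; the names here are
only illustrative):

#include <mesos/mesos.hpp>

// A partition-aware framework receives TASK_UNREACHABLE for tasks on agents
// that miss the health check / re-registration timeout, instead of having
// those tasks shut down once the agent comes back.
mesos::FrameworkInfo makeFrameworkInfo()
{
  mesos::FrameworkInfo framework;
  framework.set_user("");             // Let Mesos fill in the current user.
  framework.set_name("my-framework");

  framework.add_capabilities()->set_type(
      mesos::FrameworkInfo::Capability::PARTITION_AWARE);

  return framework;
}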

Thx

> On Jul 18, 2017, at 12:35 PM, Vinod Kone <vinodk...@apache.org> wrote:
> 
> Great!
> 
> It's probably worthwhile to improve our reconciliation doc to suggest that
> frameworks do explicit reconciliation on slave lost messages as well.
> 
> On Tue, Jul 18, 2017 at 12:27 PM, Ilya Pronin <ipro...@twopensource.com>
> wrote:
> 
>> Vinod, sure, I'd like to. I'll also look into MESOS-6406 that Neil
>> mentioned, when I have time. If no one does that before :)
>> 
>>> On Tue, Jul 18, 2017 at 1:14 AM, Vinod Kone <vinodk...@apache.org> wrote:
>>> 
>>> On Mon, Jul 17, 2017 at 2:55 PM, Meghdoot bhattacharya <
>>> meghdoo...@yahoo.com.invalid> wrote:
>>> 
>>>> When there is no master fail over and agents join back after the default
>>>> 5*15 timeout, we do see tasks getting killed like it used to. Because in
>>>> this case master has sent task lost to framework.
>>>> But we are noticing shutdown() executor callback not getting invoked. We
>>>> started a different thread on it. This is mesos 1.1.
>>>> 
>>>> Are you trying to say tasks will leak in latest versions and again relies
>>>> on recon for the regular health check timeout scenario and agent joining
>>>> back?
>>>> 
>>> 
>>> There should be no task leaks. After partition awareness code has landed,
>>> the master no longer shuts down the agents in the above scenario but it
>>> still shuts down the tasks/executors of the non-partition-aware frameworks.
>>> So the observable behavior for a framework regarding its tasks/executors
>>> should not change. The one observable change is that frameworks do not get
>>> `LostSlaveMessage` (the `slaveLost()` callback on the driver) in this case.
>>> 
>> 
>> --
>> Ilya Pronin
>> 
