> AFAIK the absence of TASK_LOST statuses is expected. Master registry
> persists information only about agents. Tasks are recovered from
> re-registering agents. Because of that the failed over master can't send
> TASK_LOST for tasks that were running on the agent that didn't re-register,
> it
Great!
It's probably worthwhile to improve our reconciliation doc to suggest that
frameworks do explicit reconciliation on slave lost messages as well.
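
For illustration only, here is a minimal sketch of what "explicit reconciliation on slave lost" could look like on the framework side. It is not taken from Aurora or from the Mesos API: `TaskTracker` and its method names are hypothetical, and the returned list only stands in for whatever explicit reconciliation request the framework's driver actually accepts.

```python
from collections import defaultdict


class TaskTracker:
    """Hypothetical framework-side bookkeeping: task IDs launched per agent."""

    def __init__(self):
        self._tasks_by_agent = defaultdict(set)

    def task_launched(self, agent_id, task_id):
        # Remember where each non-terminal task is supposed to be running.
        self._tasks_by_agent[agent_id].add(task_id)

    def task_terminal(self, agent_id, task_id):
        # Forget tasks once a terminal status update has been received.
        self._tasks_by_agent[agent_id].discard(task_id)

    def on_agent_lost(self, agent_id):
        """Build an explicit reconciliation request for the lost agent.

        Assumption: the framework does not rely on receiving TASK_LOST for
        each task, and instead asks about every task it believes was running
        on that agent.
        """
        tasks = self._tasks_by_agent.get(agent_id, ())
        return [{"task_id": t, "agent_id": agent_id} for t in sorted(tasks)]
```

A framework would call `on_agent_lost` from its agent-lost callback and pass the result to its reconciliation call, repeating until every listed task has a status.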
On Tue, Jul 18, 2017 at 12:27 PM, Ilya Pronin
wrote:
> Vinod, sure, I'd like to. I'll also look into MESOS-6406 that
Vinod, sure, I'd like to. I'll also look into MESOS-6406 that Neil
mentioned, when I have time. If no one does that before :)
On Tue, Jul 18, 2017 at 1:14 AM, Vinod Kone wrote:
> On Mon, Jul 17, 2017 at 2:55 PM, Meghdoot bhattacharya <
> meghdoo...@yahoo.com.invalid>
On Mon, Jul 17, 2017 at 2:55 PM, Meghdoot bhattacharya <
meghdoo...@yahoo.com.invalid> wrote:
> When there is no master fail over and agents join back after the default
> 5*15 timeout, we do see tasks getting killed like it used to. Because in
> this case master has sent task lost to framework.
>
On Mon, Jul 17, 2017 at 12:48 PM, Ilya Pronin
wrote:
> BTW, the doc seems to be a bit outdated. It mentions shutting down agents
> that try to re-register after being removed due to failed health checks,
> which is no longer true. Plus there's nothing about partition
On Mon, Jul 17, 2017 at 9:34 AM, Neil Conway wrote:
> On Mon, Jul 17, 2017 at 9:20 AM, Ilya Pronin
> wrote:
>
> > AFAIK the absence of TASK_LOST statuses is expected. Master registry
> > persists information only about agents. Tasks are recovered
Can you clarify this statement a bit?
"BTW, the doc seems to be a bit outdated. It mentions shutting down agents
that try to re-register after being removed due to failed health checks,
which is no longer true."
When there is no master fail over and agents join back after the default 5*15
Sorry, I misread this. Thanks for the explanation, it makes sense now.
I guess reconciliation is the only way to handle this.
It would be good to update the docs to reflect the new/current behavior.
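
As a rough illustration of the explanation given in this thread (the registry persists only agents, and tasks are recovered from re-registering agents), the toy model below shows why a failed-over master has nothing to send for an agent that never comes back. `ToyMaster` and its methods are hypothetical and deliberately simplified; this is not how the Mesos master is implemented.

```python
class ToyMaster:
    """Toy model of a master that has just failed over."""

    def __init__(self, registry_agents):
        # Assumption from the thread: only agent IDs survive in the registry.
        self.registry_agents = set(registry_agents)
        # Task information is filled in only as agents re-register.
        self.tasks_by_agent = {}

    def agent_reregistered(self, agent_id, running_tasks):
        # The agent itself tells the master which tasks it is running.
        self.tasks_by_agent[agent_id] = set(running_tasks)

    def agents_never_heard_from(self):
        # Agents known from the registry that have not re-registered.
        return sorted(self.registry_agents - set(self.tasks_by_agent))

    def known_tasks(self, agent_id):
        # Empty for any agent that has not re-registered since failover.
        return sorted(self.tasks_by_agent.get(agent_id, ()))
```

In this model, `known_tasks` is empty for every agent returned by `agents_never_heard_from`, so there is no task list from which per-task TASK_LOST updates could be generated.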
On 2017-07-17 09:20 (-0700), Ilya Pronin wrote:
> Hi,
>
> AFAIK the
The old code doesn't look like it was able to send TASK_LOST updates in
such a situation either. The failed-over master simply doesn't have enough
information to do it, because it never heard from the agent that can tell
the master about its tasks. Could it be that the doc refers to TASK_LOST
Not sending TASK_LOST is a breaking change compared to previous behavior.
From the docs here:
http://mesos.apache.org/documentation/latest/high-availability-framework-guide/
When it is time to remove an agent, the master removes the agent from the
> list of registered agents in the master’s
On Mon, Jul 17, 2017 at 9:20 AM, Ilya Pronin
wrote:
> AFAIK the absence of TASK_LOST statuses is expected. Master registry
> persists information only about agents. Tasks are recovered from
> re-registering agents. Because of that the failed over master can't send
>
Hi,
AFAIK the absence of TASK_LOST statuses is expected. Master registry
persists information only about agents. Tasks are recovered from
re-registering agents. Because of that the failed over master can't send
TASK_LOST for tasks that were running on the agent that didn't re-register,
it simply
This looks like a serious bug unless we are missing something. Hoping for
clarifications.
Thx
> On Jul 14, 2017, at 3:52 PM, Renan DelValle wrote:
>
> Hi all,
>
> We're using Mesos 1.1.0 and have observed some unexpected behavior with
> regards to Agent
Hi all,
We're using Mesos 1.1.0 and have observed some unexpected behavior with
regards to Agent reregistration on our cluster.
When a health check failure happens, our framework (in this case Apache
Aurora) receives an Agent Lost message along with TASK_LOST messages for
each of the tasks that