Renan DelValle created AURORA-1942:
--------------------------------------
Summary: Improve Aurora behavior with regards to Mesos Agents
violating reregistration timeouts
Key: AURORA-1942
URL: https://issues.apache.org/jira/browse/AURORA-1942
Project: Aurora
Issue Type: Task
Components: Scheduler
Reporter: Renan DelValle
A Mesos Agent Lost message can be received in two scenarios resulting in
different outcomes:
1) A Mesos Agent can fail the health check done by the Mesos Master
(max_agent_ping_timeouts violation), which leads to an Agent Lost message along
with a TASK_LOST message for each task running on the unhealthy Agent.
2) A Mesos Agent can fail to re-register after an election has taken place
(agent_reregister_timeout violation). In this situation the newly elected Mesos
Master is unable to send a TASK_LOST message for the tasks that were running on
the Agent that failed to re-register, because Masters do not store any
information about the tasks that are currently running.
Scenario 2 can lead to (a) "missing" instances for the tasks scheduled on the
rogue Agent, until an explicit reconciliation is done, and/or (b) "leaked"
tasks, if the Agent re-registers after Aurora has replaced the missing tasks.
Leaked tasks will only be cleaned up upon an implicit reconciliation.
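
As a rough illustration of (a) and (b), the two kinds of reconciliation can be
modeled as set differences between the scheduler's view and the Master's view
of running tasks. This is a simplified sketch, not Aurora or Mesos code; the
function names and task IDs are hypothetical.

```python
def explicit_reconcile(scheduler_view, master_view):
    """Tasks the scheduler asks about that the Master does not know.

    Simplified model: the scheduler would see these reported as lost.
    """
    return sorted(set(scheduler_view) - set(master_view))


def implicit_reconcile(scheduler_view, master_view):
    """Tasks the Master reports that the scheduler does not consider active.

    Simplified model: these are the "leaked" tasks.
    """
    return sorted(set(master_view) - set(scheduler_view))


# (a) After failover the Agent never re-registered, so the new Master
# has no record of job-0 while the scheduler still believes it is running.
scheduler_view = {"job-0", "job-1"}
master_view = {"job-1"}
missing = explicit_reconcile(scheduler_view, master_view)

# (b) The scheduler replaced job-0, then the Agent came back with the
# old job-0 still running on it.
scheduler_view = {"job-0-replacement", "job-1"}
master_view = {"job-0", "job-0-replacement", "job-1"}
leaked = implicit_reconcile(scheduler_view, master_view)

print("missing:", missing)
print("leaked:", leaked)
```
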
For (a), one solution is to transition the tasks on a missing Agent to the LOST
state upon receiving an Agent Lost (Slave Lost) message.
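
A minimal sketch of that idea, assuming a hypothetical in-memory task store in
place of Aurora's real storage and state machine (the Task and TaskStore
classes, the on_agent_lost handler, and the set of active states are all
illustrative names, not Aurora's API):

```python
from dataclasses import dataclass
from typing import Iterable, List

# Assumed set of states that count as "active" for this sketch.
ACTIVE_STATES = {"ASSIGNED", "STARTING", "RUNNING"}


@dataclass
class Task:
    task_id: str
    agent_id: str
    status: str = "RUNNING"


class TaskStore:
    """Hypothetical stand-in for the scheduler's task store."""

    def __init__(self, tasks: Iterable[Task]):
        self._tasks = {t.task_id: t for t in tasks}

    def active_on_agent(self, agent_id: str) -> List[Task]:
        return [
            t for t in self._tasks.values()
            if t.agent_id == agent_id and t.status in ACTIVE_STATES
        ]

    def get(self, task_id: str) -> Task:
        return self._tasks[task_id]


def on_agent_lost(store: TaskStore, agent_id: str) -> List[str]:
    """Mark every active task on the lost Agent as LOST.

    Returns the IDs of the transitioned tasks so the caller can
    schedule replacements without waiting for a reconciliation.
    """
    lost_ids = []
    for task in store.active_on_agent(agent_id):
        task.status = "LOST"
        lost_ids.append(task.task_id)
    return sorted(lost_ids)


store = TaskStore([
    Task("job-0", "agent-A"),
    Task("job-1", "agent-B"),
    Task("job-2", "agent-A", status="FINISHED"),
])
transitioned = on_agent_lost(store, "agent-A")
print(transitioned)
```

Only tasks that are still active on the lost Agent are transitioned here;
tasks already in a terminal state are left untouched.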
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)