Jordan Ly created AURORA-1932:
---------------------------------
Summary: Failure accrual detection mechanism for bad agents
Key: AURORA-1932
URL: https://issues.apache.org/jira/browse/AURORA-1932
Project: Aurora
Issue Type: Story
Components: Scheduler
Reporter: Jordan Ly
Assignee: Jordan Ly
With the introduction of different OfferManager orderings (see
https://reviews.apache.org/r/59480/), we run the risk of repeatedly assigning
the same task to a bad agent.
We should develop a 'failure accrual' mechanism that tracks how many times
tasks fail on an agent. Once the failure count reaches a threshold, we should
blacklist that agent for a period of time so that it can be investigated and
the task can be assigned to a different agent.
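
As a rough illustration of the idea, here is a minimal Python sketch of a
per-agent failure counter with a temporary blacklist. Nothing in it comes from
the Aurora codebase: the class name FailureAccrual, its methods, and the
threshold, window_secs and blacklist_secs parameters are all hypothetical.
Counting only failures inside a sliding time window is an added assumption;
the proposal above only asks for a count and a threshold.

```python
import time
from collections import defaultdict, deque


class FailureAccrual:
    """Hypothetical sketch: count task failures per agent, blacklist repeat offenders."""

    def __init__(self, threshold=3, window_secs=600.0, blacklist_secs=1800.0,
                 clock=time.monotonic):
        self.threshold = threshold            # failures needed to blacklist (assumed value)
        self.window_secs = window_secs        # only failures this recent count (assumption)
        self.blacklist_secs = blacklist_secs  # how long an agent stays blacklisted (assumed value)
        self.clock = clock                    # injectable so tests can control time
        self._failures = defaultdict(deque)   # agent_id -> timestamps of recent failures
        self._blacklisted_until = {}          # agent_id -> time at which the blacklist expires

    def record_failure(self, agent_id):
        """Record one task failure on the agent; blacklist it if the threshold is reached."""
        now = self.clock()
        failures = self._failures[agent_id]
        failures.append(now)
        # Forget failures that fell out of the sliding window.
        while failures and now - failures[0] > self.window_secs:
            failures.popleft()
        if len(failures) >= self.threshold:
            self._blacklisted_until[agent_id] = now + self.blacklist_secs
            failures.clear()  # start counting afresh once the blacklist expires

    def is_blacklisted(self, agent_id):
        """Return True while the agent's blacklist period has not yet expired."""
        expiry = self._blacklisted_until.get(agent_id)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            del self._blacklisted_until[agent_id]
            return False
        return True

    def filter_agents(self, agent_ids):
        """Keep only the agents that are currently eligible for task assignment."""
        return [a for a in agent_ids if not self.is_blacklisted(a)]
```

In a scheduler, something like filter_agents would be applied to the
candidate agents before a task is matched to an offer, and record_failure
would be called whenever a task fails on an agent. Letting the blacklist
expire on its own, rather than making it permanent, matches the "for some
time" wording above and gives a recovered agent a chance to receive tasks
again.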