Bill Farner created AURORA-302:
----------------------------------
Summary: TaskGroups may abandon tasks
Key: AURORA-302
URL: https://issues.apache.org/jira/browse/AURORA-302
Project: Aurora
Issue Type: Bug
Components: Scheduler
Reporter: Bill Farner
I've yet to figure out exactly how this happens, but I've witnessed it twice
in succession in vagrant (I was unable to reproduce it while trying to debug it),
and once in production.
TaskGroups appears to have a bug that causes it to keep a group in the
{{groups}} data structure with no corresponding async task in {{executor}}.
By design, each task group should almost always be represented in both. The
only exception is the brief window in which the executor entry is absent while
a task from the group is being scheduled.
The group I observed in production looked like this (in /pendingtasks):
{noformat}
{
penaltyMs: 30000,
name: "role/env/job",
taskIds: [ ]
},
{noformat}
The group I saw in vagrant:
{noformat}
{
penaltyMs: 1,
name: "role/env/job",
taskIds: [ ]
},
{noformat}
Additionally, {{schedule_queue_size}} in vagrant was consistently zero while
I observed this, further supporting the hypothesis that the group was not being
evaluated.
TaskGroups is intended to invalidate empty groups, so the mere presence of an
empty group suggests that its pending evaluation has been dropped.
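The invariant described above can be sketched as a small, self-contained Java model. Everything in it is hypothetical: the class and method names, and the {{queued}}/{{evaluating}} sets standing in for the real {{executor}}, are illustrative only and are not Aurora's actual API. It assumes empty groups are invalidated when they are next evaluated. {{dropQueuedEvaluation}} simply injects the suspected symptom, since the root cause is still unknown.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of the TaskGroups bookkeeping invariant; not Aurora's actual code. */
public class TaskGroupsInvariantSketch {
  // Group name (e.g. "role/env/job") -> IDs of its pending tasks.
  private final Map<String, Set<String>> groups = new HashMap<>();
  // Groups with an evaluation queued in the (modeled) executor.
  private final Set<String> queued = new HashSet<>();
  // Groups being evaluated right now; their executor entry is briefly absent.
  private final Set<String> evaluating = new HashSet<>();

  /** Adds a pending task and queues an evaluation unless one is already running. */
  public void addTask(String group, String taskId) {
    groups.computeIfAbsent(group, k -> new HashSet<>()).add(taskId);
    if (!evaluating.contains(group)) {
      queued.add(group);
    }
  }

  /** Assumption: a task leaves PENDING elsewhere; its group lingers until re-evaluated. */
  public void removeTask(String group, String taskId) {
    Set<String> pending = groups.get(group);
    if (pending != null) {
      pending.remove(taskId);
    }
  }

  /** The executor dequeues a group's evaluation and starts running it. */
  public void beginEvaluation(String group) {
    queued.remove(group);
    evaluating.add(group);
  }

  /** Ends an evaluation: drops assigned tasks, then invalidates or requeues the group. */
  public void finishEvaluation(String group, Set<String> assigned) {
    evaluating.remove(group);
    Set<String> pending = groups.get(group);
    if (pending == null) {
      return;
    }
    pending.removeAll(assigned);
    if (pending.isEmpty()) {
      groups.remove(group); // Empty groups are invalidated.
    } else {
      queued.add(group); // Non-empty groups are evaluated again later.
    }
  }

  /** Hypothetical fault injection: the executor loses a group's entry (cause unknown). */
  public void dropQueuedEvaluation(String group) {
    queued.remove(group);
  }

  /** Groups violating the invariant: tracked, but neither queued nor being evaluated. */
  public Set<String> abandonedGroups() {
    Set<String> abandoned = new HashSet<>();
    for (String group : groups.keySet()) {
      if (!queued.contains(group) && !evaluating.contains(group)) {
        abandoned.add(group);
      }
    }
    return abandoned;
  }

  public static void main(String[] args) {
    TaskGroupsInvariantSketch tg = new TaskGroupsInvariantSketch();
    tg.addTask("role/env/job", "task-1");
    tg.beginEvaluation("role/env/job");
    // Executor entry is briefly absent, but the group is being evaluated.
    System.out.println(tg.abandonedGroups());
    tg.finishEvaluation("role/env/job", Set.of("task-1"));
    // The now-empty group was invalidated.
    System.out.println(tg.abandonedGroups());

    tg.addTask("role/env/job", "task-2");
    tg.removeTask("role/env/job", "task-2");
    tg.dropQueuedEvaluation("role/env/job");
    // An empty group that will never be evaluated (and so never invalidated) again.
    System.out.println(tg.abandonedGroups());
  }
}
```

Running {{main}} prints two empty sets followed by {{[role/env/job]}}: a group with no pending task IDs that is tracked but never evaluated again, the same shape as the reported symptom.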