David Robinson created AURORA-493:
-------------------------------------
Summary: expose accurate metrics of state transitions
Key: AURORA-493
URL: https://issues.apache.org/jira/browse/AURORA-493
Project: Aurora
Issue Type: Task
Components: Scheduler
Reporter: David Robinson
Priority: Minor
The task store metrics (task_store_*) exposed via http://localhost:8081/vars
aren't accurate enough to be used for alerting purposes. At first glance the
task_store_* metrics look like they could be used to alert on, among other
things, an increase in LOST tasks (task_store_LOST), but the numbers actually
decrease as tasks are pruned. When a task becomes lost, task_store_LOST is
incremented, but it is decremented again when that lost task is pruned.
Therefore, if both the increment and the decrement occur within an alerting
system's polling interval, the lost task(s) will not be captured.
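
To illustrate the failure mode described above, here is a minimal sketch in
Python. It is not Aurora code: the `gauge` counter and the `poll` function are
hypothetical stand-ins for the task_store_* values and for an alerting system
reading them.

```python
from collections import Counter

# Hypothetical stand-in for the gauge-style task_store_* values: one count per
# state, reflecting only the tasks currently held in the store.
gauge = Counter()


def poll():
    """Return what an alerting system would read at one polling instant."""
    return gauge["LOST"]


before = poll()      # first poll
gauge["LOST"] += 1   # a task becomes LOST
gauge["LOST"] -= 1   # the same task is pruned before the next poll
after = poll()       # second poll

# Both polls read the same value, so the lost task is invisible to the poller.
print(before, after)
```

Because the poller only sees the value at each polling instant, any change
that is undone between two polls leaves no trace.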
Consider adding counters of task state transitions that aren't touched when
tasks are pruned. They should show the total number of tasks that have
transitioned through, or terminated in, each state.
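
A rough sketch of the idea, again in Python and under stated assumptions: the
`TaskStore` class, its method names, and the `transitions` counter are all
hypothetical and only model the behaviour requested here, not the scheduler's
actual storage or metric names.

```python
from collections import Counter


class TaskStore:
    """Toy model of a task store; not Aurora's implementation."""

    def __init__(self):
        self.tasks = {}               # task_id -> current state
        self.gauge = Counter()        # tasks currently in each state
        self.transitions = Counter()  # total transitions into each state

    def transition(self, task_id, new_state):
        old_state = self.tasks.get(task_id)
        if old_state is not None:
            self.gauge[old_state] -= 1
        self.tasks[task_id] = new_state
        self.gauge[new_state] += 1
        self.transitions[new_state] += 1  # only ever incremented

    def prune(self, task_id):
        state = self.tasks.pop(task_id)
        self.gauge[state] -= 1
        # self.transitions is deliberately left untouched when pruning.


store = TaskStore()
last_seen = store.transitions["LOST"]  # value at the previous poll

store.transition("task-1", "RUNNING")
store.transition("task-1", "LOST")
store.prune("task-1")                  # pruned before the next poll

# Difference between two polls of the never-decremented counter.
newly_lost = store.transitions["LOST"] - last_seen
print(store.gauge["LOST"], newly_lost)
```

With a counter that only ever increases, an alerting system can compare the
current value against the value from its previous poll, so a task that is lost
and pruned within one polling interval still shows up in the difference.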