[
https://issues.apache.org/jira/browse/MESOS-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15832374#comment-15832374
]
Neil Conway commented on MESOS-6966:
------------------------------------
The naive solution here would be to decrement {{master/tasks_unreachable}} for
each task that goes back to running when a previously unreachable agent
re-registers. The problem with this is that the master might have failed over,
in which case the task was *not* included in the previous value of the metric.
For example, if the master marks an agent unreachable that is running a single
task, then fails over, then the agent re-registers, decrementing the metric
would yield {{-1}}, which is clearly unreasonable.

We could decrement the value of the metric if it is positive, but that seems
almost as bad.

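To make the failover case concrete, here is a minimal simulation of the two
counter-based options. It is only an illustrative model, not Mesos code: the
{{Master}} class, its methods, and the {{clamp}} flag are hypothetical names, and
the metric is modelled as a plain in-memory integer that starts at zero whenever
a new master takes over.

```python
class Master:
    """Hypothetical stand-in for a master; the counter lives only in memory."""

    def __init__(self, clamp=False):
        self.tasks_unreachable = 0  # starts at zero after every failover
        self.clamp = clamp          # if True, never decrement below zero

    def mark_agent_unreachable(self, num_tasks):
        self.tasks_unreachable += num_tasks

    def agent_reregistered(self, num_tasks):
        for _ in range(num_tasks):
            if self.clamp and self.tasks_unreachable <= 0:
                continue  # skip the decrement instead of going negative
            self.tasks_unreachable -= 1


# Naive decrement: the old master marks an agent with one task unreachable...
old_master = Master()
old_master.mark_agent_unreachable(1)

# ...then fails over. The new master never counted that task.
new_master = Master()
new_master.agent_reregistered(1)
naive_value = new_master.tasks_unreachable  # -1

# Clamped decrement: agent A (one task) was marked unreachable before the
# failover; agent C (one task) is marked unreachable after it.
clamped_master = Master(clamp=True)
clamped_master.mark_agent_unreachable(1)  # agent C
clamped_master.agent_reregistered(1)      # agent A comes back
clamped_value = clamped_master.tasks_unreachable  # 0, yet C's task is still unreachable
```

The second scenario is one plausible reading of why decrementing only when the
value is positive is "almost as bad": the value stays non-negative, but a
re-registration the new master never counted cancels out a task that is still
unreachable.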
We could have the metric be a gauge that returns the size of the
{{unreachableTasks}} cache in the master. This wouldn't be terrible, but it
would be inaccurate: we'd be reporting the number of unreachable tasks that the
master has cached in memory, not the "true" number of unreachable tasks. It
seems infeasible to track the "true" number of unreachable tasks, though.
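
The gauge option can be sketched in the same style. Again this is an assumption
laden model rather than the actual master code: the cache is represented as a
hypothetical dictionary from agent ID to task IDs, and the gauge is simply a
method evaluated each time the metric is read.

```python
class Master:
    """Hypothetical master whose metric is a gauge over an in-memory cache."""

    def __init__(self):
        # Hypothetical cache: agent ID -> IDs of tasks on that unreachable agent.
        # It is empty whenever a new master starts.
        self.unreachable_tasks = {}

    def mark_agent_unreachable(self, agent_id, task_ids):
        self.unreachable_tasks[agent_id] = list(task_ids)

    def agent_reregistered(self, agent_id):
        # Removing an agent the cache does not know about is a no-op.
        self.unreachable_tasks.pop(agent_id, None)

    def tasks_unreachable(self):
        # Gauge: computed from the cache on every read, never incremented.
        return sum(len(tasks) for tasks in self.unreachable_tasks.values())


master = Master()
master.mark_agent_unreachable("agent-1", ["t1", "t2"])
master.mark_agent_unreachable("agent-2", ["t3"])
before = master.tasks_unreachable()       # 3

master.agent_reregistered("agent-1")
after = master.tasks_unreachable()        # 1

# After a failover the cache starts empty, so the gauge reads 0 even though
# agent-2's task is still unreachable.
failed_over = Master()
after_failover = failed_over.tasks_unreachable()  # 0

failed_over.agent_reregistered("agent-2")
after_rereg = failed_over.tasks_unreachable()     # still 0, never negative
```

The gauge can never go negative and it does decrease on re-registration, but
as the last two reads show, it only reflects what this master has cached.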
> master/tasks_unreachable metric never decremented
> -------------------------------------------------
>
> Key: MESOS-6966
> URL: https://issues.apache.org/jira/browse/MESOS-6966
> Project: Mesos
> Issue Type: Bug
> Components: master
> Reporter: Neil Conway
> Assignee: Neil Conway
> Labels: mesosphere
>
> The {{master/tasks_unreachable}} metric is incremented for each task that was
> running on an agent that is marked unreachable. However, this metric is never
> decremented when/if the agent re-registers. Hence, if an agent is repeatedly
> marked unreachable, the metric will continually increase.
> Basically, this metric is actually recording the *number of times* that this
> master has marked a task unreachable. This is inconsistent with some other
> metrics: for example, {{master/tasks_running}} reports the number of tasks
> that are presently running.