[ 
https://issues.apache.org/jira/browse/FLINK-8043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16334310#comment-16334310
 ] 

Till Rohrmann commented on FLINK-8043:
--------------------------------------

Hi [~stevenz3wu], the reason why one metric is a gauge and the other is a 
counter is simple. {{task_failures}} needs to be a counter because the number 
of task failures is not stored anywhere so far, so the metric itself has to 
keep the count. In contrast, the number of global recoveries is already stored 
within the {{ExecutionGraph}}. Therefore, we only need to expose it as a gauge 
which returns this value. Does this make sense?
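
To illustrate the distinction, here is a minimal sketch in plain Java. It does not use Flink's actual metric classes: {{SimpleCounter}}, {{SimpleGauge}} and {{RecoveryTracker}} are hypothetical stand-ins, with {{RecoveryTracker}} playing the role of a component that already tracks the number of global recoveries itself.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a counter: the metric object itself stores the count. */
final class SimpleCounter {
    private long count; // not thread-safe, kept simple for illustration

    void inc() {
        count++;
    }

    long getCount() {
        return count;
    }
}

/** Hypothetical stand-in for a gauge: stores nothing and reads the value on demand. */
final class SimpleGauge<T> {
    private final Supplier<T> source;

    SimpleGauge(Supplier<T> source) {
        this.source = source;
    }

    T getValue() {
        return source.get();
    }
}

/** Hypothetical component that already keeps the number of global recoveries. */
final class RecoveryTracker {
    private long fullRestarts;

    void restartAll() {
        fullRestarts++;
    }

    long getFullRestarts() {
        return fullRestarts;
    }
}

final class MetricsSketch {
    public static void main(String[] args) {
        // Task failures are not stored anywhere else, so the counter holds the state.
        SimpleCounter taskFailures = new SimpleCounter();
        taskFailures.inc();
        taskFailures.inc();

        // Global recoveries already live in the tracker, so the gauge only exposes them.
        RecoveryTracker tracker = new RecoveryTracker();
        SimpleGauge<Long> fullRestarts = new SimpleGauge<>(tracker::getFullRestarts);
        tracker.restartAll();

        System.out.println("task_failures=" + taskFailures.getCount());
        System.out.println("fullRestarts=" + fullRestarts.getValue());
    }
}
```

In this sketch both metrics report a monotonically increasing number; the only difference is where the state is kept.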

 

The naming part is indeed inconsistent and could be corrected.

> change fullRestarts (for fine grained recovery) from gauge to counter
> ---------------------------------------------------------------------
>
>                 Key: FLINK-8043
>                 URL: https://issues.apache.org/jira/browse/FLINK-8043
>             Project: Flink
>          Issue Type: Bug
>          Components: ResourceManager
>    Affects Versions: 1.3.2
>            Reporter: Steven Zhen Wu
>            Priority: Blocker
>             Fix For: 1.5.0, 1.4.1
>
>
> Fine-grained recovery publishes fullRestarts as a gauge, which is not 
> suitable for threshold-based alerting. Usually we would alert on something 
> like "fullRestarts > 0 happens 10 times in the last 15 minutes".
> In comparison, "task_failures" is published as a counter.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
