[ 
https://issues.apache.org/jira/browse/FLINK-14164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970998#comment-16970998
 ] 

Zhu Zhu commented on FLINK-14164:
---------------------------------

Compared to the {{numberOfRestarts}} metric, {{fullRestarts}} increments in 3 
extra cases: job cancel, job suspend, and the last, suppressed restart. 
For the legacy scheduler, {{numberOfRestarts}} is the same as {{fullRestarts}}. 
For the ng scheduler, {{numberOfRestarts}} does not increment in the 3 extra 
cases. But maybe this is better, since no restart (and sometimes not even a 
failure) happens in these cases.
How about just noting this behavior change in the release notes when 
announcing the new metric?
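
To make the difference concrete, here is a rough sketch of how the two counters would diverge under the ng scheduler. It is a toy model based only on the cases listed above, not the actual scheduler code: the class {{RestartCounterSketch}}, the {{Event}} enum, and the method names are all made up for illustration.

```java
/** Toy model of the two counters under the ng scheduler; not Flink's real scheduler code. */
final class RestartCounterSketch {

    /** Hypothetical event names, chosen only for this illustration. */
    enum Event { FULL_RESTART, REGION_FAILOVER, CANCEL, SUSPEND, SUPPRESSED_RESTART }

    private long fullRestarts;
    private long numberOfRestarts;

    void onEvent(Event event) {
        switch (event) {
            case FULL_RESTART:
                // The whole graph is restarted: both counters move.
                fullRestarts++;
                numberOfRestarts++;
                break;
            case REGION_FAILOVER:
                // Fine grained recovery: only the new metric counts it.
                numberOfRestarts++;
                break;
            case CANCEL:
            case SUSPEND:
            case SUPPRESSED_RESTART:
                // The 3 extra cases: no restart actually happens,
                // so only fullRestarts increments.
                fullRestarts++;
                break;
        }
    }

    long fullRestarts() {
        return fullRestarts;
    }

    long numberOfRestarts() {
        return numberOfRestarts;
    }
}
```

Feeding one event of each kind into this model leaves {{fullRestarts}} and {{numberOfRestarts}} at different values, which is the behavior change the release notes would need to mention.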

> Add a metric to show failover count regarding fine grained recovery
> -------------------------------------------------------------------
>
>                 Key: FLINK-14164
>                 URL: https://issues.apache.org/jira/browse/FLINK-14164
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination, Runtime / Metrics
>    Affects Versions: 1.10.0
>            Reporter: Zhu Zhu
>            Assignee: Zhu Zhu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.10.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Previously, Flink used the restart-all strategy to recover jobs from failures, 
> and the metric "fullRestarts" was used to show the count of failovers.
> However, with fine grained recovery introduced in 1.9.0, the "fullRestarts" 
> metric only reveals how many times the entire graph has been restarted, not 
> including the number of fine grained failure recoveries.
> As many users want to build their job alerting based on failovers, I 
> propose adding a new metric {{numberOfRestarts}} which also counts 
> fine grained recoveries. The metric should be a Gauge.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
