Zhu Zhu created FLINK-14206:
-------------------------------
Summary: Make fullRestart metric to count fine grained restarts as
well
Key: FLINK-14206
URL: https://issues.apache.org/jira/browse/FLINK-14206
Project: Flink
Issue Type: Improvement
Components: Runtime / Coordination
Affects Versions: 1.9.0
Reporter: Zhu Zhu
Fix For: 1.9.1
With fine grained recovery introduced in 1.9.0, the {{fullRestart}} metric only
counts how many times the entire graph has been restarted, not including the
number of fine grained failure restarts.
As many users rely on this metric for failure detection, monitoring, and
alerting, I propose making it count fine grained failure restarts as well.
The concrete proposal is:
1. Add a counter {{numberOfRestartCounter}} in ExecutionGraph to count all
restarts. The counter is not to be registered with any metric group.
2. Let {{fullRestart}} query the value of this counter instead of
{{ExecutionGraph#globalModVersion}}.
3. Increment {{numberOfRestartCounter}} in {{ExecutionGraph#failGlobal}}.
4. Increment {{numberOfRestartCounter}} in
{{ExecutionGraph#notifyExecutionChange}} when notifying the failover strategy,
or possibly in {{AdaptedRestartPipelinedRegionStrategyNG}} so that only
failovers that actually happen are counted.
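
The four steps above could be sketched roughly as follows. This is only an illustration of the idea, not Flink code: the class {{RestartCountingGraphSketch}}, the method {{onRegionFailover}}, and the use of {{LongSupplier}} as a stand-in for a metric gauge are all assumptions made for the example. Only the names {{numberOfRestartCounter}}, {{globalModVersion}}, and {{failGlobal}} come from the proposal.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical stand-in for ExecutionGraph; not the actual Flink class. */
class RestartCountingGraphSketch {

    // Bumped only on global (full-graph) failover, as the metric source is today.
    private long globalModVersion = 1L;

    // Step 1: counts every restart, global or fine grained.
    // It is a plain field and is not itself registered as a metric.
    private final AtomicLong numberOfRestartCounter = new AtomicLong();

    // Step 3: a global failover bumps the version and the restart counter.
    void failGlobal() {
        globalModVersion++;
        numberOfRestartCounter.incrementAndGet();
    }

    // Step 4 (hypothetical hook): called when a fine grained, region-level
    // failover is actually triggered.
    void onRegionFailover() {
        numberOfRestartCounter.incrementAndGet();
    }

    // Step 2: the gauge behind fullRestart reads the counter
    // instead of globalModVersion.
    LongSupplier fullRestartGauge() {
        return numberOfRestartCounter::get;
    }

    long getGlobalModVersion() {
        return globalModVersion;
    }
}
```

With this shape, one global failover followed by two region failovers makes the gauge report 3, while {{globalModVersion}} advances only once.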