Zhu Zhu created FLINK-14164:
-------------------------------
Summary: Add a metric to show failover count regarding fine
grained recovery
Key: FLINK-14164
URL: https://issues.apache.org/jira/browse/FLINK-14164
Project: Flink
Issue Type: Improvement
Components: Runtime / Coordination, Runtime / Metrics
Affects Versions: 1.9.0, 1.10.0
Reporter: Zhu Zhu
Fix For: 1.10.0
Previously, Flink used the restart-all strategy to recover jobs from failures, and
the metric "fullRestart" was used to show the count of failovers.
However, with fine-grained recovery introduced in 1.9.0, the "fullRestart"
metric only reveals how many times the entire graph has been restarted. It does
not include the number of fine-grained failure recoveries.
As many users want to build their job alerting on failovers, I propose
adding a new metric, {{numberOfFailures}} or {{numberOfRestarts}}, that also
counts fine-grained recoveries.
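To illustrate the intended difference between the two counts, here is a minimal sketch in plain Java. It does not use Flink's metric API or scheduler code. The class FailoverCounters, the enum RecoveryScope and the method onRecovery are hypothetical names chosen for this example. Only the metric names come from the description above.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical sketch, not Flink's implementation: tracks the existing
 * "fullRestart" count next to the proposed "numberOfRestarts" count.
 */
final class FailoverCounters {

    /** Assumed classification of a recovery; the names are illustrative only. */
    enum RecoveryScope { FULL_GRAPH, REGION }

    // Incremented only when the entire graph is restarted.
    private final AtomicLong fullRestarts = new AtomicLong();

    // Incremented on every recovery, full or fine-grained.
    private final AtomicLong numberOfRestarts = new AtomicLong();

    /** Records one recovery of the given scope. */
    void onRecovery(RecoveryScope scope) {
        numberOfRestarts.incrementAndGet();
        if (scope == RecoveryScope.FULL_GRAPH) {
            fullRestarts.incrementAndGet();
        }
    }

    long getFullRestarts() {
        return fullRestarts.get();
    }

    long getNumberOfRestarts() {
        return numberOfRestarts.get();
    }
}
```

With two fine-grained recoveries followed by one full restart, getFullRestarts() would report 1 while getNumberOfRestarts() would report 3. An alert built on the second value would therefore also fire on the fine-grained failovers that the first value misses.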