Fei Feng created FLINK-31482:
--------------------------------
Summary: support count jobmanager-failed failover times
Key: FLINK-31482
URL: https://issues.apache.org/jira/browse/FLINK-31482
Project: Flink
Issue Type: Improvement
Components: Runtime / Coordination, Runtime / Metrics
Affects Versions: 1.16.1
Reporter: Fei Feng
We have a metric `numRestarts` that indicates how many times a job has failed
over, but we have no metric that indicates how many times the job was recovered
from HA (high availability).
There are two problems:
1. When a JobManager process crashes, the metric system gives us no way of
knowing that the JobManager crashed and the job was recovered.
2. When a new JobManager becomes leader, `numRestarts` starts again from
zero, which sometimes misleads our users.
Most users consider a failover to be the same whether it was caused by a JM
failure or by a job failure; the effect, at least, is the same.
I suggest we
1. add a new metric that indicates how many times the job was recovered from HA
2. make the metric `numRestarts` also count the recoveries from HA
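
To make the two suggestions concrete, here is a minimal, self-contained sketch of the intended counting behaviour. It is not Flink code: `HaStore`, `JobCounters`, the key names, and the metric name `numHaRecoveries` are all hypothetical, and the in-memory map only stands in for whatever HA storage would actually hold the values. The sketch assumes the counters are persisted somewhere that outlives the JobManager process, so a new leader can restore them instead of starting from zero.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed counting; none of these names are Flink APIs.
class HaRecoveryCountSketch {

    /** Stand-in for HA storage that survives a JobManager crash (assumption). */
    static class HaStore {
        private final Map<String, Long> values = new HashMap<>();

        long get(String key) {
            return values.getOrDefault(key, 0L);
        }

        void put(String key, long value) {
            values.put(key, value);
        }
    }

    /** Counters owned by one JobManager leader session for a single job. */
    static class JobCounters {
        private final HaStore store;
        private final String jobId;
        private long numRestarts;
        private long numHaRecoveries; // hypothetical new metric (suggestion 1)

        JobCounters(HaStore store, String jobId, boolean recoveredFromHa) {
            this.store = store;
            this.jobId = jobId;
            // Restore the previous leader's values instead of starting from zero.
            this.numRestarts = store.get(jobId + ".numRestarts");
            this.numHaRecoveries = store.get(jobId + ".numHaRecoveries");
            if (recoveredFromHa) {
                numHaRecoveries++; // suggestion 1: count the HA recovery itself
                numRestarts++;     // suggestion 2: an HA recovery also counts as a restart
                persist();
            }
        }

        /** Called when the job restarts while the same JobManager stays leader. */
        void onJobRestart() {
            numRestarts++;
            persist();
        }

        long getNumRestarts() {
            return numRestarts;
        }

        long getNumHaRecoveries() {
            return numHaRecoveries;
        }

        private void persist() {
            store.put(jobId + ".numRestarts", numRestarts);
            store.put(jobId + ".numHaRecoveries", numHaRecoveries);
        }
    }

    public static void main(String[] args) {
        HaStore store = new HaStore();

        // First leader: a fresh job that restarts twice because of task failures.
        JobCounters firstLeader = new JobCounters(store, "job-1", false);
        firstLeader.onJobRestart();
        firstLeader.onJobRestart();

        // The JobManager crashes; a new leader recovers the job from HA.
        JobCounters secondLeader = new JobCounters(store, "job-1", true);

        // Prints: numRestarts=3 numHaRecoveries=1
        System.out.println("numRestarts=" + secondLeader.getNumRestarts()
                + " numHaRecoveries=" + secondLeader.getNumHaRecoveries());
    }
}
```

With this behaviour a dashboard watching `numRestarts` would keep a monotonically increasing value across JobManager failovers, while the separate counter would show how many of those failovers came from an HA recovery.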
--
This message was sent by Atlassian Jira
(v8.20.10#820010)