Fei Feng created FLINK-31482:
--------------------------------

             Summary: support count jobmanager-failed failover times
                 Key: FLINK-31482
                 URL: https://issues.apache.org/jira/browse/FLINK-31482
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Coordination, Runtime / Metrics
    Affects Versions: 1.16.1
            Reporter: Fei Feng

We have a metric `numRestarts` that indicates how many times a job has failed over, but we have no metric that indicates how many times the job was recovered through HA (high availability).

This causes two problems:
1. When a JobManager process crashes, the metric system gives us no way of knowing that the JobManager crashed and that the job was recovered.
2. When a new JobManager becomes leader, `numRestarts` starts again from zero, which sometimes misleads our users. Most users consider a failover the same whether it was caused by a JobManager failure or by a job failure; the effect, at least, is the same.

I suggest that we:
1. Add a new metric that indicates how many times the job was recovered from HA.
2. Make `numRestarts` also count the recoveries from HA.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
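
To make the two suggestions concrete, here is a minimal toy model of the counting behaviour. It is only a sketch, not Flink code: `HaStore`, `JobManagerModel`, `ha_recoveries`, and `total_restarts` are hypothetical names, and keeping the counters in state that survives a JobManager crash is an assumption about one possible approach, not something the issue specifies.

```python
from dataclasses import dataclass


@dataclass
class HaStore:
    """Hypothetical stand-in for state that survives a JobManager crash."""
    ha_recoveries: int = 0   # suggestion 1: recoveries from HA
    total_restarts: int = 0  # suggestion 2: restarts including HA recoveries


class JobManagerModel:
    """Toy model of one JobManager process; not Flink's actual classes."""

    def __init__(self, store: HaStore, recovered_from_ha: bool = False):
        self.store = store
        # Behaviour described in the issue: the in-memory counter starts
        # from zero whenever a new JobManager becomes leader.
        self.num_restarts = 0
        if recovered_from_ha:
            store.ha_recoveries += 1
            store.total_restarts += 1

    def on_job_failure(self) -> None:
        """Record an ordinary job failover handled by this JobManager."""
        self.num_restarts += 1
        self.store.total_restarts += 1


def demo():
    store = HaStore()
    jm1 = JobManagerModel(store)
    jm1.on_job_failure()
    jm1.on_job_failure()
    # jm1 crashes; a new leader recovers the job from HA.
    jm2 = JobManagerModel(store, recovered_from_ha=True)
    jm2.on_job_failure()
    return jm2.num_restarts, store.ha_recoveries, store.total_restarts


print(demo())  # (1, 1, 4)
```

In this model the per-process counter reports 1 after the new leader takes over, hiding the two earlier failovers and the crash itself, while the persisted counters report one HA recovery and four restarts in total.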