[jira] [Created] (FLINK-31482) support count jobmanager-failed failover times

Fei Feng (Jira) Wed, 15 Mar 2023 22:38:30 -0700

Fei Feng created FLINK-31482:
--------------------------------

             Summary: support count jobmanager-failed failover times
                 Key: FLINK-31482
                 URL: https://issues.apache.org/jira/browse/FLINK-31482
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Coordination, Runtime / Metrics
    Affects Versions: 1.16.1
            Reporter: Fei Feng



We have a metric, `numRestarts`, that indicates how many times a job has 
failed over, but we have no metric that indicates that a job was recovered 
from HA (high availability).

There are two problems:

1. When a JobManager process crashes, the metric system gives us no way of 
knowing that the JobManager crashed and that the job was recovered.

2. When a new JobManager becomes leader, `numRestarts` starts again from 
zero, which sometimes misleads our users. Most users consider a failover the 
same whether it was caused by a JM failure or by a job failure; the effect, 
at least, is the same.
 
I suggest that we:
1. add a new metric that indicates how many times the job was recovered from HA
2. make `numRestarts` also count the recoveries from HA
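
To make the two suggestions concrete, here is a minimal sketch of the intended counting semantics. It is a standalone model, not Flink code: the class `FailoverCounters`, the metric name `numHaRecoveries`, and the idea that the previous leader's values are handed to the new leader are all assumptions made for illustration. How the values would actually be persisted (for example in the HA store) is left open.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model of the proposed semantics; none of these names exist in Flink.
final class FailoverCounters {
    // Restarts of the job, including recoveries from HA (suggestion 2).
    private final AtomicLong numRestarts;
    // Recoveries from HA only (suggestion 1); the metric name is an assumption.
    private final AtomicLong numHaRecoveries;

    // A freshly submitted job starts both counters at zero.
    FailoverCounters() {
        this(0L, 0L);
    }

    // A new JobManager leader continues from the values the previous leader
    // left behind, instead of starting again from zero.
    FailoverCounters(long persistedRestarts, long persistedHaRecoveries) {
        this.numRestarts = new AtomicLong(persistedRestarts);
        this.numHaRecoveries = new AtomicLong(persistedHaRecoveries);
    }

    // Called when the job restarts because of a job failure.
    void onJobRestart() {
        numRestarts.incrementAndGet();
    }

    // Called when a new leader recovers the job after a JobManager failure.
    // Both counters advance, so users see one more failover either way.
    void onRecoveredFromHa() {
        numHaRecoveries.incrementAndGet();
        numRestarts.incrementAndGet();
    }

    long getNumRestarts() {
        return numRestarts.get();
    }

    long getNumHaRecoveries() {
        return numHaRecoveries.get();
    }
}
```

In this model, a job that restarts twice and is then recovered by a new leader would report `numRestarts` = 3 and `numHaRecoveries` = 1, provided the new leader is constructed from the previous leader's values.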
 
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
