[ https://issues.apache.org/jira/browse/FLINK-10907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709116#comment-16709116 ]
Mark Cho commented on FLINK-10907:
----------------------------------

[~Zentol] I think we can close this one out as it looks like JobManager code is being actively removed in Flink 1.7+.

> Job recovery on the same JobManager causes JobManager metrics to report stale values
> ------------------------------------------------------------------------------------
>
>                 Key: FLINK-10907
>                 URL: https://issues.apache.org/jira/browse/FLINK-10907
>             Project: Flink
>          Issue Type: Bug
>          Components: Core, Metrics
>    Affects Versions: 1.4.2
>         Environment: Verified the bug and the fix running on Flink 1.4
> Based on the JobManagerMetricGroup.java code in master, this issue should still occur on Flink versions after 1.4.
>            Reporter: Mark Cho
>            Priority: Minor
>              Labels: pull-request-available
>
> https://github.com/apache/flink/pull/7119
> * JobManager loses and regains leadership if it loses connection and reconnects to ZooKeeper.
> * When it regains the leadership, it tries to recover the job graph.
> * During the recovery, it will try to reuse the existing {{JobManagerMetricGroup}} to register new counters and gauges under the same metric name, which causes the new counters and gauges to be registered incorrectly.
> * The old counters and gauges will continue to report the stale values and the new counters and gauges will not report the latest metric.
> Relevant lines from logs
> {code:java}
> com.---.JobManager - Submitting recovered job e9e49fd9b8c61cf54b435f39aa49923f.
> com.---.JobManager - Submitting job e9e49fd9b8c61cf54b435f39aa49923f (flink-job) (Recovery).
> com.---.JobManager - Running initialization on master for job flink-job (e9e49fd9b8c61cf54b435f39aa49923f).
> com.---.JobManager - Successfully ran initialization on master in 0 ms.
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'totalNumberOfCheckpoints'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'numberOfInProgressCheckpoints'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'numberOfCompletedCheckpoints'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'numberOfFailedCheckpoints'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'lastCheckpointRestoreTimestamp'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'lastCheckpointSize'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'lastCheckpointDuration'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'lastCheckpointAlignmentBuffered'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'lastCheckpointExternalPath'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'restartingTime'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'downtime'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'uptime'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'fullRestarts'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'task_failures'. Metric will not be reported.[]
> {code}
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
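
For readers unfamiliar with the failure mode described in the issue, the following is a minimal sketch of how reusing a metric group across a job recovery can leave stale values in place. It does not use Flink's classes: {{SimpleMetricGroup}}, {{CheckpointStats}}, and {{reportedAfterRecovery}} are hypothetical stand-ins, and the only behavior assumed is the one visible in the log above, namely that a second registration under an existing name is rejected and the first metric stays registered.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

public class StaleMetricSketch {

    /** Hypothetical stand-in for a metric group; NOT Flink's JobManagerMetricGroup. */
    static class SimpleMetricGroup {
        private final Map<String, Supplier<Long>> gauges = new HashMap<>();

        /** Registers a gauge. On a name collision the existing gauge is kept (assumed behavior). */
        boolean gauge(String name, Supplier<Long> gauge) {
            if (gauges.containsKey(name)) {
                System.out.println("Name collision: Group already contains a Metric with the name '"
                        + name + "'. Metric will not be reported.");
                return false;
            }
            gauges.put(name, gauge);
            return true;
        }

        /** Returns the value of whichever gauge is registered under the name. */
        long report(String name) {
            return gauges.get(name).get();
        }
    }

    /** Hypothetical per-execution statistics that a gauge reads from. */
    static class CheckpointStats {
        long completed;
    }

    /** Simulates one recovery on the same group and returns the value that gets reported. */
    static long reportedAfterRecovery() {
        SimpleMetricGroup group = new SimpleMetricGroup();

        // First submission: the gauge reads from the original stats object.
        CheckpointStats before = new CheckpointStats();
        group.gauge("numberOfCompletedCheckpoints", () -> before.completed);
        before.completed = 5;

        // Recovery: a new stats object is created, but the same group is reused,
        // so the second registration collides and is dropped.
        CheckpointStats after = new CheckpointStats();
        group.gauge("numberOfCompletedCheckpoints", () -> after.completed);
        after.completed = 2;

        // The group still reads from the old stats object.
        return group.report("numberOfCompletedCheckpoints");
    }

    public static void main(String[] args) {
        System.out.println("reported value: " + reportedAfterRecovery());
    }
}
```

In this sketch the reported value stays at the pre-recovery count because the gauge bound to the old stats object is never replaced. Creating a fresh group for the recovered job, or removing the old metrics before registering again, would avoid the collision; which of these the linked pull request does is not stated in this message.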