[
https://issues.apache.org/jira/browse/FLINK-10907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16689413#comment-16689413
]
ASF GitHub Bot commented on FLINK-10907:
----------------------------------------
zentol commented on issue #7119: [FLINK-10907] Fix Flink JobManager metrics
from getting stuck after a job recovery.
URL: https://github.com/apache/flink/pull/7119#issuecomment-439391428
This is not a problem in Flink 1.5 and above; see the JIRA issue for more details.
> Job recovery on the same JobManager causes JobManager metrics to report stale
> values
> ------------------------------------------------------------------------------------
>
> Key: FLINK-10907
> URL: https://issues.apache.org/jira/browse/FLINK-10907
> Project: Flink
> Issue Type: Bug
> Components: Core, Metrics
> Affects Versions: 1.4.2
> Environment: Verified the bug and the fix on Flink 1.4.
> Based on the JobManagerMetricGroup.java code in master, this issue should
> still occur on Flink versions after 1.4.
> Reporter: Mark Cho
> Priority: Minor
> Labels: pull-request-available
>
> https://github.com/apache/flink/pull/7119
> * The JobManager loses and then regains leadership when it loses its
> connection to ZooKeeper and reconnects.
> * When it regains leadership, it tries to recover the job graph.
> * During recovery, it reuses the existing {{JobManagerMetricGroup}} to
> register new counters and gauges under the same metric names. Because of
> the name collisions, the new counters and gauges are not registered.
> * As a result, the old counters and gauges continue to report stale values,
> and the new counters and gauges never report the latest values.
> Relevant lines from the logs:
> {code:java}
> com.---.JobManager - Submitting recovered job
> e9e49fd9b8c61cf54b435f39aa49923f.
> com.---.JobManager - Submitting job e9e49fd9b8c61cf54b435f39aa49923f
> (flink-job) (Recovery).
> com.---.JobManager - Running initialization on master for job flink-job
> (e9e49fd9b8c61cf54b435f39aa49923f).
> com.---.JobManager - Successfully ran initialization on master in 0 ms.
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains
> a Metric with the name 'totalNumberOfCheckpoints'. Metric will not be
> reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains
> a Metric with the name 'numberOfInProgressCheckpoints'. Metric will not be
> reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains
> a Metric with the name 'numberOfCompletedCheckpoints'. Metric will not be
> reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains
> a Metric with the name 'numberOfFailedCheckpoints'. Metric will not be
> reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains
> a Metric with the name 'lastCheckpointRestoreTimestamp'. Metric will not be
> reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains
> a Metric with the name 'lastCheckpointSize'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains
> a Metric with the name 'lastCheckpointDuration'. Metric will not be
> reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains
> a Metric with the name 'lastCheckpointAlignmentBuffered'. Metric will not be
> reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains
> a Metric with the name 'lastCheckpointExternalPath'. Metric will not be
> reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains
> a Metric with the name 'restartingTime'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains
> a Metric with the name 'downtime'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains
> a Metric with the name 'uptime'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains
> a Metric with the name 'fullRestarts'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains
> a Metric with the name 'task_failures'. Metric will not be reported.[]
> {code}
>
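For illustration, the name-collision behaviour described in the issue can be reproduced with a small stand-alone sketch. This is not Flink code: SimpleMetricGroup, register, clear, and report are hypothetical names invented for this example, and the clear-before-reregister step is only one possible remedy, not necessarily what the pull request does. The sketch assumes only what the log shows, namely that a second registration under an existing name is dropped.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

class MetricCollisionSketch {

    /** Hypothetical stand-in for a metric group; NOT Flink's actual class. */
    static class SimpleMetricGroup {
        private final Map<String, Supplier<Long>> gauges = new HashMap<>();

        /** Registers a gauge. On a name collision the new gauge is dropped. */
        boolean register(String name, Supplier<Long> gauge) {
            if (gauges.containsKey(name)) {
                System.out.println("Name collision: Group already contains a Metric"
                        + " with the name '" + name + "'. Metric will not be reported.");
                return false;
            }
            gauges.put(name, gauge);
            return true;
        }

        /** Drops all registered metrics so that their names can be reused. */
        void clear() {
            gauges.clear();
        }

        /** Returns the value a reporter would see for the given metric name. */
        long report(String name) {
            return gauges.get(name).get();
        }
    }

    public static void main(String[] args) {
        SimpleMetricGroup group = new SimpleMetricGroup();

        // Counter owned by the job state before leadership was lost.
        AtomicLong beforeRecovery = new AtomicLong(5);
        group.register("totalNumberOfCheckpoints", beforeRecovery::get);

        // After recovery a fresh counter is created, but the group is reused,
        // so this registration collides and is dropped.
        AtomicLong afterRecovery = new AtomicLong(0);
        group.register("totalNumberOfCheckpoints", afterRecovery::get);
        afterRecovery.set(3);

        // The reporter still sees the old counter: prints 5 (stale).
        System.out.println(group.report("totalNumberOfCheckpoints"));

        // One possible remedy (an assumption): clear the group before re-registering.
        group.clear();
        group.register("totalNumberOfCheckpoints", afterRecovery::get);

        // Now the reporter sees the live counter: prints 3.
        System.out.println(group.report("totalNumberOfCheckpoints"));
    }
}
```

The stale value persists because the reporter keeps reading the gauge that was registered first. The recovered job updates a different object that the group never accepted.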