[ https://issues.apache.org/jira/browse/FLINK-10907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16689414#comment-16689414 ]
ASF GitHub Bot commented on FLINK-10907:
----------------------------------------
zentol closed pull request #7119: [FLINK-10907] Fix Flink JobManager metrics
from getting stuck after a job recovery.
URL: https://github.com/apache/flink/pull/7119
This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:
diff --git a/flink-runtime/src/main/java/org/apache/flink/runtime/metrics/groups/JobManagerMetricGroup.java b/flink-runtime/src/main/java/org/apache/flink/runtime/metrics/groups/JobManagerMetricGroup.java
index e09051d7160..f67b49d6745 100644
--- a/flink-runtime/src/main/java/org/apache/flink/runtime/metrics/groups/JobManagerMetricGroup.java
+++ b/flink-runtime/src/main/java/org/apache/flink/runtime/metrics/groups/JobManagerMetricGroup.java
@@ -61,16 +61,17 @@ public String hostname() {
     public JobManagerJobMetricGroup addJob(JobGraph job) {
         JobID jobId = job.getJobID();
         String jobName = job.getName();
-        // get or create a jobs metric group
-        JobManagerJobMetricGroup currentJobGroup;
         synchronized (this) {
             if (!isClosed()) {
-                currentJobGroup = jobs.get(jobId);
+                JobManagerJobMetricGroup currentJobGroup = jobs.get(jobId);
-                if (currentJobGroup == null || currentJobGroup.isClosed()) {
-                    currentJobGroup = new JobManagerJobMetricGroup(registry, this, jobId, jobName);
-                    jobs.put(jobId, currentJobGroup);
+                if (currentJobGroup != null) {
+                    currentJobGroup.close();
                 }
+
+                currentJobGroup = new JobManagerJobMetricGroup(registry, this, jobId, jobName);
+                jobs.put(jobId, currentJobGroup);
+
                 return currentJobGroup;
             } else {
                 return null;
diff --git a/flink-runtime/src/test/java/org/apache/flink/runtime/metrics/groups/JobManagerGroupTest.java b/flink-runtime/src/test/java/org/apache/flink/runtime/metrics/groups/JobManagerGroupTest.java
index cb5ec67c97c..146fb3b1f45 100644
--- a/flink-runtime/src/test/java/org/apache/flink/runtime/metrics/groups/JobManagerGroupTest.java
+++ b/flink-runtime/src/test/java/org/apache/flink/runtime/metrics/groups/JobManagerGroupTest.java
@@ -32,6 +32,7 @@
 import static org.junit.Assert.assertArrayEquals;
 import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertNotEquals;
 import static org.junit.Assert.assertTrue;
 /**
@@ -58,13 +59,15 @@ public void addAndRemoveJobs() throws Exception {
         JobManagerJobMetricGroup jmJobGroup12 = group.addJob(new JobGraph(jid1, jobName1));
         JobManagerJobMetricGroup jmJobGroup21 = group.addJob(new JobGraph(jid2, jobName2));
-        assertEquals(jmJobGroup11, jmJobGroup12);
+        assertNotEquals(jmJobGroup11, jmJobGroup12);
+        assertTrue(jmJobGroup11.isClosed());
+        assertTrue(!jmJobGroup12.isClosed());
         assertEquals(2, group.numRegisteredJobMetricGroups());
         group.removeJob(jid1);
-        assertTrue(jmJobGroup11.isClosed());
+        assertTrue(jmJobGroup12.isClosed());
         assertEquals(1, group.numRegisteredJobMetricGroups());
         group.removeJob(jid2);
----------------------------------------------------------------
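The close-and-replace behavior of the patched addJob, together with the
assertions in the updated test, can be sketched in isolation. This is a
minimal, self-contained sketch and not the Flink implementation:
AddJobSketch, ParentGroup, and JobGroup are hypothetical stand-ins, job IDs
are plain strings, and the parent isClosed() check of the real method is
left out.

```java
import java.util.HashMap;
import java.util.Map;

class AddJobSketch {

    /** Hypothetical stand-in for a per-job metric group. */
    static final class JobGroup {
        private boolean closed;

        void close() {
            closed = true;
        }

        boolean isClosed() {
            return closed;
        }
    }

    /** Hypothetical stand-in for the parent group that tracks one group per job ID. */
    static final class ParentGroup {
        private final Map<String, JobGroup> jobs = new HashMap<>();

        /** Always closes a previously registered group and registers a fresh one. */
        synchronized JobGroup addJob(String jobId) {
            JobGroup previous = jobs.get(jobId);
            if (previous != null) {
                previous.close();
            }
            JobGroup fresh = new JobGroup();
            jobs.put(jobId, fresh);
            return fresh;
        }

        /** Removes and closes the group currently registered for the job, if any. */
        synchronized void removeJob(String jobId) {
            JobGroup removed = jobs.remove(jobId);
            if (removed != null) {
                removed.close();
            }
        }

        synchronized int numRegisteredJobGroups() {
            return jobs.size();
        }
    }

    public static void main(String[] args) {
        ParentGroup parent = new ParentGroup();
        JobGroup first = parent.addJob("jid1");
        JobGroup second = parent.addJob("jid1"); // e.g. the same job after a recovery

        System.out.println(first != second);   // true: a new group is created
        System.out.println(first.isClosed());  // true: the old group was closed
        System.out.println(second.isClosed()); // false: the new group is live
    }
}
```

Under the previous behavior, the second addJob call would have returned the
same, still-open group, which is what the old assertEquals in the test
checked.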
> Job recovery on the same JobManager causes JobManager metrics to report stale
> values
> ------------------------------------------------------------------------------------
>
> Key: FLINK-10907
> URL: https://issues.apache.org/jira/browse/FLINK-10907
> Project: Flink
> Issue Type: Bug
> Components: Core, Metrics
> Affects Versions: 1.4.2
> Environment: Verified the bug and the fix running on Flink 1.4
> Based on the JobManagerMetricGroup.java code in master, this issue should
> still occur on Flink versions after 1.4.
> Reporter: Mark Cho
> Priority: Minor
> Labels: pull-request-available
>
> https://github.com/apache/flink/pull/7119
> * The JobManager loses and then regains leadership when it loses its
> connection to ZooKeeper and reconnects.
> * When it regains leadership, it tries to recover the job graph.
> * During recovery, it reuses the existing {{JobManagerMetricGroup}} to
> register new counters and gauges under the same metric names. These
> registrations collide with the names already in the group, so the new
> counters and gauges are not registered.
> * The old counters and gauges continue to report stale values, and the
> new counters and gauges are never reported.
> Relevant lines from logs
> {code:java}
> com.---.JobManager - Submitting recovered job e9e49fd9b8c61cf54b435f39aa49923f.
> com.---.JobManager - Submitting job e9e49fd9b8c61cf54b435f39aa49923f (flink-job) (Recovery).
> com.---.JobManager - Running initialization on master for job flink-job (e9e49fd9b8c61cf54b435f39aa49923f).
> com.---.JobManager - Successfully ran initialization on master in 0 ms.
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'totalNumberOfCheckpoints'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'numberOfInProgressCheckpoints'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'numberOfCompletedCheckpoints'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'numberOfFailedCheckpoints'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'lastCheckpointRestoreTimestamp'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'lastCheckpointSize'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'lastCheckpointDuration'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'lastCheckpointAlignmentBuffered'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'lastCheckpointExternalPath'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'restartingTime'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'downtime'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'uptime'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'fullRestarts'. Metric will not be reported.[]
> org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'task_failures'. Metric will not be reported.[]
> {code}
>
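The stale-value symptom described in the issue can be reproduced with a toy
model. The following is a rough sketch under stated assumptions, not Flink
code: NameCollisionSketch and CollidingGroup are hypothetical names, and the
only behavior modeled is the one the log lines suggest, namely that a group
keeps the first metric registered under a name and rejects later ones.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

class NameCollisionSketch {

    /** Hypothetical group that keeps the first gauge per name and rejects later ones. */
    static final class CollidingGroup {
        private final Map<String, Supplier<Long>> gauges = new HashMap<>();

        /** Returns false on a name collision; the new gauge is then never reported. */
        boolean gauge(String name, Supplier<Long> gauge) {
            if (gauges.containsKey(name)) {
                return false;
            }
            gauges.put(name, gauge);
            return true;
        }

        /** Reads the value of the gauge that is actually registered under the name. */
        long report(String name) {
            return gauges.get(name).get();
        }
    }

    public static void main(String[] args) {
        CollidingGroup group = new CollidingGroup();

        // Counter owned by the job before the leadership loss.
        AtomicLong beforeRecovery = new AtomicLong(7);
        group.gauge("totalNumberOfCheckpoints", beforeRecovery::get);

        // After recovery, a new counter is registered on the reused group.
        AtomicLong afterRecovery = new AtomicLong(0);
        boolean accepted = group.gauge("totalNumberOfCheckpoints", afterRecovery::get);
        afterRecovery.set(3); // the recovered job makes progress

        System.out.println(accepted);                                 // false
        System.out.println(group.report("totalNumberOfCheckpoints")); // 7 (stale)
    }
}
```

Registering the post-recovery gauge on a fresh group instead avoids the
collision, which is the effect of closing and replacing the job's metric
group in the pull request.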