[jira] [Created] (FLINK-19901) Unable to exclude metrics variables for the last metrics reporter.
Truong Duc Kien created FLINK-19901: --- Summary: Unable to exclude metrics variables for the last metrics reporter. Key: FLINK-19901 URL: https://issues.apache.org/jira/browse/FLINK-19901 Project: Flink Issue Type: Bug Components: Runtime / Metrics Affects Versions: 1.11.2 Reporter: Truong Duc Kien

We discovered a bug that causes the {{scope.variables.excludes}} setting to be ignored for the very last metric reporter. Because {{reporterIndex}} is incremented before the bounds check, the index for the last metric reporter overflows back to 0, the slot that stores the unfiltered general variables. Interestingly, the bug does not trigger when there is only one metric reporter, because in that case slot 0 is overwritten with that reporter's variables instead of being used to store all variables.

{code:java}
public abstract class AbstractMetricGroup<A extends AbstractMetricGroup<?>> implements MetricGroup {
    ...
    public Map<String, String> getAllVariables(int reporterIndex, Set<String> excludedVariables) {
        // offset cache location to account for general cache at position 0
        reporterIndex += 1;
        if (reporterIndex < 0 || reporterIndex >= logicalScopeStrings.length) {
            reporterIndex = 0;
        }

        // if no variables are excluded (which is the default!) we re-use the general variables map to save space
        return internalGetAllVariables(excludedVariables.isEmpty() ? 0 : reporterIndex, excludedVariables);
    }
    ...
{code}

[Github link to the above code|https://github.com/apache/flink/blob/3bf5786655c3bb914ce02ebb0e4a1863b205b829/flink-runtime/src/main/java/org/apache/flink/runtime/metrics/groups/AbstractMetricGroup.java#L122]

-- This message was sent by Atlassian Jira (v8.3.4#803005)
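To make the off-by-one easier to see, here is a standalone sketch that mimics the indexing described above. The names and the array sizing are assumptions for illustration only, not Flink's actual fields: with N reporters and a cache array of N + 1 slots, comparing the already-offset index against the un-offset length sends the last reporter back to slot 0.

{code:java}
public class OffByOneDemo {
    public static void main(String[] args) {
        int reporterCount = 3;
        // Stands in for logicalScopeStrings: one entry per reporter.
        String[] logicalScopeStrings = new String[reporterCount];

        for (int reporterIndex = 0; reporterIndex < reporterCount; reporterIndex++) {
            int cacheIndex = reporterIndex + 1; // offset for the general cache at position 0

            // Buggy check: the offset index is compared against the un-offset length,
            // so the last reporter (cacheIndex == reporterCount) falls back to slot 0.
            int buggySlot = (cacheIndex < 0 || cacheIndex >= logicalScopeStrings.length) ? 0 : cacheIndex;

            // One possible correction (assuming the cache really has reporterCount + 1 slots):
            // account for the extra general slot in the bound.
            int fixedSlot = (cacheIndex < 0 || cacheIndex >= logicalScopeStrings.length + 1) ? 0 : cacheIndex;

            System.out.printf("reporter %d -> buggy slot %d, fixed slot %d%n", reporterIndex, buggySlot, fixedSlot);
        }
        // Prints: reporter 2 lands in slot 0 with the buggy check, slot 3 with the corrected bound.
    }
}
{code}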
[jira] [Created] (FLINK-19519) Add support for port range in taskmanager.data.port
Truong Duc Kien created FLINK-19519: --- Summary: Add support for port range in taskmanager.data.port Key: FLINK-19519 URL: https://issues.apache.org/jira/browse/FLINK-19519 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.11.2 Reporter: Truong Duc Kien

Flink should add support for port ranges in {{taskmanager.data.port}}. This would be very helpful when running in restrictive network environments, and port ranges are already supported by many other port settings.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
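For illustration, this is roughly what the configuration could look like. The {{taskmanager.rpc.port}} entry shows the range syntax that existing options already accept; the {{taskmanager.data.port}} range below is the requested behaviour and is not accepted today.

{noformat}
# Already supported: a port range for the TaskManager RPC endpoint
taskmanager.rpc.port: 50100-50200

# Requested: the same syntax for the data port
taskmanager.data.port: 50300-50400
{noformat}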
[jira] [Created] (FLINK-18934) Idle stream does not advance watermark in connected stream
Truong Duc Kien created FLINK-18934: --- Summary: Idle stream does not advance watermark in connected stream Key: FLINK-18934 URL: https://issues.apache.org/jira/browse/FLINK-18934 Project: Flink Issue Type: Bug Components: API / DataStream Affects Versions: 1.11.1 Reporter: Truong Duc Kien

Per the Flink documentation, when a stream is marked idle, it should allow the watermarks of downstream operators to advance. However, when I connect an active data stream with an idle data stream, the output watermark of the CoProcessOperator does not increase. Here's a small test that reproduces the problem: https://github.com/kien-truong/flink-idleness-testing

-- This message was sent by Atlassian Jira (v8.3.4#803005)
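For reference, a minimal sketch along the lines of the linked reproducer (the actual test may differ; the sources and names here are assumptions): one source keeps emitting records while the other stays alive but silent, so {{withIdleness}} should mark it idle and let the combined watermark follow the active side.

{code:java}
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.util.Collector;

public class IdleConnectedStreamSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        WatermarkStrategy<Long> strategy = WatermarkStrategy
                .<Long>forMonotonousTimestamps()
                .withTimestampAssigner((element, recordTimestamp) -> element)
                .withIdleness(Duration.ofSeconds(5));

        // Active source: keeps emitting records with increasing event times.
        DataStream<Long> active = env.addSource(new SourceFunction<Long>() {
            private volatile boolean running = true;

            @Override
            public void run(SourceContext<Long> ctx) throws Exception {
                long ts = 0;
                while (running) {
                    ctx.collect(ts++);
                    Thread.sleep(100);
                }
            }

            @Override
            public void cancel() {
                running = false;
            }
        }).assignTimestampsAndWatermarks(strategy);

        // Idle source: stays alive but never emits, so it should be marked idle.
        DataStream<Long> idle = env.addSource(new SourceFunction<Long>() {
            private volatile boolean running = true;

            @Override
            public void run(SourceContext<Long> ctx) throws Exception {
                while (running) {
                    Thread.sleep(100);
                }
            }

            @Override
            public void cancel() {
                running = false;
            }
        }).assignTimestampsAndWatermarks(strategy);

        active.connect(idle)
                .process(new CoProcessFunction<Long, Long, Long>() {
                    @Override
                    public void processElement1(Long value, Context ctx, Collector<Long> out) {
                        // Expected: the watermark follows the active input once the other input is idle.
                        // Observed (this issue): it stays at Long.MIN_VALUE.
                        out.collect(ctx.timerService().currentWatermark());
                    }

                    @Override
                    public void processElement2(Long value, Context ctx, Collector<Long> out) {
                    }
                })
                .print();

        env.execute("idle-connected-stream-sketch");
    }
}
{code}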
[jira] [Created] (FLINK-18740) Support rate limiter in the universal kafka connector
Truong Duc Kien created FLINK-18740: --- Summary: Support rate limiter in the universal kafka connector Key: FLINK-18740 URL: https://issues.apache.org/jira/browse/FLINK-18740 Project: Flink Issue Type: Improvement Components: Connectors / Kafka Affects Versions: 1.11.1 Reporter: Truong Duc Kien

Currently, a rate limiter is only available for the Kafka 0.10 connector, but not for the universal connector.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-18554) Memory exceeds taskmanager.memory.process.size when enabling mmap_read for RocksDB
Truong Duc Kien created FLINK-18554: --- Summary: Memory exceeds taskmanager.memory.process.size when enabling mmap_read for RocksDB Key: FLINK-18554 URL: https://issues.apache.org/jira/browse/FLINK-18554 Project: Flink Issue Type: Bug Components: Runtime / Configuration Affects Versions: 1.11.0 Reporter: Truong Duc Kien

We are testing Flink's automatic memory management feature on Flink 1.11. However, YARN keeps killing our containers because the processes' physical memory exceeds the limit, even though we have tuned the following configurations:

{code:java}
taskmanager.memory.process.size
taskmanager.memory.managed.fraction
{code}

We suspect this is because we have enabled mmap_read for RocksDB, since turning this option off seems to fix the issue. Maybe Flink's automatic memory management is unable to account for the additional memory required when using mmap_read?

-- This message was sent by Atlassian Jira (v8.3.4#803005)
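For context, mmap reads were enabled through a custom options factory roughly like the sketch below (assumed setup; the interface shown is the {{RocksDBOptionsFactory}} of Flink 1.11 and the class name is ours). With mmap reads, the mapped SST pages count towards the container's resident memory but sit outside Flink's managed memory budget, which would explain the overshoot.

{code:java}
import java.util.Collection;

import org.apache.flink.contrib.streaming.state.RocksDBOptionsFactory;

import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;

/** Sketch of an options factory that turns on memory-mapped reads. */
public class MmapReadOptionsFactory implements RocksDBOptionsFactory {

    @Override
    public DBOptions createDBOptions(DBOptions currentOptions, Collection<AutoCloseable> handlesToClose) {
        // Enables mmap_read: the mapped pages are not part of the managed-memory accounting.
        return currentOptions.setAllowMmapReads(true);
    }

    @Override
    public ColumnFamilyOptions createColumnOptions(ColumnFamilyOptions currentOptions, Collection<AutoCloseable> handlesToClose) {
        return currentOptions;
    }
}
{code}

The factory is then registered via {{state.backend.rocksdb.options-factory}} in the cluster configuration.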
[jira] [Created] (FLINK-18527) Task Managers fail to start on HDP 2.6.5 due to commons-cli conflict
Truong Duc Kien created FLINK-18527: --- Summary: Task Managers fail to start on HDP 2.6.5 due to commons-cli conflict Key: FLINK-18527 URL: https://issues.apache.org/jira/browse/FLINK-18527 Project: Flink Issue Type: Bug Components: Runtime / Task Affects Versions: 1.11.0 Reporter: Truong Duc Kien

When launching a new job on HDP 2.6.5, we encounter these exceptions:

{code:java}
2020-07-08 16:10:36 E [default-dispatcher-4] [ o.a.f.y.YarnResourceManager] Could not start TaskManager in container container_xxx
java.lang.NoSuchMethodError: org.apache.commons.cli.Option.builder(Ljava/lang/String;)Lorg/apache/commons/cli/Option$Builder;
    at org.apache.flink.runtime.entrypoint.parser.CommandLineOptions.<clinit>(CommandLineOptions.java:28) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
2020-07-08 16:12:46 E [default-dispatcher-4] [ o.a.f.y.YarnResourceManager] Could not start TaskManager in container container_xxx {} []
java.lang.NoClassDefFoundError: Could not initialize class org.apache.flink.runtime.entrypoint.parser.CommandLineOptions
{code}

We believe this is because HDP 2.6.5 puts commons-cli version 1.2 on the class path, while Flink expects version 1.3. Maybe commons-cli should also be shaded to avoid such issues.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-11283) Accessing the key when processing connected keyed stream
Truong Duc Kien created FLINK-11283: --- Summary: Accessing the key when processing connected keyed stream Key: FLINK-11283 URL: https://issues.apache.org/jira/browse/FLINK-11283 Project: Flink Issue Type: Improvement Components: DataStream API Reporter: Truong Duc Kien

Currently, we can access the key when using {{KeyedProcessFunction}}. Similar functionality would be very useful when processing connected keyed streams.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
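To illustrate the gap: {{KeyedProcessFunction}} exposes the current key on its {{Context}}, but when two keyed streams are connected and processed with a plain {{CoProcessFunction}} there is no equivalent accessor. The sketch below shows the existing API next to the missing one (the commented-out call is hypothetical).

{code:java}
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.util.Collector;

// Existing: the current key is reachable from the Context.
class KeyAwareFunction extends KeyedProcessFunction<String, Long, String> {
    @Override
    public void processElement(Long value, Context ctx, Collector<String> out) {
        String key = ctx.getCurrentKey();   // available today
        out.collect(key + ": " + value);
    }
}

// Requested: the same access while processing a connected keyed stream. With a plain
// CoProcessFunction the key is not reachable from the Context.
class CoFunctionWithoutKey extends CoProcessFunction<Long, Long, String> {
    @Override
    public void processElement1(Long value, Context ctx, Collector<String> out) {
        // ctx.getCurrentKey();             // hypothetical: no such method on CoProcessFunction.Context
        out.collect("left: " + value);
    }

    @Override
    public void processElement2(Long value, Context ctx, Collector<String> out) {
        out.collect("right: " + value);
    }
}
{code}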
[jira] [Created] (FLINK-11148) Rescaling operator regression in Flink 1.7
Truong Duc Kien created FLINK-11148: --- Summary: Rescaling operator regression in Flink 1.7 Key: FLINK-11148 URL: https://issues.apache.org/jira/browse/FLINK-11148 Project: Flink Issue Type: Bug Affects Versions: 1.7.0 Reporter: Truong Duc Kien

We have a job using 20 TaskManagers with 3 slots each. Using Flink 1.4, when we rescale a data stream from 60 to 20, each TaskManager has only one downstream slot, which receives data from the 3 upstream slots in the same TaskManager. Using Flink 1.7, this behaviour no longer holds: multiple downstream slots are assigned to the same TaskManager. This change is causing an imbalance in our TaskManager load.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-10857) Conflict between JMX and Prometheus Metrics reporter
Truong Duc Kien created FLINK-10857: --- Summary: Conflict between JMX and Prometheus Metrics reporter Key: FLINK-10857 URL: https://issues.apache.org/jira/browse/FLINK-10857 Project: Flink Issue Type: Bug Components: Metrics Affects Versions: 1.6.2 Reporter: Truong Duc Kien

When registering both the JMX and the Prometheus metrics reporter, the Prometheus reporter fails with an exception.

{code:java}
o.a.f.r.m.MetricRegistryImpl Error while registering metric.
java.lang.IllegalArgumentException: Invalid metric name: flink_jobmanager.Status.JVM.Memory.Mapped_Count
    at org.apache.flink.shaded.io.prometheus.client.Collector.checkMetricName(Collector.java:182)
    at org.apache.flink.shaded.io.prometheus.client.SimpleCollector.<init>(SimpleCollector.java:164)
    at org.apache.flink.shaded.io.prometheus.client.Gauge.<init>(Gauge.java:68)
    at org.apache.flink.shaded.io.prometheus.client.Gauge$Builder.create(Gauge.java:74)
    at org.apache.flink.metrics.prometheus.AbstractPrometheusReporter.createCollector(AbstractPrometheusReporter.java:130)
    at org.apache.flink.metrics.prometheus.AbstractPrometheusReporter.notifyOfAddedMetric(AbstractPrometheusReporter.java:106)
    at org.apache.flink.runtime.metrics.MetricRegistryImpl.register(MetricRegistryImpl.java:329)
    at org.apache.flink.runtime.metrics.groups.AbstractMetricGroup.addMetric(AbstractMetricGroup.java:379)
    at org.apache.flink.runtime.metrics.groups.AbstractMetricGroup.gauge(AbstractMetricGroup.java:323)
    at org.apache.flink.runtime.metrics.util.MetricUtils.instantiateMemoryMetrics(MetricUtils.java:231)
    at org.apache.flink.runtime.metrics.util.MetricUtils.instantiateStatusMetrics(MetricUtils.java:100)
    at org.apache.flink.runtime.metrics.util.MetricUtils.instantiateJobManagerMetricGroup(MetricUtils.java:68)
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startClusterComponents(ClusterEntrypoint.java:342)
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:233)
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:191)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
    at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:190)
    at org.apache.flink.yarn.entrypoint.YarnJobClusterEntrypoint.main(YarnJobClusterEntrypoint.java:176)
{code}

This is a small program to reproduce the problem: https://github.com/dikei/flink-metrics-conflict-test

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
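The reporter configuration that triggers the conflict looks roughly like this (assumed to match the linked reproducer; the reporter names and port are ours):

{noformat}
metrics.reporters: jmx,prom
metrics.reporter.jmx.class: org.apache.flink.metrics.jmx.JMXReporter
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9249
{noformat}

The conflict only appears when both reporters are active at the same time.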
[jira] [Created] (FLINK-9628) Options to tolerate truncate failure in BucketingSink
Truong Duc Kien created FLINK-9628: -- Summary: Options to tolerate truncate failure in BucketingSink Key: FLINK-9628 URL: https://issues.apache.org/jira/browse/FLINK-9628 Project: Flink Issue Type: Improvement Affects Versions: 1.5.0 Reporter: Truong Duc Kien

When the target filesystem is corrupted, the truncate operation might fail permanently, causing the job to restart repeatedly without making progress. There should be an option to ignore these kinds of errors and allow the job to continue.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-9583) Wrong number of TaskManagers' slots after recovery.
Truong Duc Kien created FLINK-9583: -- Summary: Wrong number of TaskManagers' slots after recovery. Key: FLINK-9583 URL: https://issues.apache.org/jira/browse/FLINK-9583 Project: Flink Issue Type: Bug Components: ResourceManager Affects Versions: 1.5.0 Environment: Flink 1.5.0 on YARN with the default execution mode. Reporter: Truong Duc Kien Attachments: jm.log

We started a job with 120 slots, using a FixedDelayRestart strategy with a delay of 1 minute. During recovery, some, but not all, slots were released. When the job restarted again, Flink requested a new batch of slots. The total number of slots is now 193, larger than the configured amount, and the excess slots are never released. This bug does not happen in legacy mode. I've attached the job manager log.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-9552) NPE in SpanningRecordSerializer during checkpoint
Truong Duc Kien created FLINK-9552: -- Summary: NPE in SpanningRecordSerializer during checkpoint Key: FLINK-9552 URL: https://issues.apache.org/jira/browse/FLINK-9552 Project: Flink Issue Type: Bug Components: Type Serialization System Affects Versions: 1.5.0 Reporter: Truong Duc Kien

We are intermittently encountering an NPE inside SpanningRecordSerializer during checkpointing.

{code:java}
2018-06-08 08:31:35,741 [ka.actor.default-dispatcher-83] INFO o.a.f.r.e.ExecutionGraph IterationSource-22 (44/120) (c1b94ef849db0e5fb9fb7b85c17073ce) switched from RUNNING to FAILED.
java.lang.RuntimeException
    at org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:110)
    at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:89)
    at org.apache.flink.streaming.runtime.tasks.StreamIterationHead.run(StreamIterationHead.java:91)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:306)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:703)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
    at org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializer.addRecord(SpanningRecordSerializer.java:98)
    at org.apache.flink.runtime.io.network.api.writer.RecordWriter.sendToTarget(RecordWriter.java:129)
    at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:105)
    at org.apache.flink.streaming.runtime.io.StreamRecordWriter.emit(StreamRecordWriter.java:81)
    at org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:107)
    ... 5 more
2018-06-08 08:31:35,746 [ka.actor.default-dispatcher-83] INFO o.a.f.r.e.ExecutionGraph Job xxx (8a4eaf02c46dc21c7d6f3f70657dbb17) switched from state RUNNING to FAILING.
java.lang.RuntimeException
    at org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:110)
    at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:89)
    at org.apache.flink.streaming.runtime.tasks.StreamIterationHead.run(StreamIterationHead.java:91)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:306)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:703)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
    at org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializer.addRecord(SpanningRecordSerializer.java:98)
    at org.apache.flink.runtime.io.network.api.writer.RecordWriter.sendToTarget(RecordWriter.java:129)
    at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:105)
    at org.apache.flink.streaming.runtime.io.StreamRecordWriter.emit(StreamRecordWriter.java:81)
    at org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:107)
    ... 5 more
{code}

This issue is probably concurrency related, because the relevant Flink code appears to have a proper null check: https://github.com/apache/flink/blob/fa024726bb801fc71cec5cc303cac1d4a03f555e/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/api/serialization/SpanningRecordSerializer.java#L98

{code:java}
// Copy from intermediate buffers to current target memory segment
if (targetBuffer != null) {
    targetBuffer.append(lengthBuffer);
    targetBuffer.append(dataBuffer);
    targetBuffer.commit();
}
{code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-9467) No Watermark display on Web UI
Truong Duc Kien created FLINK-9467: -- Summary: No Watermark display on Web UI Key: FLINK-9467 URL: https://issues.apache.org/jira/browse/FLINK-9467 Project: Flink Issue Type: Bug Affects Versions: 1.5.0 Reporter: Truong Duc Kien

Watermarks are currently not shown on the web interface, because it still queries for them using the old metric name {{currentLowWatermark}} instead of the new ones, {{currentInputWatermark}} and {{currentOutputWatermark}}.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-9465) Separate timeout for savepoint and checkpoint
Truong Duc Kien created FLINK-9465: -- Summary: Separate timeout for savepoint and checkpoint Key: FLINK-9465 URL: https://issues.apache.org/jira/browse/FLINK-9465 Project: Flink Issue Type: Improvement Affects Versions: 1.5.0 Reporter: Truong Duc Kien

Savepoints can take much longer to perform than checkpoints, especially with incremental checkpointing enabled. This leads to a couple of problems:
* For our job, we currently have to set the checkpoint timeout much larger than necessary, otherwise we would be unable to take a savepoint.
* During rush hour, our cluster encounters a high rate of checkpoint timeouts due to backpressure, but we are unable to migrate to a larger configuration because the savepoint also times out.

In my opinion, the timeout for savepoints should be configurable separately, both in the config file and as a parameter to the savepoint command (sketched below).

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
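To make the request concrete: today a single timeout governs both operations. The savepoint-specific setting in the sketch is hypothetical and only illustrates what is being asked for.

{code:java}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TimeoutConfigSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Existing API: this single timeout applies to checkpoints and savepoints alike.
        env.getCheckpointConfig().setCheckpointTimeout(10 * 60 * 1000);   // 10 minutes

        // Requested (hypothetical, does not exist today): an independent savepoint timeout,
        // e.g. env.getCheckpointConfig().setSavepointTimeout(60 * 60 * 1000);
        // plus a matching option on the CLI savepoint command.
    }
}
{code}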
[jira] [Created] (FLINK-9459) Maven enforcer plugin prevents compilation with HDP's Hadoop
Truong Duc Kien created FLINK-9459: -- Summary: Maven enforcer plugin prevents compilation with HDP's Hadoop Key: FLINK-9459 URL: https://issues.apache.org/jira/browse/FLINK-9459 Project: Flink Issue Type: Bug Components: Build System Affects Versions: 1.5.0 Reporter: Truong Duc Kien

Compiling Flink against Hortonworks HDP's version of Hadoop currently fails, because the Maven Enforcer Plugin detects a dependency convergence problem with their Hadoop. The command used is:

{noformat}
mvn clean install -DskipTests -Dcheckstyle.skip=true -Dmaven.javadoc.skip=true -Pvendor-repos -Dhadoop.version=2.7.3.2.6.5.0-292
{noformat}

The problem:

{noformat}
Dependency convergence error for com.fasterxml.jackson.core:jackson-core:2.6.0 paths to dependency are:
+-org.apache.flink:flink-bucketing-sink-test:1.5-SNAPSHOT
  +-org.apache.flink:flink-shaded-hadoop2:1.5-SNAPSHOT
    +-com.microsoft.azure:azure-storage:5.4.0
      +-com.fasterxml.jackson.core:jackson-core:2.6.0
and
+-org.apache.flink:flink-bucketing-sink-test:1.5-SNAPSHOT
  +-org.apache.flink:flink-shaded-hadoop2:1.5-SNAPSHOT
    +-com.fasterxml.jackson.core:jackson-core:2.6.0
and
+-org.apache.flink:flink-bucketing-sink-test:1.5-SNAPSHOT
  +-org.apache.flink:flink-shaded-hadoop2:1.5-SNAPSHOT
    +-com.fasterxml.jackson.core:jackson-databind:2.2.3
      +-com.fasterxml.jackson.core:jackson-core:2.2.3

[WARNING] Rule 0: org.apache.maven.plugins.enforcer.DependencyConvergence failed with message:
Failed while enforcing releasability. See above detailed error message.
[INFO] FAILURE build of project org.apache.flink:flink-bucketing-sink-test
{noformat}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
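A possible temporary workaround for local vendor builds (assuming you are willing to forgo the convergence check) is to skip the enforcer plugin:

{noformat}
mvn clean install -DskipTests -Dcheckstyle.skip=true -Dmaven.javadoc.skip=true \
    -Pvendor-repos -Dhadoop.version=2.7.3.2.6.5.0-292 -Denforcer.skip=true
{noformat}

A cleaner fix would be to align the conflicting jackson-core versions via dependencyManagement so that the convergence check can stay enabled.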
[jira] [Created] (FLINK-9458) Unable to recover from job failure on YARN with NPE
Truong Duc Kien created FLINK-9458: -- Summary: Unable to recover from job failure on YARN with NPE Key: FLINK-9458 URL: https://issues.apache.org/jira/browse/FLINK-9458 Project: Flink Issue Type: Bug Affects Versions: 1.5.0 Environment: Ambari HDP 2.6.3 Hadoop 2.7.3 Job configuration: 120 Task Managers x 1 slots Reporter: Truong Duc Kien

After upgrading our jobs to Flink 1.5, they are unable to recover from failures; the following exception appears repeatedly:

{noformat}
2018-05-29 04:56:06,086 [ jobmanager-future-thread-36] INFO o.a.f.r.e.ExecutionGraph Try to restart or fail the job xxx (23d9e87bf43ce163ff7db8afb062fb1d) if no longer possible.
2018-05-29 04:56:06,086 [ jobmanager-future-thread-36] INFO o.a.f.r.e.ExecutionGraph Job xxx (23d9e87bf43ce163ff7db8afb062fb1d) switched from state RESTARTING to RESTARTING.
2018-05-29 04:56:06,086 [ jobmanager-future-thread-36] INFO o.a.f.r.e.ExecutionGraph Restarting the job xxx (23d9e87bf43ce163ff7db8afb062fb1d).
2018-05-29 04:57:06,086 [ jobmanager-future-thread-36] WARN o.a.f.r.e.ExecutionGraph Failed to restart the job.
java.lang.NullPointerException
    at org.apache.flink.runtime.jobmanager.scheduler.CoLocationConstraint.isAssignedAndAlive(CoLocationConstraint.java:104)
    at org.apache.flink.runtime.jobmanager.scheduler.CoLocationGroup.resetConstraints(CoLocationGroup.java:119)
    at org.apache.flink.runtime.executiongraph.ExecutionGraph.restart(ExecutionGraph.java:1247)
    at org.apache.flink.runtime.executiongraph.restart.ExecutionGraphRestartCallback.triggerFullRecovery(ExecutionGraphRestartCallback.java:59)
    at org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy$1.run(FixedDelayRestartStrategy.java:68)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
{noformat}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-9070) Improve performance of RocksDBMapState.clear()
Truong Duc Kien created FLINK-9070: -- Summary: Improve performance of RocksDBMapState.clear() Key: FLINK-9070 URL: https://issues.apache.org/jira/browse/FLINK-9070 Project: Flink Issue Type: Improvement Components: State Backends, Checkpointing Affects Versions: 1.6.0 Reporter: Truong Duc Kien

Currently, RocksDBMapState.clear() is implemented by iterating over all the keys and dropping them one by one. This iteration can be quite slow for:
* Large maps
* High-churn maps with a lot of tombstones

There are a few methods to speed up deletion for a range of keys, each with its own caveats (see the sketch below):
* DeleteRange: still experimental, likely buggy
* DeleteFilesInRange + CompactRange: only good for large ranges

Flink could also keep a list of inserted keys in memory, then delete them directly without having to iterate over the RocksDB database again.

Reference:
* [RocksDB article about range deletion|https://github.com/facebook/rocksdb/wiki/Delete-A-Range-Of-Keys]
* [Bug in DeleteRange|https://pingcap.com/blog/2017-09-08-rocksdbbug]

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
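For illustration, a sketch of the DeleteRange idea using the RocksDB Java API (a hypothetical helper, not Flink code): since all entries of one map state share a serialized prefix (key group, key and namespace), the whole map could be dropped with a single range tombstone instead of an iteration.

{code:java}
import java.util.Arrays;

import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

final class RangeClearSketch {

    /** Drops every entry whose key starts with the given prefix in one DeleteRange call. */
    static void clearByRange(RocksDB db, ColumnFamilyHandle handle, byte[] prefix) throws RocksDBException {
        // Caveat from above: DeleteRange is still considered experimental.
        db.deleteRange(handle, prefix, prefixUpperBound(prefix));
    }

    /** Smallest byte array that is strictly greater than every key starting with {@code prefix}. */
    private static byte[] prefixUpperBound(byte[] prefix) {
        byte[] end = prefix.clone();
        for (int i = end.length - 1; i >= 0; i--) {
            if (end[i] != (byte) 0xFF) {
                end[i]++;
                return Arrays.copyOf(end, i + 1);
            }
        }
        throw new IllegalArgumentException("prefix has no upper bound");
    }
}
{code}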
[jira] [Created] (FLINK-8949) Rest API failure with long URL
Truong Duc Kien created FLINK-8949: -- Summary: Rest API failure with long URL Key: FLINK-8949 URL: https://issues.apache.org/jira/browse/FLINK-8949 Project: Flink Issue Type: Bug Components: REST, Webfrontend Affects Versions: 1.4.2 Reporter: Truong Duc Kien

For jobs with high parallelism, the URL of a REST request can get very long. When the URL is longer than 4096 bytes, the REST API returns the error {{Failure: 404 Not Found}}. This is easy to see in the Web UI when Flink queries for the watermarks through the REST API:

{{GET /jobs/:jobId/vertices/:vertexId/metrics?get=0.currentLowWatermark,1.currentLowWatermark,2.currentLo...}}

With more than about 170 subtasks the request fails, and the watermark is not displayed in the Web UI.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
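Until the server accepts longer request lines, a client-side workaround is possible by splitting the comma-separated {{get}} parameter into several smaller requests. The helper below is a hypothetical sketch, not part of Flink:

{code:java}
import java.util.ArrayList;
import java.util.List;

/** Splits one metrics query into several URLs, each staying under a given length limit. */
final class MetricQueryBatcher {

    static List<String> batchQueries(String baseUrl, List<String> metricNames, int maxUrlLength) {
        List<String> urls = new ArrayList<>();
        StringBuilder names = new StringBuilder();
        for (String name : metricNames) {
            // Projected URL length if this name (plus a separating comma) were appended.
            int projected = baseUrl.length() + "?get=".length() + names.length()
                    + (names.length() == 0 ? 0 : 1) + name.length();
            if (projected > maxUrlLength && names.length() > 0) {
                urls.add(baseUrl + "?get=" + names);
                names.setLength(0);
            }
            if (names.length() > 0) {
                names.append(',');
            }
            names.append(name);
        }
        if (names.length() > 0) {
            urls.add(baseUrl + "?get=" + names);
        }
        return urls;
    }
}
{code}

Splitting, say, 240 subtask watermark metrics across a handful of requests keeps each URL below the 4096-byte limit observed above.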
[jira] [Created] (FLINK-8946) TaskManager stop sending metrics after JobManager failover
Truong Duc Kien created FLINK-8946: -- Summary: TaskManager stop sending metrics after JobManager failover Key: FLINK-8946 URL: https://issues.apache.org/jira/browse/FLINK-8946 Project: Flink Issue Type: Bug Components: Metrics, TaskManager Affects Versions: 1.4.2 Reporter: Truong Duc Kien

Running in Yarn-standalone mode, when the JobManager performs a failover, all TaskManagers inherited from the previous JobManager stop sending metrics to the new JobManager and the other registered metric reporters. A cursory glance suggests these lines of code might be the cause: [https://github.com/apache/flink/blob/a3478fdfa0f792104123fefbd9bdf01f5029de51/flink-runtime/src/main/scala/org/apache/flink/runtime/taskmanager/TaskManager.scala#L1082-L1086] Perhaps the TaskManager closes its metrics group when disassociating from the JobManager, but does not create a new one when it re-associates after failover?

-- This message was sent by Atlassian JIRA (v7.6.3#76005)