[ https://issues.apache.org/jira/browse/FLINK-15388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17004012#comment-17004012 ]
hiliuxg commented on FLINK-15388:
---------------------------------

That error log was caused by a colleague who requested the job status every 10 seconds to show it on his web UI. I've asked him to cancel the polling, and I will then watch the job again.

In addition, I found that the JobManager prints a WARN log every 10 seconds. Is this the TM's heartbeat log? Why does the connection fail for only a few machines?

{code:java}
2019-12-27 15:53:38.639 [flink-akka.actor.default-dispatcher-19] WARN akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /10.1.209.236:34594
2019-12-27 15:53:38.639 [flink-akka.actor.default-dispatcher-19] WARN akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /10.1.209.236:34594
2019-12-27 15:53:38.639 [flink-akka.actor.default-dispatcher-19] WARN akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-5 - Association with remote system [akka.tcp://flink@10.1.209.236:34594] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@10.1.209.236:34594]] Caused by: [Connection refused: /10.1.209.236:34594]
2019-12-27 15:53:38.639 [flink-akka.actor.default-dispatcher-19] WARN akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-5 - Association with remote system [akka.tcp://flink@10.1.209.236:34594] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@10.1.209.236:34594]] Caused by: [Connection refused: /10.1.209.236:34594]
2019-12-27 15:53:38.640 [flink-akka.actor.default-dispatcher-19] WARN akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /10.1.209.156:17679
2019-12-27 15:53:38.640 [flink-akka.actor.default-dispatcher-19] WARN akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /10.1.209.156:17679
2019-12-27 15:53:38.640 [flink-akka.actor.default-dispatcher-19] WARN akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-34 - Association with remote system [akka.tcp://flink@10.1.209.156:17679] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@10.1.209.156:17679]] Caused by: [Connection refused: /10.1.209.156:17679]
2019-12-27 15:53:38.640 [flink-akka.actor.default-dispatcher-19] WARN akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-34 - Association with remote system [akka.tcp://flink@10.1.209.156:17679] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@10.1.209.156:17679]] Caused by: [Connection refused: /10.1.209.156:17679]
2019-12-27 15:53:38.641 [flink-akka.actor.default-dispatcher-19] WARN akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /10.1.209.158:27065
2019-12-27 15:53:38.641 [flink-akka.actor.default-dispatcher-19] WARN akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /10.1.209.158:27065
2019-12-27 15:53:38.641 [flink-akka.actor.default-dispatcher-19] WARN akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-34 - Association with remote system [akka.tcp://flink@10.1.209.158:27065] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@10.1.209.158:27065]] Caused by: [Connection refused: /10.1.209.158:27065]
2019-12-27 15:53:38.641 [flink-akka.actor.default-dispatcher-19] WARN akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-34 - Association with remote system [akka.tcp://flink@10.1.209.158:27065] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@10.1.209.158:27065]] Caused by: [Connection refused: /10.1.209.158:27065]
{code}

The TM list:

!tm.png!

> The assigned slot bae00218c818157649eb9e3c533b86af_32 was removed.
> ------------------------------------------------------------------
>
> Key: FLINK-15388
> URL: https://issues.apache.org/jira/browse/FLINK-15388
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Task
> Affects Versions: 1.8.0
> Environment: mode: standalone, not YARN
> version: Flink 1.8.0
> configuration:
> jobmanager.heap.size: 4096m
> taskmanager.heap.size: 144gb
> taskmanager.numberOfTaskSlots: 48
> taskmanager.memory.fraction: 0.7
> taskmanager.memory.off-heap: false
> parallelism.default: 1
> Reporter: hiliuxg
> Priority: Major
> Attachments: 236log.7z, 236log.7z, metrics.png, metrics.png, tm.png, 屏幕快照 2019-12-27 下午3.05.36.png
>
>
> The TaskManager's slot was removed. There was no full GC or OOM. What's the problem? The error is below:
> {code:java}
> org.apache.flink.util.FlinkException: The assigned slot bae00218c818157649eb9e3c533b86af_32 was removed.
> 	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:893)
> 	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:863)
> 	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1058)
> 	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:385)
> 	at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:847)
> 	at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener$1.run(ResourceManager.java:1161)
> 	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)
> 	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)
> 	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
> 	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
> 	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
> 	at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
> 	at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> 	at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> 	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> 	at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> 	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> 	at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> 	at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> 	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)