[ 
https://issues.apache.org/jira/browse/FLINK-15388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17004012#comment-17004012
 ] 

hiliuxg commented on FLINK-15388:
---------------------------------

That error log was caused by a colleague who was polling the job status every 
10 seconds to show it on his web UI. I've asked him to stop the polling, and I 
will keep watching the job.

In addition, I found that the JobManager prints the WARN logs below every 10 
seconds. Are they related to the TM heartbeat? Why do the connections fail for 
only a few machines? 
{code:java}
2019-12-27 15:53:38.639 [flink-akka.actor.default-dispatcher-19] WARN  
akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection 
to [null] failed with java.net.ConnectException: Connection refused: 
/10.1.209.236:34594
2019-12-27 15:53:38.639 [flink-akka.actor.default-dispatcher-19] WARN  
akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection 
to [null] failed with java.net.ConnectException: Connection refused: 
/10.1.209.236:34594
2019-12-27 15:53:38.639 [flink-akka.actor.default-dispatcher-19] WARN  
akka.remote.ReliableDeliverySupervisor 
flink-akka.remote.default-remote-dispatcher-5 - Association with remote system 
[akka.tcp://flink@10.1.209.236:34594] has failed, address is now gated for [50] 
ms. Reason: [Association failed with [akka.tcp://flink@10.1.209.236:34594]] 
Caused by: [Connection refused: /10.1.209.236:34594]
2019-12-27 15:53:38.639 [flink-akka.actor.default-dispatcher-19] WARN  
akka.remote.ReliableDeliverySupervisor 
flink-akka.remote.default-remote-dispatcher-5 - Association with remote system 
[akka.tcp://flink@10.1.209.236:34594] has failed, address is now gated for [50] 
ms. Reason: [Association failed with [akka.tcp://flink@10.1.209.236:34594]] 
Caused by: [Connection refused: /10.1.209.236:34594]
2019-12-27 15:53:38.640 [flink-akka.actor.default-dispatcher-19] WARN  
akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection 
to [null] failed with java.net.ConnectException: Connection refused: 
/10.1.209.156:17679
2019-12-27 15:53:38.640 [flink-akka.actor.default-dispatcher-19] WARN  
akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection 
to [null] failed with java.net.ConnectException: Connection refused: 
/10.1.209.156:17679
2019-12-27 15:53:38.640 [flink-akka.actor.default-dispatcher-19] WARN  
akka.remote.ReliableDeliverySupervisor 
flink-akka.remote.default-remote-dispatcher-34 - Association with remote system 
[akka.tcp://flink@10.1.209.156:17679] has failed, address is now gated for [50] 
ms. Reason: [Association failed with [akka.tcp://flink@10.1.209.156:17679]] 
Caused by: [Connection refused: /10.1.209.156:17679]
2019-12-27 15:53:38.640 [flink-akka.actor.default-dispatcher-19] WARN  
akka.remote.ReliableDeliverySupervisor 
flink-akka.remote.default-remote-dispatcher-34 - Association with remote system 
[akka.tcp://flink@10.1.209.156:17679] has failed, address is now gated for [50] 
ms. Reason: [Association failed with [akka.tcp://flink@10.1.209.156:17679]] 
Caused by: [Connection refused: /10.1.209.156:17679]
2019-12-27 15:53:38.641 [flink-akka.actor.default-dispatcher-19] WARN  
akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection 
to [null] failed with java.net.ConnectException: Connection refused: 
/10.1.209.158:27065
2019-12-27 15:53:38.641 [flink-akka.actor.default-dispatcher-19] WARN  
akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection 
to [null] failed with java.net.ConnectException: Connection refused: 
/10.1.209.158:27065
2019-12-27 15:53:38.641 [flink-akka.actor.default-dispatcher-19] WARN  
akka.remote.ReliableDeliverySupervisor 
flink-akka.remote.default-remote-dispatcher-34 - Association with remote system 
[akka.tcp://flink@10.1.209.158:27065] has failed, address is now gated for [50] 
ms. Reason: [Association failed with [akka.tcp://flink@10.1.209.158:27065]] 
Caused by: [Connection refused: /10.1.209.158:27065]
2019-12-27 15:53:38.641 [flink-akka.actor.default-dispatcher-19] WARN  
akka.remote.ReliableDeliverySupervisor 
flink-akka.remote.default-remote-dispatcher-34 - Association with remote system 
[akka.tcp://flink@10.1.209.158:27065] has failed, address is now gated for [50] 
ms. Reason: [Association failed with [akka.tcp://flink@10.1.209.158:27065]] 
Caused by: [Connection refused: /10.1.209.158:27065]
{code}
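
To compare these warnings with the registered TMs, it may help to tally the distinct endpoints that refuse connections. The snippet below is only a minimal sketch: the function name `refused_endpoints` and the `SAMPLE` excerpt are illustrative and not part of Flink, and it assumes the log text is wrapped as in the paste above.

```python
import re
from collections import Counter

# Matches the host:port that follows "Connection refused: /" in the Akka warnings.
REFUSED = re.compile(r"Connection refused: /(\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def refused_endpoints(log_text):
    """Count refused-connection mentions per (host, port) endpoint."""
    # The pasted log is hard-wrapped, so collapse all whitespace before matching.
    flat = " ".join(log_text.split())
    return Counter((host, int(port)) for host, port in REFUSED.findall(flat))


# Illustrative excerpt taken from the warnings above.
SAMPLE = (
    "2019-12-27 15:53:38.639 [flink-akka.actor.default-dispatcher-19] WARN  \n"
    "akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection \n"
    "to [null] failed with java.net.ConnectException: Connection refused: \n"
    "/10.1.209.236:34594\n"
    "2019-12-27 15:53:38.640 [flink-akka.actor.default-dispatcher-19] WARN  \n"
    "akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection \n"
    "to [null] failed with java.net.ConnectException: Connection refused: \n"
    "/10.1.209.156:17679\n"
)

if __name__ == "__main__":
    for (host, port), count in sorted(refused_endpoints(SAMPLE).items()):
        print(f"{host}:{port} refused {count} time(s)")
```

If the reported host:port pairs do not appear in the TM list, one possibility worth checking is that they are addresses of TM processes that are no longer running.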

The TM list:
 !tm.png! 



> The assigned slot bae00218c818157649eb9e3c533b86af_32 was removed.
> ------------------------------------------------------------------
>
>                 Key: FLINK-15388
>                 URL: https://issues.apache.org/jira/browse/FLINK-15388
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.8.0
>         Environment: mode: standalone, not YARN
> version: Flink 1.8.0
> configuration: 
> jobmanager.heap.size: 4096m
> taskmanager.heap.size: 144gb
> taskmanager.numberOfTaskSlots: 48
> taskmanager.memory.fraction: 0.7
> taskmanager.memory.off-heap: false
> parallelism.default: 1
>  
>            Reporter: hiliuxg
>            Priority: Major
>         Attachments: 236log.7z, 236log.7z, metrics.png, metrics.png, tm.png, 
> 屏幕快照 2019-12-27 下午3.05.36.png
>
>
> The TaskManager's slot was removed. There was no full GC or OOM. What is 
> the problem? The error is below:
> {code:java}
> org.apache.flink.util.FlinkException: The assigned slot 
> bae00218c818157649eb9e3c533b86af_32 was removed.
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:893)
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:863)
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1058)
>  at 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:385)
>  at 
> org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:847)
>  at 
> org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener$1.run(ResourceManager.java:1161)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)
>  at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
>  at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
>  at 
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>  at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>  at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>  at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>  at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>  at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>  at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>  at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
