[
https://issues.apache.org/jira/browse/FLINK-15388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17004012#comment-17004012
]
hiliuxg edited comment on FLINK-15388 at 12/27/19 8:35 AM:
-----------------------------------------------------------
Hi [~xintongsong],
That error log was caused by a colleague whose web UI polled the job status
every 10 seconds. I've asked him to stop the polling, and I will then observe
the job again.
In addition, I found that the JobManager prints the following WARN logs every
10 seconds. Are they related to the TM heartbeat? Why does the connection fail
for only a few machines?
{code:java}
2019-12-27 15:53:38.639 [flink-akka.actor.default-dispatcher-19] WARN
akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection
to [null] failed with java.net.ConnectException: Connection refused:
/10.1.209.236:34594
2019-12-27 15:53:38.639 [flink-akka.actor.default-dispatcher-19] WARN
akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection
to [null] failed with java.net.ConnectException: Connection refused:
/10.1.209.236:34594
2019-12-27 15:53:38.639 [flink-akka.actor.default-dispatcher-19] WARN
akka.remote.ReliableDeliverySupervisor
flink-akka.remote.default-remote-dispatcher-5 - Association with remote system
[akka.tcp://[email protected]:34594] has failed, address is now gated for [50]
ms. Reason: [Association failed with [akka.tcp://[email protected]:34594]]
Caused by: [Connection refused: /10.1.209.236:34594]
2019-12-27 15:53:38.639 [flink-akka.actor.default-dispatcher-19] WARN
akka.remote.ReliableDeliverySupervisor
flink-akka.remote.default-remote-dispatcher-5 - Association with remote system
[akka.tcp://[email protected]:34594] has failed, address is now gated for [50]
ms. Reason: [Association failed with [akka.tcp://[email protected]:34594]]
Caused by: [Connection refused: /10.1.209.236:34594]
2019-12-27 15:53:38.640 [flink-akka.actor.default-dispatcher-19] WARN
akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection
to [null] failed with java.net.ConnectException: Connection refused:
/10.1.209.156:17679
2019-12-27 15:53:38.640 [flink-akka.actor.default-dispatcher-19] WARN
akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection
to [null] failed with java.net.ConnectException: Connection refused:
/10.1.209.156:17679
2019-12-27 15:53:38.640 [flink-akka.actor.default-dispatcher-19] WARN
akka.remote.ReliableDeliverySupervisor
flink-akka.remote.default-remote-dispatcher-34 - Association with remote system
[akka.tcp://[email protected]:17679] has failed, address is now gated for [50]
ms. Reason: [Association failed with [akka.tcp://[email protected]:17679]]
Caused by: [Connection refused: /10.1.209.156:17679]
2019-12-27 15:53:38.640 [flink-akka.actor.default-dispatcher-19] WARN
akka.remote.ReliableDeliverySupervisor
flink-akka.remote.default-remote-dispatcher-34 - Association with remote system
[akka.tcp://[email protected]:17679] has failed, address is now gated for [50]
ms. Reason: [Association failed with [akka.tcp://[email protected]:17679]]
Caused by: [Connection refused: /10.1.209.156:17679]
2019-12-27 15:53:38.641 [flink-akka.actor.default-dispatcher-19] WARN
akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection
to [null] failed with java.net.ConnectException: Connection refused:
/10.1.209.158:27065
2019-12-27 15:53:38.641 [flink-akka.actor.default-dispatcher-19] WARN
akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection
to [null] failed with java.net.ConnectException: Connection refused:
/10.1.209.158:27065
2019-12-27 15:53:38.641 [flink-akka.actor.default-dispatcher-19] WARN
akka.remote.ReliableDeliverySupervisor
flink-akka.remote.default-remote-dispatcher-34 - Association with remote system
[akka.tcp://[email protected]:27065] has failed, address is now gated for [50]
ms. Reason: [Association failed with [akka.tcp://[email protected]:27065]]
Caused by: [Connection refused: /10.1.209.158:27065]
2019-12-27 15:53:38.641 [flink-akka.actor.default-dispatcher-19] WARN
akka.remote.ReliableDeliverySupervisor
flink-akka.remote.default-remote-dispatcher-34 - Association with remote system
[akka.tcp://[email protected]:27065] has failed, address is now gated for [50]
ms. Reason: [Association failed with [akka.tcp://[email protected]:27065]]
Caused by: [Connection refused: /10.1.209.158:27065]
{code}
The TM list:
!tm.png!
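To see which endpoints the JobManager fails to reach, the warnings above could be tallied per address. Below is a minimal sketch in Python, assuming the JobManager log is available as plain text. The function name `refused_endpoints` and the `SAMPLE` excerpt are illustrative only and are not part of Flink.

```python
import re
from collections import Counter

# Matches the endpoint in "Connection refused: /10.1.209.236:34594".
# \s* tolerates the line wrapping seen in the pasted log.
REFUSED = re.compile(r"Connection refused:\s*/(\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def refused_endpoints(log_text: str) -> Counter:
    """Count how often each host:port appears in 'Connection refused' messages."""
    return Counter(f"{host}:{port}" for host, port in REFUSED.findall(log_text))


# Short excerpt of the warnings quoted above (hypothetical input for the sketch).
SAMPLE = """\
2019-12-27 15:53:38.639 [flink-akka.actor.default-dispatcher-19] WARN
akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection
to [null] failed with java.net.ConnectException: Connection refused:
/10.1.209.236:34594
2019-12-27 15:53:38.640 [flink-akka.actor.default-dispatcher-19] WARN
akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection
to [null] failed with java.net.ConnectException: Connection refused:
/10.1.209.156:17679
"""

if __name__ == "__main__":
    # Print each refused endpoint with the number of times it was reported.
    for endpoint, count in refused_endpoints(SAMPLE).most_common():
        print(endpoint, count)
```

Comparing the resulting addresses with the TM list may help tell whether the refused endpoints belong to TaskManagers that are still registered or to processes that have already gone away.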
> The assigned slot bae00218c818157649eb9e3c533b86af_32 was removed.
> ------------------------------------------------------------------
>
> Key: FLINK-15388
> URL: https://issues.apache.org/jira/browse/FLINK-15388
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Task
> Affects Versions: 1.8.0
> Environment: mode: standalone, not YARN
> version: Flink 1.8.0
> configuration:
> jobmanager.heap.size: 4096m
> taskmanager.heap.size: 144gb
> taskmanager.numberOfTaskSlots: 48
> taskmanager.memory.fraction: 0.7
> taskmanager.memory.off-heap: false
> parallelism.default: 1
>
> Reporter: hiliuxg
> Priority: Major
> Attachments: 236log.7z, 236log.7z, metrics.png, metrics.png, tm.png,
> 屏幕快照 2019-12-27 下午3.05.36.png
>
>
> The TaskManager's slot was removed, but there was no full GC or OOM. What is
> the problem? The error is below:
> {code:java}
> org.apache.flink.util.FlinkException: The assigned slot bae00218c818157649eb9e3c533b86af_32 was removed.
> at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:893)
> at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:863)
> at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1058)
> at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:385)
> at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:847)
> at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener$1.run(ResourceManager.java:1161)
> at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)
> at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)
> at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
> at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
> at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
> at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
> at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)