[ 
https://issues.apache.org/jira/browse/FLINK-15388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003934#comment-17003934
 ] 

Xintong Song edited comment on FLINK-15388 at 12/27/19 7:12 AM:
----------------------------------------------------------------

One thing caught my attention: there are quite a few error messages like 
"Exception occurred in REST handler: Job 9bf1a8b3b40ddccb5aa258f150a750b1 not 
found". This suggests that something monitoring other jobs is querying the 
wrong REST server address.

I printed out the times and counts of these error messages and found that 
the time points with many such errors closely match the time points with 
high Prometheus scrape durations.
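
The counting step can be sketched roughly as follows. This is only a minimal 
sketch, not the exact script I used: it assumes every log line starts with a 
"yyyy-MM-dd HH:mm:ss,SSS" timestamp, and the function names are made up for 
illustration.

```python
import re
from collections import Counter

# Assumption: each log line begins with a "yyyy-MM-dd HH:mm:ss,SSS" timestamp.
# Group 1 captures the timestamp truncated to the minute.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2},\d{3} .*"
    r"Exception occurred in REST handler: Job [0-9a-f]{32} not found"
)


def count_job_not_found_per_minute(lines):
    """Map 'yyyy-MM-dd HH:mm' to the number of 'Job ... not found' errors."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def busiest_minutes(counts, top=10):
    """Return the `top` minutes with the most errors, busiest first."""
    return counts.most_common(top)
```

Passing an open log file to `count_job_not_found_per_minute` and then calling 
`busiest_minutes` on the result gives the minutes to compare against the 
Prometheus scrape duration graph.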

This might be what affects the heartbeats, because the REST server needs 
to access the RPC main thread.
 !屏幕快照 2019-12-27 下午3.05.36.png! 
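
To illustrate the suspicion, here is a toy model of a single thread that 
processes REST lookups and heartbeats in arrival order. It is not Flink's 
actual threading code; the event tuples, costs, and function names are all 
hypothetical and only show how a burst of queued requests can delay a 
heartbeat.

```python
def simulate_main_thread(events):
    """Process events FIFO on one simulated thread.

    events: iterable of (arrival_ms, cost_ms, kind) tuples.
    Returns a list of (kind, arrival_ms, finish_ms) in processing order.
    """
    clock = 0
    results = []
    # Stable sort by arrival time keeps the submission order for ties.
    for arrival_ms, cost_ms, kind in sorted(events, key=lambda e: e[0]):
        start = max(clock, arrival_ms)  # wait until the thread is free
        clock = start + cost_ms
        results.append((kind, arrival_ms, clock))
    return results


def max_heartbeat_latency_ms(events):
    """Largest delay between a heartbeat's arrival and its completion."""
    return max(
        finish - arrival
        for kind, arrival, finish in simulate_main_thread(events)
        if kind == "heartbeat"
    )
```

With made-up numbers, 1000 REST requests of 60 ms each queued just before a 
heartbeat delay that heartbeat by about 60 seconds in this model, while the 
same heartbeat on an idle thread completes in 1 ms.
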
I would suggest first finding out where the REST queries come from and 
eliminating them, then checking whether the problem still exists.


> The assigned slot bae00218c818157649eb9e3c533b86af_32 was removed.
> ------------------------------------------------------------------
>
>                 Key: FLINK-15388
>                 URL: https://issues.apache.org/jira/browse/FLINK-15388
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.8.0
>         Environment: mode: standalone, not YARN
> version: Flink 1.8.0
> configuration:
> jobmanager.heap.size: 4096m
> taskmanager.heap.size: 144gb
> taskmanager.numberOfTaskSlots: 48
> taskmanager.memory.fraction: 0.7
> taskmanager.memory.off-heap: false
> parallelism.default: 1
>  
>            Reporter: hiliuxg
>            Priority: Major
>         Attachments: 236log.7z, 236log.7z, metrics.png, metrics.png, 屏幕快照 
> 2019-12-27 下午3.05.36.png
>
>
> The TaskManager's slot was removed. There was no full GC or OOM. What is 
> the problem? The error is below:
> {code:java}
> org.apache.flink.util.FlinkException: The assigned slot bae00218c818157649eb9e3c533b86af_32 was removed.
>  at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:893)
>  at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:863)
>  at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1058)
>  at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:385)
>  at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:847)
>  at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener$1.run(ResourceManager.java:1161)
>  at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)
>  at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)
>  at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>  at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
>  at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
>  at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>  at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>  at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>  at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>  at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>  at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>  at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>  at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
