[
https://issues.apache.org/jira/browse/FLINK-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Till Rohrmann reassigned FLINK-9228:
------------------------------------
Assignee: (was: makeyang)
> log details about task fail/task manager is shutting down
> ---------------------------------------------------------
>
> Key: FLINK-9228
> URL: https://issues.apache.org/jira/browse/FLINK-9228
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Affects Versions: 1.4.2
> Reporter: makeyang
> Priority: Minor
> Fix For: 1.7.3
>
>
> condition:
> flink version: 1.4.2
> jdk version: 1.8.0.20
> linux version: 3.10.0
> problem description:
> one of my task managers dropped out of the cluster. I checked its log and
> found the following:
> 2018-04-19 22:34:47,441 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally Process (115/120) (19d0b0ce1ef3b8023b37bdfda643ef44).
> 2018-04-19 22:34:47,441 INFO org.apache.flink.runtime.taskmanager.Task - Process (115/120) (19d0b0ce1ef3b8023b37bdfda643ef44) switched from RUNNING to FAILED.
> java.lang.Exception: TaskManager is shutting down.
>     at org.apache.flink.runtime.taskmanager.TaskManager.postStop(TaskManager.scala:220)
>     at akka.actor.Actor$class.aroundPostStop(Actor.scala:515)
>     at org.apache.flink.runtime.taskmanager.TaskManager.aroundPostStop(TaskManager.scala:121)
>     at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
>     at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
>     at akka.actor.ActorCell.terminate(ActorCell.scala:374)
>     at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:467)
>     at akka.actor.ActorCell.systemInvoke(ActorCell.scala:483)
>     at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:282)
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:260)
>     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> suggestion:
> # short term suggestion:
> ## log the reason why a task fails: did it receive some event from the job
> manager, could it not connect to the job manager, or was there an operator
> exception? the more clarity the better
> ## log the reason why the task manager is shutting down: did it receive some
> event from the job manager, could it not connect to the job manager, or was
> there an operator exception that can't be recovered from?
> # long term suggestion:
> ## define the state machine of a flink node clearly. if nothing happens, the
> node should stay in the state it was in: if it is processing events and
> nothing happens, it should keep processing events. in other words, if its
> state changes from processing events to cancelled, then an event must have
> happened.
> ## define clearly the events which can cause a node state change, such as
> user cancel, operator exception, heartbeat timeout etc.
> ## log each state change, and the event which caused it, clearly in the logs
> ## show event details (time, node, event, state change etc.) in the web UI
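
To make the long term suggestion concrete, here is a minimal sketch of what an explicit node state machine with logged transitions might look like. It is only an illustration under stated assumptions: NodeStateMachine, its State and Event enums, and the log line format are hypothetical names made up for this example, and none of them are existing Flink classes or the way Flink actually tracks task manager state.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: these types are NOT part of Flink.
class NodeStateMachine {

    enum State { PROCESSING, CANCELLED, FAILED }

    // The events named in the issue as possible causes of a state change.
    enum Event { USER_CANCEL, OPERATOR_EXCEPTION, HEARTBEAT_TIMEOUT }

    // A node leaves PROCESSING only when one of these events arrives.
    private static final Map<Event, State> FROM_PROCESSING = new EnumMap<>(Event.class);
    static {
        FROM_PROCESSING.put(Event.USER_CANCEL, State.CANCELLED);
        FROM_PROCESSING.put(Event.OPERATOR_EXCEPTION, State.FAILED);
        FROM_PROCESSING.put(Event.HEARTBEAT_TIMEOUT, State.FAILED);
    }

    private final String node;
    private State state = State.PROCESSING;
    private final List<String> transitionLog = new ArrayList<>();

    NodeStateMachine(String node) {
        this.node = node;
    }

    State state() {
        return state;
    }

    List<String> transitionLog() {
        return transitionLog;
    }

    // Applies an event; every state change is recorded together with its cause.
    State handle(Event event, String details) {
        if (state != State.PROCESSING) {
            return state; // CANCELLED and FAILED are terminal in this sketch
        }
        State next = FROM_PROCESSING.get(event);
        transitionLog.add(String.format("%s node=%s %s -> %s caused by %s (%s)",
                Instant.now(), node, state, next, event, details));
        state = next;
        return state;
    }

    public static void main(String[] args) {
        NodeStateMachine tm = new NodeStateMachine("taskmanager-1");
        tm.handle(Event.HEARTBEAT_TIMEOUT, "no heartbeat from job manager");
        tm.transitionLog().forEach(System.out::println);
    }
}
```

With something along these lines, a log entry such as "switched from RUNNING to FAILED" would always carry the triggering event and its details, and the same records could feed an event view in the web UI.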
--
This message was sent by Atlassian Jira
(v8.3.4#803005)