[ https://issues.apache.org/jira/browse/FLINK-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611939#comment-14611939 ]
Andra Lungu commented on FLINK-2299:
------------------------------------
The JM seems to have died at this time:
00:46:06,185 WARN akka.remote.ReliableDeliverySupervisor
- Association with remote system [akka.tcp://[email protected]:36710] has
failed, address is now gated for [5000] ms. Reason is: [Disassociated].
However, the TMs were all trying to register with it at
00:34:52,404 INFO org.apache.flink.runtime.taskmanager.TaskManager
- Trying to register at JobManager
akka.tcp://[email protected]:6123/user/jobmanager (attempt 1, timeout: 500
milliseconds)
and could not...
And in the JM, they seem to be registered:
00:34:49,034 INFO org.apache.flink.runtime.jobmanager.web.WebInfoServer
- Started web info server for JobManager on 0.0.0.0:8081
00:34:51,807 INFO org.apache.flink.runtime.instance.InstanceManager
- Registered TaskManager at wally003
(akka.tcp://[email protected]:41371/user/taskmanager) as
432442efcde05962b9fd8703399b692e. Current number of registered hosts is 1.
00:34:51,876 INFO org.apache.flink.runtime.instance.InstanceManager
- Registered TaskManager at wally002
(akka.tcp://[email protected]:36710/user/taskmanager) as
d958195edfa9d98fb0b3f83da41af5aa. Current number of registered hosts is 2.
....
At 00:46 the TMs were already doing this:
00:46:06,387 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager
- I/O manager removed spill file directory
/data/andra.lungu/flink_tmp/flink-io-27b23260-80ec-4f8a-9163-5cad90b12be7
I will not be able to run experiments until Sunday morning, but I still have
the logs. Maybe I am not looking in the right place?! As soon as I get my
nodes back, I will increase the heartbeat interval; hopefully that will do the
trick :)
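For the record, this is roughly the kind of flink-conf.yaml change I have in
mind. It is only a sketch; the key names and values below are assumptions
based on Flink's Akka/heartbeat options and should be checked against the 0.9
configuration docs before use:
# Sketch of flink-conf.yaml tweaks (key names assumed, values illustrative)
# Relax the Akka death-watch failure detector so a TM that stalls (e.g. during
# a long GC pause) is not marked dead and its slots released right away.
akka.watch.heartbeat.interval: 10 s
akka.watch.heartbeat.pause: 100 s
# Give actor RPCs (e.g. the TM registration attempts above) more time before
# they time out.
akka.ask.timeout: 100 s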
> The slot on which the task manager was scheduled was killed
> -----------------------------------------------------------
>
> Key: FLINK-2299
> URL: https://issues.apache.org/jira/browse/FLINK-2299
> Project: Flink
> Issue Type: Bug
> Affects Versions: 0.9, 0.10
> Reporter: Andra Lungu
> Priority: Critical
> Fix For: 0.9.1
>
>
> The following code:
> https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSATriangleCount.java
> run on the Twitter follower graph:
> http://twitter.mpi-sws.org/data-icwsm2010.html
> with a configuration similar to the one in FLINK-2293,
> fails with the following exception:
> java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager 57c67d938c9144bec5ba798bb8ebe636 @ wally025 - 8 slots - URL: akka.tcp://[email protected]:56135/user/taskmanager
>     at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)
>     at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
>     at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
>     at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:154)
>     at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182)
>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:421)
>     at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>     at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>     at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>     at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:36)
>     at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29)
>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>     at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29)
>     at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>     at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:92)
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>     at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
>     at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
>     at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
>     at akka.actor.ActorCell.invoke(ActorCell.scala:486)
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
>     at akka.dispatch.Mailbox.run(Mailbox.scala:221)
>     at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 06/29/2015 10:33:46 Job execution switched to status FAILING.
> The logs are here:
> https://drive.google.com/file/d/0BwnaKJcSLc43M1BhNUt5NWdINHc/view?usp=sharing