[ 
https://issues.apache.org/jira/browse/FLINK-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609140#comment-14609140
 ] 

Andra Lungu commented on FLINK-2299:
------------------------------------

Hi [~StephanEwen],

First of all, thanks a lot for looking into this! I know you're super busy.
Let me try to answer your questions. 

For the `heavy collisions in keys` issue, I did follow the mailing list 
discussions; otherwise the code wouldn't have worked at all. What I did was 
add a join hint: 
`.join(this.vertices, JoinOperatorBase.JoinHint.BROADCAST_HASH_SECOND).where(1).equalTo(0)`. 
The problem is that I only did that for one of the joins; I will apply the same 
fix to all the others and include this job in my nightly build. 
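
To make the effect of the hint concrete, here is a minimal plain-Java sketch of what a broadcast hash join with the second input as build side amounts to. It is not Flink code: the class name, the `join` helper and the `long[]` tuple layout are made up for illustration. It only assumes that edges are (src, dst) pairs, that vertices are (id, value) pairs, and that the join key is field 1 of the first input and field 0 of the second, as in `where(1).equalTo(0)`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BroadcastHashJoinSketch {

    // Hypothetical helper: joins edges (src, dst) with vertices (id, value)
    // on edges[1] == vertices[0]. The second input is the hashed build side.
    static List<long[]> join(List<long[]> edges, List<long[]> vertices) {
        // Build phase: hash the second input by its key (field 0).
        Map<Long, List<long[]>> table = new HashMap<>();
        for (long[] v : vertices) {
            table.computeIfAbsent(v[0], k -> new ArrayList<>()).add(v);
        }

        // Probe phase: stream the first input and look up its field 1.
        List<long[]> out = new ArrayList<>();
        for (long[] e : edges) {
            List<long[]> matches =
                table.getOrDefault(e[1], Collections.<long[]>emptyList());
            for (long[] v : matches) {
                out.add(new long[] {e[0], e[1], v[1]});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<long[]> edges = new ArrayList<>();
        edges.add(new long[] {1, 5});
        edges.add(new long[] {2, 5});
        edges.add(new long[] {3, 7});

        List<long[]> vertices = new ArrayList<>();
        vertices.add(new long[] {5, 50});

        for (long[] row : join(edges, vertices)) {
            System.out.println(row[0] + "," + row[1] + "," + row[2]);
        }
    }
}
```

The point of the sketch is that only the second input is held in a hash table, so the first input can be streamed past it no matter how many of its records share the same key.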

Yes, so the problem there (GroupReduce at main(TriangleCount.java:51)) is that 
it takes pairs of the form (5,1), (5,2), (5,3), etc. and does a groupBy.reduce 
to create a TreeMap, which for a highly skewed node in the Twitter follower 
graph results in a huge data structure. Is TreeMap the wrong candidate for the 
job because of its memory footprint? Anyway, your assumption is correct. This 
is, however, exactly what I am trying to show: that skewed nodes drastically 
affect computation. 
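
As a rough illustration of why skew hurts here, the following plain-Java sketch groups (vertex, neighbor) pairs by vertex and collects each group into a TreeMap. It is not the code from TriangleCount.java: the class name, the `groupNeighbors` helper and the choice of a neighbor-to-count map are assumptions made only for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SkewedGroupSketch {

    // Hypothetical helper: groups pairs (vertex, neighbor) by vertex and
    // builds one TreeMap per vertex, mapping neighbor id to occurrence count.
    static Map<Long, TreeMap<Long, Integer>> groupNeighbors(List<long[]> pairs) {
        Map<Long, TreeMap<Long, Integer>> perVertex = new HashMap<>();
        for (long[] p : pairs) {
            perVertex
                .computeIfAbsent(p[0], k -> new TreeMap<>())
                .merge(p[1], 1, Integer::sum);
        }
        return perVertex;
    }

    public static void main(String[] args) {
        List<long[]> pairs = new ArrayList<>();
        // Vertex 5 plays the role of a hub with many neighbors.
        for (long n = 1; n <= 100_000; n++) {
            pairs.add(new long[] {5, n});
        }
        // Vertex 6 is an ordinary low-degree vertex.
        pairs.add(new long[] {6, 1});
        pairs.add(new long[] {6, 2});

        Map<Long, TreeMap<Long, Integer>> grouped = groupNeighbors(pairs);
        for (Map.Entry<Long, TreeMap<Long, Integer>> e : grouped.entrySet()) {
            System.out.println("vertex " + e.getKey()
                + " -> TreeMap entries: " + e.getValue().size());
        }
    }
}
```

Each group's map grows linearly with the degree of its vertex, and every entry carries per-node object overhead, so a single hub vertex can dominate the memory of whichever task receives its group.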

> The slot on which the task manager was scheduled was killed
> -----------------------------------------------------------
>
>                 Key: FLINK-2299
>                 URL: https://issues.apache.org/jira/browse/FLINK-2299
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 0.9, 0.10
>            Reporter: Andra Lungu
>            Priority: Critical
>             Fix For: 0.9.1
>
>
> The following code: 
> https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSATriangleCount.java
> Ran on the twitter follower graph: 
> http://twitter.mpi-sws.org/data-icwsm2010.html 
> With a similar configuration to the one in FLINK-2293
> fails with the following exception:
> java.lang.Exception: The slot in which the task was executed has been 
> released. Probably loss of TaskManager 57c67d938c9144bec5ba798bb8ebe636 @ 
> wally025 - 8 slots - URL: 
> akka.tcp://[email protected]:56135/user/taskmanager
>         at 
> org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)
>         at 
> org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
>         at 
> org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
>         at 
> org.apache.flink.runtime.instance.Instance.markDead(Instance.java:154)
>         at 
> org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182)
>         at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:421)
>         at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>         at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>         at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>         at 
> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:36)
>         at 
> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29)
>         at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>         at 
> org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29)
>         at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>         at 
> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:92)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>         at 
> akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
>         at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
>         at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:486)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:221)
>         at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
>         at 
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 06/29/2015 10:33:46     Job execution switched to status FAILING.
> The logs are here:
> https://drive.google.com/file/d/0BwnaKJcSLc43M1BhNUt5NWdINHc/view?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
