[ https://issues.apache.org/jira/browse/GIRAPH-46?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Avery Ching updated GIRAPH-46:
------------------------------

    Attachment: diff.txt

Aapo reported success, and I was able to run the unit tests against LocalJobRunner and my local Hadoop instance.

> Race condition on superstep 1 with RPC servers not started by the time that requests are sent
> ---------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-46
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-46
>             Project: Giraph
>          Issue Type: Bug
>    Affects Versions: 0.70.0
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>            Priority: Minor
>             Fix For: 0.70.0
>
>         Attachments: diff.txt
>
>
> Hi,
> occasionally (maybe one time in four), my Giraph run fails because of the
> RuntimeException below.
> According to the code, it should never happen:
>     if (msgMap == null) { // should never happen after constructor
>         throw new RuntimeException(
>             "sendMessage: msgMap did not exist for " + addr +
>             " for vertex " + destVertex);
>     }
> This happens during superstep 1 (the second superstep). My application actually
> *adds* edges on superstep 1 (to make every out-edge also an in-edge of the
> destination), but since I am running on only 3 workers, I would be surprised
> if not every worker had been registered in the RPC layer initially.
> One hypothesis is that Hadoop does something funny, because one of my servers
> was under heavy load. Maybe Hadoop launched another worker to replace a slow
> worker? Can that happen?
> java.lang.RuntimeException: sendMessage: msgMap did not exist for [hostname].ml.cmu.edu:30003 for vertex 875713
>     at org.apache.giraph.comm.BasicRPCCommunications.sendMessageReq(BasicRPCCommunications.java:825)
>     at org.apache.giraph.graph.BasicVertex.sendMsg(BasicVertex.java:179)
>     at edu.cmu.selectlab.BP.BinaryBPVertex.compute(BinaryBPVertex.java:94)
>     at org.apache.giraph.graph.GraphMapper.map(GraphMapper.java:624)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>     at org.apache.hadoop.mapred.Child.main(Child.java:253)
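
For readers following the thread, the race named in the issue title (requests sent before every RPC server has started) can be illustrated with a small, self-contained Java sketch. This is not the content of diff.txt, which is not reproduced in this message, and it does not use any Giraph API. The class PeerMessageMaps, its methods registerPeer, sendMessage and pendingCount, the worker addresses, and the timeout parameter are all hypothetical names chosen for illustration. The idea shown is only one possible approach: a sender waits on a latch until every expected peer has registered, so the per-peer message map exists before it is looked up.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a readiness gate; not Giraph's BasicRPCCommunications. */
public class PeerMessageMaps {
    // One outgoing-message queue per peer address, created when that peer registers.
    private final Map<String, Queue<String>> msgMaps = new ConcurrentHashMap<>();
    // Reaches zero once every expected peer has registered.
    private final CountDownLatch allPeersReady;

    public PeerMessageMaps(int expectedPeers) {
        this.allPeersReady = new CountDownLatch(expectedPeers);
    }

    /** Called once a peer's RPC server is known to be up (assumed signal). */
    public void registerPeer(String addr) {
        // Count down only on first registration, so duplicates cannot open the gate early.
        if (msgMaps.putIfAbsent(addr, new ConcurrentLinkedQueue<>()) == null) {
            allPeersReady.countDown();
        }
    }

    /** Blocks until all peers are registered, then queues the message. */
    public void sendMessage(String addr, String msg, long timeoutMs)
            throws InterruptedException {
        if (!allPeersReady.await(timeoutMs, TimeUnit.MILLISECONDS)) {
            throw new IllegalStateException(
                "sendMessage: peers not ready after " + timeoutMs + " ms");
        }
        Queue<String> queue = msgMaps.get(addr);
        if (queue == null) {
            // Still possible for an address that was never expected at all.
            throw new IllegalStateException(
                "sendMessage: msgMap did not exist for " + addr);
        }
        queue.add(msg);
    }

    /** Number of messages queued for a peer; 0 if the peer is unknown. */
    public int pendingCount(String addr) {
        Queue<String> queue = msgMaps.get(addr);
        return queue == null ? 0 : queue.size();
    }

    public static void main(String[] args) throws Exception {
        PeerMessageMaps maps = new PeerMessageMaps(2);
        maps.registerPeer("worker-a:30002");
        // Simulate a slow worker whose server comes up late.
        Thread late = new Thread(() -> {
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            maps.registerPeer("worker-b:30003");
        });
        late.start();
        // Without the latch this lookup could run before worker-b registers.
        maps.sendMessage("worker-b:30003", "hello", 5000);
        late.join();
        System.out.println(maps.pendingCount("worker-b:30003"));
    }
}
```

In this sketch the sender pays a one-time wait instead of failing on a missing map. Whether a barrier, a retry, or a different ordering of server startup suits the real code is a design question the sketch does not settle.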