[
https://issues.apache.org/jira/browse/HAMA-633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438117#comment-13438117
]
Thomas Jungblut commented on HAMA-633:
--------------------------------------
Interesting note. Yes, this looks like a good catch. What's the fix? ACK once
all messages have been received?
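
To make the ACK idea concrete, here is a minimal sketch, assuming the sender knows how many messages it sent in the current superstep and the receiver acknowledges each one it consumes. None of this is Hama code: `AckBeforeBarrier`, `PendingAcks` and `sendAndAwaitAcks` are hypothetical names, and an in-memory queue stands in for the real RPC layer. The sender only proceeds to the barrier once every message has been acknowledged, so nothing from this superstep should still be in flight when the next one starts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch (not Hama code): wait for one ACK per message before the barrier. */
public class AckBeforeBarrier {

  /** Counts the outstanding ACKs for the messages sent in one superstep. */
  static final class PendingAcks {
    private final CountDownLatch latch;

    PendingAcks(int messagesSent) {
      this.latch = new CountDownLatch(messagesSent);
    }

    /** Called once for every message the receiver has consumed. */
    void ack() {
      latch.countDown();
    }

    /** Returns true only if every message was acknowledged within the timeout. */
    boolean awaitAll(long timeout, TimeUnit unit) {
      try {
        return latch.await(timeout, unit);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }

    long outstanding() {
      return latch.getCount();
    }
  }

  /**
   * Puts the messages on an in-memory "wire", lets a receiver thread consume
   * and acknowledge them, and reports whether the sender may enter the barrier.
   */
  static boolean sendAndAwaitAcks(List<String> messages, List<String> received) {
    BlockingQueue<String> wire = new LinkedBlockingQueue<>(messages);
    PendingAcks pending = new PendingAcks(messages.size());

    Thread receiver = new Thread(() -> {
      String msg;
      while ((msg = wire.poll()) != null) {
        synchronized (received) {
          received.add(msg);
        }
        pending.ack();
      }
    });
    receiver.start();

    // The sender blocks here instead of going straight to the sync barrier.
    boolean allAcked = pending.awaitAll(5, TimeUnit.SECONDS);
    try {
      receiver.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return allAcked;
  }

  public static void main(String[] args) {
    List<String> received = new ArrayList<>();
    boolean ok = sendAndAwaitAcks(
        Arrays.asList("aggregator-update", "vertex-msg", "vertex-msg"), received);
    System.out.println("safe to enter barrier: " + ok
        + ", messages received: " + received.size());
  }
}
```

This costs an extra round trip per superstep, so it is probably more useful as a way to confirm the diagnosis than as the final fix.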
> Fix CI Failure
> --------------
>
> Key: HAMA-633
> URL: https://issues.apache.org/jira/browse/HAMA-633
> Project: Hama
> Issue Type: Bug
> Components: bsp core
> Affects Versions: 0.5.0
> Reporter: Thomas Jungblut
> Fix For: 0.6.0
>
>
> The current nightly build fails because it seems to read messages that
> actually belong to the previous superstep.
> This is also reproducible in the local runner, so it is not a problem of the
> specific RPC implementations. The problem could also be in the GraphJobRunner.
> It shows up as a NullPointerException when a non-master task gets an
> aggregation message (which actually just belongs to the master).
> {noformat}
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 4
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+ true janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.0
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 1 / hama.2;1 / 4
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+ true janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.0
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 7
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+ true janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.4572019638123739
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 1 / hama.2;1 / 7
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+ true janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.44247448197562855
> 12/08/17 23:05:52 INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x13936d5d8cc0002 type:create cxid:0x3de zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:/bsp/job_201208172305_0001/sync/51 Error:KeeperErrorCode = NodeExists for /bsp/job_201208172305_0001/sync/51
> 12/08/17 23:05:52 INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x13936d5d8cc0002 type:create cxid:0x3e8 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:/bsp/job_201208172305_0001/sync/51/ready Error:KeeperErrorCode = NodeExists for /bsp/job_201208172305_0001/sync/51/ready
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 4
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+ true janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.0
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 1 / hama.2;1 / 4
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+ true janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.0
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 7
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+ true janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.457610574551534
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 1 / hama.2;1 / 7
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+ true janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.2675231554874198
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 11
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+ NULL! false janus.apache.org:61001
> 12/08/17 23:05:52 ERROR bsp.BSPTask: Error running bsp setup and bsp function.
> java.lang.NullPointerException
> at org.apache.hama.graph.GraphJobRunner.parseMessages(GraphJobRunner.java:373)
> at org.apache.hama.graph.GraphJobRunner.bsp(GraphJobRunner.java:209)
> at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:166)
> at org.apache.hama.bsp.BSPTask.run(BSPTask.java:143)
> at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1271)
> 12/08/17 23:05:52 INFO server.PrepRequestProcessor: Processed session termination for sessionid: 0x13936d5d8cc0002
> {noformat}
> It is very difficult to track this down. My ideas were:
> - It changes the host because of fault tolerance (counterarguments: it is
> turned off, and the port is smaller than the other one)
> - Messaging is broken (which would also explain why PageRank does not
> converge anymore)
> Some more info:
> I know that this happens when the master task sends a message with the
> updated aggregator values to every slave (line 239). This single message
> should then be consumed around line 246.
> But it still remains in the buffer and is consumed after all computation,
> in line 209, in parseMessages.
> Even clearing the buffer does not fix it.
> The worst problem is that this is not reliably reproducible; the failure
> seems to happen in only two to three tenths of all builds. It seems like a
> really nasty edge case.