[ 
https://issues.apache.org/jira/browse/HAMA-633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438117#comment-13438117
 ] 

Thomas Jungblut commented on HAMA-633:
--------------------------------------

Interesting note. Yes, this looks like a good catch. What's the fix? ACK when 
all messages have been received?
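
To make the ACK idea concrete, here is a minimal sketch in plain Java. It is 
not Hama code: SuperstepAckTracker and its methods are made-up names, and it 
assumes each sender announces how many messages it sent for a superstep, so 
the receiver can tell when it has everything and only then ACK (or enter the 
barrier).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of Hama): tracks message delivery per superstep. */
final class SuperstepAckTracker {
  // Number of messages the senders announced for each superstep (assumed protocol).
  private final Map<Long, Integer> expected = new HashMap<>();
  // Number of messages that actually arrived for each superstep.
  private final Map<Long, Integer> received = new HashMap<>();

  /** Called when a sender announces how many messages it sent for a superstep. */
  synchronized void expect(long superstep, int count) {
    expected.merge(superstep, count, Integer::sum);
  }

  /** Called once for every message that arrives for the given superstep. */
  synchronized void onMessage(long superstep) {
    received.merge(superstep, 1, Integer::sum);
  }

  /** True only if a count was announced and at least that many messages arrived. */
  synchronized boolean canAck(long superstep) {
    Integer announced = expected.get(superstep);
    if (announced == null) {
      return false; // nothing announced yet, so we cannot know we are complete
    }
    return received.getOrDefault(superstep, 0) >= announced;
  }
}
```

A peer would call canAck(superstep) before leaving the sync, so a late 
aggregator message could not slip into the next superstep unnoticed.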
                
> Fix CI Failure
> --------------
>
>                 Key: HAMA-633
>                 URL: https://issues.apache.org/jira/browse/HAMA-633
>             Project: Hama
>          Issue Type: Bug
>          Components: bsp core
>    Affects Versions: 0.5.0
>            Reporter: Thomas Jungblut
>             Fix For: 0.6.0
>
>
> The current nightly build fails because it seems to read messages that 
> actually belong to the previous superstep.
> This is also reproducible in the local runner, so it is not a problem of the 
> specific RPC implementations. The problem could also be in the GraphJobRunner.
> It shows up as a NullPointerException when a non-master task gets an 
> aggregation message (which actually belongs only to the master).
> {noformat}
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 4
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.0
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 1 / hama.2;1 / 4
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.0
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 7
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.4572019638123739
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 1 / hama.2;1 / 7
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.44247448197562855
> 12/08/17 23:05:52 INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x13936d5d8cc0002 type:create cxid:0x3de zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:/bsp/job_201208172305_0001/sync/51 Error:KeeperErrorCode = NodeExists for /bsp/job_201208172305_0001/sync/51
> 12/08/17 23:05:52 INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x13936d5d8cc0002 type:create cxid:0x3e8 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:/bsp/job_201208172305_0001/sync/51/ready Error:KeeperErrorCode = NodeExists for /bsp/job_201208172305_0001/sync/51/ready
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 4
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.0
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 1 / hama.2;1 / 4
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.0
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 7
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.457610574551534
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 1 / hama.2;1 / 7
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.2675231554874198
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 11
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+ NULL! false janus.apache.org:61001
> 12/08/17 23:05:52 ERROR bsp.BSPTask: Error running bsp setup and bsp function.
> java.lang.NullPointerException
>         at org.apache.hama.graph.GraphJobRunner.parseMessages(GraphJobRunner.java:373)
>         at org.apache.hama.graph.GraphJobRunner.bsp(GraphJobRunner.java:209)
>         at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:166)
>         at org.apache.hama.bsp.BSPTask.run(BSPTask.java:143)
>         at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1271)
> 12/08/17 23:05:52 INFO server.PrepRequestProcessor: Processed session termination for sessionid: 0x13936d5d8cc0002
> {noformat}
> It is very difficult to track this down. My ideas were:
> - The host changes because of fault tolerance (counterarguments: it is 
> turned off, and the port is smaller than the other one)
> - Messaging is broken (which would also explain why PageRank does not 
> converge anymore)
> Some more info:
> I know that this happens when the master task sends a message with the 
> updated aggregator values to every slave (line 239). This single message 
> should then be consumed around line 246.
> But it remains in the buffer and is consumed after all computation, by the 
> parseMessages call in line 209. 
> Even clearing the buffer does not fix it. 
> The worst problem is that this is not reliably reproducible: the failure 
> seems to happen in only two to three tenths of all builds. It seems like a 
> really nasty edge case.
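
One way to make such a leftover message visible instead of letting it surface 
as a NullPointerException is to tag every message with the superstep it is 
meant for and to sort out everything else before parsing. The sketch below is 
plain Java under that assumption. TaggedMessage and StaleMessageFilter are 
hypothetical names, not classes from GraphJobRunner or the Hama messaging 
layer.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical envelope (not a Hama class): a payload plus its target superstep. */
final class TaggedMessage {
  final long targetSuperstep;
  final String payload;

  TaggedMessage(long targetSuperstep, String payload) {
    this.targetSuperstep = targetSuperstep;
    this.payload = payload;
  }
}

/** Hypothetical helper: separates current messages from leftovers of other supersteps. */
final class StaleMessageFilter {
  private StaleMessageFilter() {}

  /**
   * Returns the messages meant for currentSuperstep, in their original order.
   * Every other message is appended to staleOut so it can be logged or inspected.
   */
  static List<TaggedMessage> takeCurrent(
      List<TaggedMessage> buffer, long currentSuperstep, List<TaggedMessage> staleOut) {
    List<TaggedMessage> current = new ArrayList<>();
    for (TaggedMessage m : buffer) {
      if (m.targetSuperstep == currentSuperstep) {
        current.add(m);
      } else {
        staleOut.add(m); // e.g. an aggregator update left over from the previous step
      }
    }
    return current;
  }
}
```

Logging the contents of staleOut on the failing builds would show whether the 
aggregator update really is the message that arrives one superstep too late.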

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
