[ https://issues.apache.org/jira/browse/HAMA-633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438602#comment-13438602 ]

Suraj Menon commented on HAMA-633:
----------------------------------

First of all, the problem we are discussing here might not be the reason you are 
seeing this issue. We can take this discussion elsewhere. 

The problem here is not that messages are arriving late, but that they are 
arriving early. Let's say that at leaveBarrier the current superstep number is 
10, which means we are accepting messages for superstep 11. Before we increment 
the current superstep to 11 and initialize the queue for superstep 12, there is 
a chance that a peer has already sent a message meant to be processed in 
superstep 12. That message would never be seen in the right superstep, because 
it gets consumed by the message queue for superstep 11 and is processed by the 
peer in superstep 11 instead of 12. 

Tagging queues with superstep numbers (which would probably create a maximum of 
3 queues at a time) could be a solution for this; see the sketch below. Your 
idea would work too: we can maintain state information per queue. Going forward, 
though, it would be tough to keep that total-message state with an asynchronous 
messenger and selective superstep progress.
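
For illustration, here is a minimal sketch of what superstep-tagged queues could 
look like. All class and method names are hypothetical and this is not the actual 
Hama messenger API; it only shows the idea of keying incoming queues by the 
target superstep.

{code:java}
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentMap;

/**
 * Sketch: keep one incoming queue per superstep, keyed by the superstep the
 * message is destined for, so a message that arrives "early" (for superstep
 * n + 1) can never be drained together with the superstep-n messages.
 */
public class SuperstepTaggedQueues<M> {

  // Only the current and the next superstep (and briefly the previous one)
  // are alive at any time, so this map holds at most ~3 entries.
  private final ConcurrentMap<Long, Queue<M>> queues =
      new ConcurrentHashMap<Long, Queue<M>>();

  private volatile long currentSuperstep = 0;

  /** Receiver side: the sender tags every message with its target superstep. */
  public void onReceive(long targetSuperstep, M message) {
    Queue<M> queue = queues.get(targetSuperstep);
    if (queue == null) {
      Queue<M> fresh = new ConcurrentLinkedQueue<M>();
      queue = queues.putIfAbsent(targetSuperstep, fresh);
      if (queue == null) {
        queue = fresh;
      }
    }
    queue.add(message);
  }

  /** Drain only the messages that belong to the superstep we are about to run. */
  public Queue<M> messagesFor(long superstep) {
    Queue<M> queue = queues.get(superstep);
    return queue != null ? queue : new ConcurrentLinkedQueue<M>();
  }

  /** Called after leaveBarrier, once no peer can deliver for the old superstep anymore. */
  public void advanceSuperstep() {
    queues.remove(currentSuperstep); // the finished superstep's queue is dropped
    currentSuperstep++;
    // The queue for the next superstep is created lazily in onReceive(), so an
    // early message from a fast peer lands in its own queue instead of ours.
  }
}
{code}

The point is that an early message from a fast peer lands in the queue keyed by 
its own target superstep, so draining the superstep-11 queue can never pick up a 
superstep-12 message.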
                
> Fix CI Failure
> --------------
>
>                 Key: HAMA-633
>                 URL: https://issues.apache.org/jira/browse/HAMA-633
>             Project: Hama
>          Issue Type: Bug
>          Components: bsp core
>    Affects Versions: 0.5.0
>            Reporter: Thomas Jungblut
>             Fix For: 0.6.0
>
>
> The current nightly fails because it seems to read messages that actually 
> belong to the previous superstep.
> This is reproducible also in the local runner, so it is not a problem of the 
> specific RPC implementations. The problem could also be in the GraphJobRunner.
> It shows up as a NullPointerException when a non-master task gets an 
> aggregation message (which actually belongs only to the master).
> {noformat}
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 4
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true 
> janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.0
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 1 / hama.2;1 / 4
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true 
> janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.0
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 7
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true 
> janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ 
> VAL=0.4572019638123739
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 1 / hama.2;1 / 7
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true 
> janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ 
> VAL=0.44247448197562855
> 12/08/17 23:05:52 INFO server.PrepRequestProcessor: Got user-level 
> KeeperException when processing sessionid:0x13936d5d8cc0002 type:create 
> cxid:0x3de zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error 
> Path:/bsp/job_201208172305_0001/sync/51 Error:KeeperErrorCode = NodeExists 
> for /bsp/job_201208172305_0001/sync/51
> 12/08/17 23:05:52 INFO server.PrepRequestProcessor: Got user-level 
> KeeperException when processing sessionid:0x13936d5d8cc0002 type:create 
> cxid:0x3e8 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error 
> Path:/bsp/job_201208172305_0001/sync/51/ready Error:KeeperErrorCode = 
> NodeExists for /bsp/job_201208172305_0001/sync/51/ready
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 4
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true 
> janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.0
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 1 / hama.2;1 / 4
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true 
> janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.0
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 7
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true 
> janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.457610574551534
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 1 / hama.2;1 / 7
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true 
> janus.apache.org:61002
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ 
> VAL=0.2675231554874198
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 11
> 12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+ NULL! false 
> janus.apache.org:61001
> 12/08/17 23:05:52 ERROR bsp.BSPTask: Error running bsp setup and bsp function.
> java.lang.NullPointerException
>         at 
> org.apache.hama.graph.GraphJobRunner.parseMessages(GraphJobRunner.java:373)
>         at org.apache.hama.graph.GraphJobRunner.bsp(GraphJobRunner.java:209)
>         at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:166)
>         at org.apache.hama.bsp.BSPTask.run(BSPTask.java:143)
>         at 
> org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1271)
> 12/08/17 23:05:52 INFO server.PrepRequestProcessor: Processed session 
> termination for sessionid: 0x13936d5d8cc0002
> {noformat}
> It is very difficult to track this down; my ideas were:
> - It changes the host because of fault tolerance (counter-arguments: it is 
> turned off, and the port is smaller than the other one)
> - Messaging is broken (which would also explain why PageRank does not converge 
> anymore)
> Some more info:
> I know that this happens when the master task sends a message with the updated 
> aggregator values to every slave (line 239). This single message should then 
> be consumed around line 246.
> But it still remains in the buffer and is consumed after all the computation, 
> at line 209 in parseMessages. 
> Even clearing the buffer does not fix it. 
> The worst problem is that this is not reliably reproducible; the failure seems 
> to happen in only two to three tenths of all builds. Seems like some really 
> nasty edge case.
