Thomas Jungblut created HAMA-633:
------------------------------------

             Summary: Fix CI Failure
                 Key: HAMA-633
                 URL: https://issues.apache.org/jira/browse/HAMA-633
             Project: Hama
          Issue Type: Bug
          Components: bsp core
    Affects Versions: 0.5.0
            Reporter: Thomas Jungblut
             Fix For: 0.6.0


The current nightly fails because it seems to read messages that actually 
belong to the previous superstep.

This is also reproducible in the local runner, so it is not a problem of the 
specific RPC implementations. The problem could also be in the GraphJobRunner.

It shows up as a NullPointerException when a non-master task receives an 
aggregation message (which actually belongs only to the master).
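
As a debugging aid, a guard like the following could turn the bare NullPointerException into an error that names the peer and the superstep. This is only a sketch: `Kind`, `Message` and `check` are hypothetical names made up for illustration and are not part of the Hama API or of GraphJobRunner.

```java
/** Hypothetical sketch; none of these types are Hama classes. */
public class AggregatorGuardSketch {

  /** Assumed message kinds: ordinary vertex messages and master-only aggregation messages. */
  enum Kind { VERTEX, AGGREGATOR_FOR_MASTER }

  static final class Message {
    final Kind kind;
    final String payload;

    Message(Kind kind, String payload) {
      this.kind = kind;
      this.payload = payload;
    }
  }

  /**
   * Fails fast with a descriptive error if a master-only message
   * reaches a peer that is not the master.
   */
  static void check(Message msg, boolean isMaster, String peerName, long superstep) {
    if (msg.kind == Kind.AGGREGATOR_FOR_MASTER && !isMaster) {
      throw new IllegalStateException("Non-master peer " + peerName
          + " received a master-only aggregation message in superstep " + superstep);
    }
  }

  public static void main(String[] args) {
    Message stray = new Message(Kind.AGGREGATOR_FOR_MASTER, "aggregated value");
    try {
      check(stray, false, "peer-1", 11L);
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Such a check does not fix the underlying ordering problem; it would only make the failing build log easier to read.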

{noformat}
12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 4
12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true 
janus.apache.org:61002

12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.0

12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 1 / hama.2;1 / 4

12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true 
janus.apache.org:61002
12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.0
12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 7
12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true 
janus.apache.org:61002
12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.4572019638123739
12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 1 / hama.2;1 / 7
12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true 
janus.apache.org:61002
12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.44247448197562855

12/08/17 23:05:52 INFO server.PrepRequestProcessor: Got user-level 
KeeperException when processing sessionid:0x13936d5d8cc0002 type:create 
cxid:0x3de zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error 
Path:/bsp/job_201208172305_0001/sync/51 Error:KeeperErrorCode = NodeExists for 
/bsp/job_201208172305_0001/sync/51

12/08/17 23:05:52 INFO server.PrepRequestProcessor: Got user-level 
KeeperException when processing sessionid:0x13936d5d8cc0002 type:create 
cxid:0x3e8 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error 
Path:/bsp/job_201208172305_0001/sync/51/ready Error:KeeperErrorCode = 
NodeExists for /bsp/job_201208172305_0001/sync/51/ready

12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 4
12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true 
janus.apache.org:61002
12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.0

12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 1 / hama.2;1 / 4
12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true 
janus.apache.org:61002
12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.0
12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 7
12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true 
janus.apache.org:61002
12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.457610574551534
12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 1 / hama.2;1 / 7
12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+  true 
janus.apache.org:61002
12/08/17 23:05:52 INFO graph.GraphJobRunner: _________++ VAL=0.2675231554874198

12/08/17 23:05:52 INFO graph.GraphJobRunner: _________ 0 / hama.2;0 / 11
12/08/17 23:05:52 INFO graph.GraphJobRunner: _________+ NULL! false 
janus.apache.org:61001

12/08/17 23:05:52 ERROR bsp.BSPTask: Error running bsp setup and bsp function.
java.lang.NullPointerException
        at 
org.apache.hama.graph.GraphJobRunner.parseMessages(GraphJobRunner.java:373)
        at org.apache.hama.graph.GraphJobRunner.bsp(GraphJobRunner.java:209)
        at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:166)
        at org.apache.hama.bsp.BSPTask.run(BSPTask.java:143)
        at 
org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1271)

12/08/17 23:05:52 INFO server.PrepRequestProcessor: Processed session 
termination for sessionid: 0x13936d5d8cc0002

{noformat}

It is very difficult to track this down. My ideas were:

- It changes the host because of fault tolerance (counterarguments: it's turned 
off and the port is smaller than the other one)
- Messaging is broken (would also explain why pagerank does not converge 
anymore)

Some more info:

I know that this happens when the master task sends a message with the updated 
aggregator values to every slave (line 239). This single message should then be 
consumed at around line 246.

But it remains in the buffer and is consumed only after all computation, by the 
parseMessages call in line 209. 
Even clearing the buffer does not fix it. 
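
One way to make such leftover messages visible is to stamp every message with the superstep in which it was sent and to separate out anything that does not match on receipt. The sketch below only illustrates the idea under that assumption; `TaggedMessage` and `drain` are hypothetical names, and the real Hama message classes carry no such tag as far as this report shows.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical sketch; none of these types are Hama classes. */
public class SuperstepTagSketch {

  /** A message stamped with the superstep in which it was sent. */
  static final class TaggedMessage {
    final long sentInSuperstep;
    final String payload;

    TaggedMessage(long sentInSuperstep, String payload) {
      this.sentInSuperstep = sentInSuperstep;
      this.payload = payload;
    }
  }

  /**
   * Empties the inbox. Messages sent in the expected superstep are returned;
   * all others are collected in 'stale' so they can be logged.
   */
  static List<TaggedMessage> drain(Queue<TaggedMessage> inbox,
                                   long expectedSentIn,
                                   List<TaggedMessage> stale) {
    List<TaggedMessage> current = new ArrayList<>();
    TaggedMessage m;
    while ((m = inbox.poll()) != null) {
      if (m.sentInSuperstep == expectedSentIn) {
        current.add(m);
      } else {
        stale.add(m);
      }
    }
    return current;
  }

  public static void main(String[] args) {
    Queue<TaggedMessage> inbox = new ArrayDeque<>();
    inbox.add(new TaggedMessage(10, "leftover aggregator update"));
    inbox.add(new TaggedMessage(11, "vertex message"));

    List<TaggedMessage> stale = new ArrayList<>();
    List<TaggedMessage> current = drain(inbox, 11, stale);
    System.out.println("current=" + current.size() + " stale=" + stale.size());
  }
}
```

Logging the stale list together with the sender would show whether the aggregator message really survives the sync, or whether it is delivered late.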


The worst problem is that this is not reliably reproducible: the failure seems 
to happen in only two to three tenths of all builds. It seems like a really 
nasty edge case.
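
Since the failure is intermittent, running the same job repeatedly and counting the failures gives a rough failure rate to compare before and after a candidate fix. The harness below is a generic sketch; `countFailures` is a hypothetical helper, and the trial in `main` is a deterministic stand-in rather than a real Hama job.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch of a repeat-and-count harness for a flaky job. */
public class FlakyRateSketch {

  /** Runs the trial 'runs' times and returns how many runs threw an exception. */
  static int countFailures(Callable<?> trial, int runs) {
    int failures = 0;
    for (int i = 0; i < runs; i++) {
      try {
        trial.call();
      } catch (Exception e) {
        failures++;
      }
    }
    return failures;
  }

  public static void main(String[] args) {
    // Stand-in trial that fails on every 4th call; replace with the real job.
    final int[] calls = {0};
    Callable<Object> trial = () -> {
      if (++calls[0] % 4 == 0) {
        throw new NullPointerException("simulated failure");
      }
      return null;
    };

    int runs = 20;
    int failures = countFailures(trial, runs);
    System.out.println(failures + " of " + runs + " runs failed");
  }
}
```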

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
