[ 
https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105044#comment-13105044
 ] 

ChiaHung Lin commented on HAMA-387:
-----------------------------------

If I am correct, the original code does not handle 
KeeperException.NodeExistsException, which is thrown when the znode to be 
created already exists. We have several GroomServers creating znodes (e.g. 
JobId/superstep/TaskId) on ZooKeeper; therefore, two (or more) BSPPeers can 
try to write the same znode, which is a check-then-act race. For example, two 
BSPPeers simultaneously check whether the znode exists (zk.exists(path)), and 
both decide to create it (zk.create(path...)) because the Stat returned is 
null, indicating that no znode exists. One BSPPeer writes faster than the 
other, so the second BSPPeer fails to create the znode because it already 
exists. All computation then hangs because 
`list.size() < jobConf.getNumBspTask()' is always true in the while loop. 
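
A minimal sketch of one way to avoid the race, assuming it is acceptable to 
skip the exists() check and treat "node already exists" as success. To keep it 
runnable without a ZooKeeper server, InMemoryStore and NodeExistsException 
below are hypothetical stand-ins for the ZooKeeper client and 
KeeperException.NodeExistsException; they are not Hama or ZooKeeper code, and 
the path is illustrative only. 

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

final class BarrierSketch {

    /** Hypothetical stand-in for KeeperException.NodeExistsException. */
    static final class NodeExistsException extends Exception {
        NodeExistsException(String path) {
            super("node already exists: " + path);
        }
    }

    /** Hypothetical in-memory stand-in for the exists/create calls. */
    static final class InMemoryStore {
        private final ConcurrentMap<String, byte[]> nodes = new ConcurrentHashMap<>();

        boolean exists(String path) {
            return nodes.containsKey(path);
        }

        void create(String path, byte[] data) throws NodeExistsException {
            // putIfAbsent is atomic, so exactly one concurrent caller wins.
            if (nodes.putIfAbsent(path, data) != null) {
                throw new NodeExistsException(path);
            }
        }
    }

    /**
     * Creates the node without a prior exists() check.
     * Returns true if this caller created it, false if another peer did.
     */
    static boolean ensureNode(InMemoryStore store, String path) {
        try {
            store.create(path, new byte[0]);
            return true;
        } catch (NodeExistsException e) {
            // Another peer was faster; the node is in place, so carry on.
            return false;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        InMemoryStore store = new InMemoryStore();
        String path = "/job_0001/0/task_0"; // illustrative path only
        AtomicInteger creators = new AtomicInteger();

        // Two "peers" race to create the same node.
        Runnable peer = () -> {
            if (ensureNode(store, path)) {
                creators.incrementAndGet();
            }
        };
        Thread a = new Thread(peer);
        Thread b = new Thread(peer);
        a.start();
        b.start();
        a.join();
        b.join();

        System.out.println("creators=" + creators.get()
                + ", exists=" + store.exists(path));
    }
}
```

With this shape the slower peer no longer fails: it catches the exception and 
continues as if it had created the node itself. 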

As for the ArrayIndexOutOfBoundsException, it seems that the peerName 
parameter passed to BSPPeer.send() is malformed. It should be encoded as 
host:port (getAddress() splits peerName on `:' into an array). 
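
A minimal sketch of validating the peer name before indexing into the split 
array, assuming the expected format is exactly host:port. The class and method 
names (PeerNameSketch, parsePeerName) are hypothetical and this is not the 
actual getAddress() implementation. 

```java
import java.net.InetSocketAddress;

final class PeerNameSketch {

    /**
     * Parses "host:port" and fails with a descriptive message instead of
     * an ArrayIndexOutOfBoundsException when the input is malformed.
     */
    static InetSocketAddress parsePeerName(String peerName) {
        if (peerName == null) {
            throw new IllegalArgumentException("peerName is null");
        }
        String[] parts = peerName.split(":");
        if (parts.length != 2 || parts[0].isEmpty()) {
            throw new IllegalArgumentException(
                    "expected host:port, got: " + peerName);
        }
        int port;
        try {
            port = Integer.parseInt(parts[1]);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                    "invalid port in peer name: " + peerName, e);
        }
        // createUnresolved avoids a DNS lookup and rejects out-of-range ports.
        return InetSocketAddress.createUnresolved(parts[0], port);
    }

    public static void main(String[] args) {
        InetSocketAddress addr = parsePeerName("localhost:61000");
        System.out.println(addr.getHostString() + " " + addr.getPort());
    }
}
```

Failing early like this would at least report the offending peerName value, 
which should help track down where the malformed name comes from. 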


> Advanced Barrier Synchronization
> --------------------------------
>
>                 Key: HAMA-387
>                 URL: https://issues.apache.org/jira/browse/HAMA-387
>             Project: Hama
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.3.0
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>             Fix For: 0.4.0
>
>         Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch, 
> HAMA-387_v04.patch, new.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think, the lock file must include:
>  * the job ID
>  * the task ID of the lock file owner
>  * the current superstep count
> to check ownership and validity.
> Currently they are named by hostname, but multiple tasks may run per 
> GroomServer in the future. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
