[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

maoling updated ZOOKEEPER-1011:
-------------------------------
    Summary: fix Java Barrier Documentation example's race condition issue and 
polish up the Barrier Documentation  (was: Java Barrier Documentation example 
has a race condition issue)

> fix Java Barrier Documentation example's race condition issue and polish up 
> the Barrier Documentation
> -----------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1011
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1011
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: documentation
>            Reporter: Semih Salihoglu
>            Assignee: maoling
>            Priority: Trivial
>
> There is a race condition in the Barrier example of the java doc: 
> http://hadoop.apache.org/zookeeper/docs/current/zookeeperTutorial.html. It's 
> in the enter() method. Here's the original example:
> boolean enter() throws KeeperException, InterruptedException{
>             zk.create(root + "/" + name, new byte[0], Ids.OPEN_ACL_UNSAFE,
>                     CreateMode.EPHEMERAL_SEQUENTIAL);
>             while (true) {
>                 synchronized (mutex) {
>                     List<String> list = zk.getChildren(root, true);
>                     if (list.size() < size) {
>                         mutex.wait();
>                     } else {
>                         return true;
>                     }
>                 }
>             }
>         }
> Here's the race condition scenario:
> Let's say there are two machines/nodes: node1 and node2 that will use this 
> code to synchronize over ZK. Let's say the following steps take place:
> node1 calls the zk.create method and then reads the number of children, and 
> sees that it's 1 and starts waiting. 
> node2 calls the zk.create method (doesn't call the zk.getChildren method yet, 
> let's say it's very slow) 
> node1 is notified that the number of children on the znode changed, it checks 
> that the size is 2 so it leaves the barrier, it does its work and then leaves 
> the barrier, deleting its node.
> node2 calls zk.getChildren and because node1 has already left, it sees that 
> the number of children is equal to 1. Since node1 will never enter the 
> barrier again, it will keep waiting.
> --- End of scenario ---
> Here's Flavio's fix suggestions (copying from the email thread):
> ...
> I see two possible action points out of this discussion:
>       
> 1- State clearly in the beginning that the example discussed is not correct 
> under the assumption that a process may finish the computation before another 
> has started, and the example is there for illustration purposes;
> 2- Have another example following the current one that discusses the problem 
> and shows how to fix it. This is an interesting option that illustrates how 
> one could reason about a solution when developing with zookeeper.
> ...
> We'll go with the 2nd option.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to