ZooKeeper will not allow a client to delete a tree when it should allow it
--------------------------------------------------------------------------

                 Key: ZOOKEEPER-1424
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1424
             Project: ZooKeeper
          Issue Type: Bug
          Components: server
    Affects Versions: 3.4.2
         Environment: Linux ubuntu 11.10, Zookeeper 3.4.2, One server, Two Java 
clients
            Reporter: Mihai Claudiu Toader


Hi all, 

While using zookeeper at midokura we hit an interesting bug in zookeeper. We 
did hit it sporadically 
while developing some functional tests so i had to build a test case for it. 

I finally created the test case and i think i narrowed down the conditions 
under which it happens. 
So i wanted to let you know my findings since they are somewhat troublesome. 

We need:
  - one running zookeeper server (didn't test that with a cluster)
      let's name this: server

  - one running zookeeper client that will create an ephemeral node under the 
tree created by the next client
      let's name this: the ephemeral client

  - one running zookeeper client that will create a persistent tree and try to 
delete that tree
      let's name this: the persistent client

What needs to happen is this:

 step 1. - the server starts
 step 2. - the persistent client connects and creates a tree
 step 3. - the ephemeral client connects and adds a ephemeral node under the 
tree created by the persistent client
 step 4. - the persistent client will try to delete the tree recursively 
(without including the ephemeral node in the multi op
 step 5. - the ephemeral client crashes hard (the equivalent of kill -9)
 step 6. - the persistent client will try to delete the tree recursively again 
(and fail with NoEmptyNode even if when we list the node we don't see any 
childrens)
    - the zookeeper server needs to be restarted in order for this to work. 

The step 4 is critical in the sense that if we don't have that (there is no 
previous error trying to remove a tree) then the nexts steps behave as we would 
expect them to behave (aka pass). 

Also no amount of fiddling with zookeeper connection timeouts (between 
zookeeper and ephemeral node) will help. 
 
If the ephemeral client is shutdown properly it seems like everything will 
behave properly (even with step 4). 

The test code is available here:
   https://github.com/mtoadermido/play

It needs an zookeepr 3.4.2 installed on the system (it uses the installed jars 
from the deb to spawn the zookeeper server).

The entry point is 
https://github.com/mtoadermido/play/blob/master/src/main/java/com/midokura/tests/zookeeper/BlockingBug.java

There is a lot of boiler plate since i didn't want it to be depending on stuff 
from midonet but the interesting part is the BlockingBug.main() method. 

It will launch a zookeeper process, an external ephemeral client process, and 
after that act as the second client. 

Available tweaks:
- the zookeeper client timeout for the ephemeral client here: 
  
https://github.com/mtoadermido/play/blob/master/src/main/java/com/midokura/tests/zookeeper/BlockingBug.java#L56

- the step 4 here (set to true / false):
 
https://github.com/mtoadermido/play/blob/master/src/main/java/com/midokura/tests/zookeeper/BlockingBug.java#L69

- the shutdown of the ephemeral client (soft aka clean shutdown, hard aka kill 
-9):
 
https://github.com/mtoadermido/play/blob/master/src/main/java/com/midokura/tests/zookeeper/BlockingBug.java#L88

The result is displayed depending on the fact that the final recursive deletion 
succeeded or not:
   
We hit it !. The clear tree failed.
   
https://github.com/mtoadermido/play/blob/master/src/main/java/com/midokura/tests/zookeeper/BlockingBug.java#L103

"No error :("  
   
https://github.com/mtoadermido/play/blob/master/src/main/java/com/midokura/tests/zookeeper/BlockingBug.java#L99


The conclusion is that the bug seems to be inside the zookeeper codebase and 
it's prone to being triggered by this 
particular usage of zookeeper combined with the misfortune of having to kill 
the ephemeral process hard. 




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to