[
https://issues.apache.org/jira/browse/ZOOKEEPER-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233843#comment-13233843
]
Ted Dunning commented on ZOOKEEPER-1424:
----------------------------------------
OK. This is strange. I made some changes at git://github.com/tdunning/play.git
These include:
- made the test run as a test instead of as main. (trivial change to allow mvn
test to work)
- changed the classpath logic to look in ../zookeeper-3.4.5 for a dev version
of ZooKeeper. (trivial change for convenience)
- unrolled all of the multis into a loop that calls each op separately. (this
is the money)
Around line 139, I have two versions of an unrolled delete. One is this:
{code}
zooKeeper.delete(op.getPath(), -1);
{code}
and the other is this:
{code}
zooKeeper.multi(ImmutableList.of(op));
{code}
These should be equivalent.
They are not.
So the problem has nothing really to do with the multi-ness and seems to have
something to do with the way that multi does one of the single deletions.
I don't have more time today, but hopefully this puts the ball a little further
down the field.
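
For anyone following along, here is a minimal sketch of the unrolling idea in plain Java. It is not the code in the repository: `UnrolledDelete`, `SingleDelete`, `childrenFirst` and `deleteOneByOne` are hypothetical names, and the sketch makes no ZooKeeper calls at all. It only shows how a recursive delete can be ordered children-first and then issued one op at a time, so that the first failing path is reported.

```java
import java.util.ArrayList;
import java.util.List;

final class UnrolledDelete {

    /** Hypothetical stand-in for one delete call against the server. */
    interface SingleDelete {
        void delete(String path) throws Exception;
    }

    /** Number of path components, e.g. "/a/b" has depth 2. */
    static int depth(String path) {
        int count = 0;
        for (int i = 0; i < path.length(); i++) {
            if (path.charAt(i) == '/') {
                count++;
            }
        }
        return count;
    }

    /** Returns a copy of the paths ordered deepest first, so children precede parents. */
    static List<String> childrenFirst(List<String> paths) {
        List<String> ordered = new ArrayList<>(paths);
        ordered.sort((a, b) -> {
            int byDepth = Integer.compare(depth(b), depth(a));
            return byDepth != 0 ? byDepth : a.compareTo(b);
        });
        return ordered;
    }

    /**
     * Issues the deletes one at a time and stops at the first failure.
     * Returns the path whose delete failed, or null if every delete succeeded.
     */
    static String deleteOneByOne(List<String> orderedPaths, SingleDelete deleter) {
        for (String path : orderedPaths) {
            try {
                deleter.delete(path);
            } catch (Exception e) {
                return path;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> tree = List.of("/a", "/a/b", "/a/b/c", "/a/d");
        // Fake deleter for illustration only: it just logs the call.
        String failed = deleteOneByOne(childrenFirst(tree),
                path -> System.out.println("delete " + path));
        System.out.println("first failure: " + failed);
    }
}
```

In a real test the `SingleDelete` lambda would wrap either of the two calls shown above, which makes it easy to switch between them and see which path is the one that fails.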
> ZooKeeper will not allow a client to delete a tree when it should allow it
> --------------------------------------------------------------------------
>
> Key: ZOOKEEPER-1424
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1424
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.2
> Environment: Linux ubuntu 11.10, Zookeeper 3.4.2, One server, Two
> Java clients
> Reporter: Mihai Claudiu Toader
> Attachments: zookeeper.log
>
>
> Hi all,
> While using ZooKeeper at Midokura we hit an interesting bug in ZooKeeper. We
> hit it sporadically while developing some functional tests, so I had to
> build a test case for it.
> I finally created the test case and I think I narrowed down the conditions
> under which it happens.
> So I wanted to let you know my findings, since they are somewhat troublesome.
> We need:
> - one running zookeeper server (didn't test that with a cluster)
> let's name this: server
> - one running zookeeper client that will create an ephemeral node under the
> tree created by the next client
> let's name this: the ephemeral client
> - one running zookeeper client that will create a persistent tree and try
> to delete that tree
> let's name this: the persistent client
> What needs to happen is this:
> step 1. - the server starts
> step 2. - the persistent client connects and creates a tree
> step 3. - the ephemeral client connects and adds an ephemeral node under the
> tree created by the persistent client
> step 4. - the persistent client will try to delete the tree recursively
> (without including the ephemeral node in the multi op)
> step 5. - the ephemeral client crashes hard (the equivalent of kill -9)
> step 6. - the persistent client will try to delete the tree recursively
> again (and fail with a NotEmpty error even though listing the node shows no
> children)
> - the ZooKeeper server needs to be restarted before the deletion will succeed.
> Step 4 is critical: without it (that is, with no previous failed attempt to
> remove the tree), the next steps behave as we would expect (i.e. they pass).
> Also, no amount of fiddling with ZooKeeper connection timeouts (between
> ZooKeeper and the ephemeral client) will help.
>
> If the ephemeral client is shut down properly, it seems like everything will
> behave properly (even with step 4).
> The test code is available here:
> https://github.com/mtoadermido/play
> It needs ZooKeeper 3.4.2 installed on the system (it uses the installed
> jars from the deb to spawn the ZooKeeper server).
> The entry point is
> https://github.com/mtoadermido/play/blob/master/src/main/java/com/midokura/tests/zookeeper/BlockingBug.java
> There is a lot of boilerplate, since I didn't want it to depend on
> stuff from midonet, but the interesting part is the BlockingBug.main() method.
> It will launch a ZooKeeper process and an external ephemeral client process,
> and after that act as the second client.
> Available tweaks:
> - the zookeeper client timeout for the ephemeral client here:
>
> https://github.com/mtoadermido/play/blob/master/src/main/java/com/midokura/tests/zookeeper/BlockingBug.java#L56
> - the step 4 here (set to true / false):
>
> https://github.com/mtoadermido/play/blob/master/src/main/java/com/midokura/tests/zookeeper/BlockingBug.java#L69
> - the shutdown of the ephemeral client (soft aka clean shutdown, hard aka
> kill -9):
>
> https://github.com/mtoadermido/play/blob/master/src/main/java/com/midokura/tests/zookeeper/BlockingBug.java#L88
> The result displayed depends on whether the final recursive deletion
> succeeded or not:
>
> "We hit it !. The clear tree failed."
>
> https://github.com/mtoadermido/play/blob/master/src/main/java/com/midokura/tests/zookeeper/BlockingBug.java#L103
> "No error :("
>
> https://github.com/mtoadermido/play/blob/master/src/main/java/com/midokura/tests/zookeeper/BlockingBug.java#L99
> The conclusion is that the bug seems to be inside the ZooKeeper codebase and
> it's prone to being triggered by this particular usage of ZooKeeper combined
> with the misfortune of having to kill the ephemeral process hard.