[ 
https://issues.apache.org/jira/browse/CURATOR-308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196385#comment-15196385
 ] 

Cameron McKenzie commented on CURATOR-308:
------------------------------------------

Actually, this is a bit more complicated.

[~randgalt] : SimpleDistributedQueue uses EnsureContainers on the root 
path. This is problematic because when the last piece of work in the queue is 
removed, the root path gets removed as well, and it won't be recreated until 
another piece of work is added. This exposes a race condition in take(): if 
the root node doesn't exist, take() never returns. It blocks waiting for 
children to appear under a path that doesn't exist, so the watch never fires.

So I guess there are two options to fix this:
- Give EnsureContainers an option to do the ensure via a persistent node 
  rather than a container node.
- Modify take() so that it can handle the root node disappearing.

I think it's probably best to do both. Currently, if the root node disappears 
for any reason, take() blocks forever, which isn't ideal. It should instead 
set a checkExists() watcher and wait for the node to come back.
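
To make the second option concrete, here is a minimal sketch of the retry 
logic, not the actual SimpleDistributedQueue code. FakeRoot is a hypothetical 
in-memory stand-in for the queue's root node, and its getChildren() and 
exists() methods are assumptions that only mimic the relevant ZooKeeper 
behaviour: a children watch cannot be set on a missing node, but an existence 
watch can. In Curator the corresponding calls would presumably be 
getChildren().usingWatcher(...) and checkExists().usingWatcher(...).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.concurrent.CountDownLatch;

final class RootAwareTake {
    /** Thrown when the root node is missing, mirroring ZooKeeper's NoNode error. */
    static final class NoNodeException extends Exception {}

    /** Hypothetical in-memory stand-in for the queue's root node (not Curator API). */
    static final class FakeRoot {
        private boolean exists = false;
        private final TreeSet<String> children = new TreeSet<>();
        private final List<Runnable> watchers = new ArrayList<>();

        /** Returns the children and sets a one-shot watch; a missing node sets no watch. */
        synchronized List<String> getChildren(Runnable watcher) throws NoNodeException {
            if (!exists) throw new NoNodeException();
            watchers.add(watcher);
            return new ArrayList<>(children);
        }

        /** Reports whether the root exists and sets a one-shot watch either way. */
        synchronized boolean exists(Runnable watcher) {
            watchers.add(watcher);
            return exists;
        }

        /** Adds a child, re-creating the root first if it had been cleaned up. */
        synchronized void offer(String name) {
            exists = true;
            children.add(name);
            fire();
        }

        /** Removes a child; returns false if another consumer already took it. */
        synchronized boolean remove(String name) {
            boolean removed = children.remove(name);
            if (removed) fire();
            return removed;
        }

        /** Simulates container cleanup: an empty root is deleted. */
        synchronized void cleanup() {
            if (exists && children.isEmpty()) {
                exists = false;
                fire();
            }
        }

        private void fire() {
            List<Runnable> pending = new ArrayList<>(watchers);
            watchers.clear();
            pending.forEach(Runnable::run);
        }
    }

    /** Blocks until an item is available, even if the root node is currently missing. */
    static String take(FakeRoot root) throws InterruptedException {
        while (true) {
            CountDownLatch changed = new CountDownLatch(1);
            try {
                List<String> children = root.getChildren(changed::countDown);
                if (!children.isEmpty()) {
                    String head = children.get(0);       // lowest name first
                    if (root.remove(head)) return head;
                    continue;                            // lost the race, re-read
                }
            } catch (NoNodeException e) {
                // Root is gone: fall back to an existence watch so we are woken
                // when it is re-created. If it already came back, just retry.
                if (root.exists(changed::countDown)) continue;
            }
            changed.await();                             // woken by either watch
        }
    }
}
```

The important part is the catch branch: without the existence watch, nothing 
would ever count the latch down once the root has been removed, which is the 
hang described above.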

Thoughts?


> SimpleDistributedQueue::take() hangs if container nodes are removed
> -------------------------------------------------------------------
>
>                 Key: CURATOR-308
>                 URL: https://issues.apache.org/jira/browse/CURATOR-308
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 3.1.0
>         Environment: org.apache.curator:curator-recipes 3.1.0
> org.apache.curator:curator-test 3.1.0
>            Reporter: Philip Searle
>         Attachments: TestSimpleDistributedQueueHang.java
>
>
> SimpleDistributedQueue creates the queue using container nodes if the 
> ZooKeeper instance supports this feature. If ZooKeeper runs the container 
> node cleanup task while SimpleDistributedQueue::take() is blocking, the call 
> will not ever return.
> A similar issue occurs when calling poll(), resulting in it delaying until 
> the timeout has elapsed, even if a queue item was inserted after the 
> container cleanup occurs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
