On Thu, Jun 2, 2016 at 1:28 PM, Jordan Zimmerman <[email protected]> wrote:
> I believe there are two things going on:
>
> 1) This test uses the infinite versions of the APIs. For some reason, either
> the internal lock or the semaphore code is getting stuck in wait() when
> there’s a network outage and never wakes up. I have some theories I’m
> working on.
>
> 2) This is in the category of “How Did it Ever Work”. I’m cc’ing Ben Bangert
> because it was his algorithm I used for InterProcessSemaphoreV2 and I want
> to run this past him. In the current implementation
> (https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/locks/InterProcessSemaphoreV2.java
> 363-371), it seems to me that if there are more waiters on semaphores than
> there are available semaphores, it will wait infinitely. My solution is to
> sort the ZNode children and if the index of the acquiring client is less
> than the number of configured max leases, give that client the lease and be
> done. E.g.

I'm not sure how the Curator version works; I can only go over how the Python Kazoo client works, and it's been a while, so I had to refresh my memory from the code.

In Kazoo, there's a lock node for a given semaphore and a lease pool node, which has one ephemeral child node per lease holder. The only client allowed to add its ephemeral node to the lease pool node is the lock holder. Clients that have already acquired a lease may delete their node at any time to release it.

The lock works per the standard lock recipe, so all lock waiters are in line and wake in order, which is what makes lease acquisition fair.

The client that acquires the lock gets to create a lease node, unless there are currently as many lease child nodes as the lease pool node indicates are allowed. In that case, it sets a watch on the lease pool node and waits for a lease child to go away. (This was a crucial difference from Curator, which had nodes watching specific lease-holding nodes in a sorted line of some sort, resulting in possible lease starvation, afaik.)

There should be no indefinite waiting, since as soon as a lease node is deleted, the lock holder wakes and gets to create its node (and in my tests it does so).

It sounds like Curator is using a different algorithm, since it has nodes sorting their position to determine whether they have a lease.

Cheers,
Ben
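
For anyone following along, here is a rough in-memory sketch of the lock-plus-lease-pool flow described above. It is not Kazoo's code and does not talk to ZooKeeper: the class name `LeasePoolSemaphore`, the helper `run_demo`, and the use of `threading` primitives in place of znodes and watches are all assumptions made for illustration. In particular, `threading.Lock` does not queue waiters in order the way the lock recipe does, so this shows only the "one lock holder waits on the pool" part, not the fairness.

```python
import threading
import time


class LeasePoolSemaphore:
    """Hypothetical in-memory stand-in for a lock node plus a lease pool node."""

    def __init__(self, max_leases):
        self.max_leases = max_leases
        # Stands in for the lock recipe (assumption: no FIFO ordering here).
        self._lock = threading.Lock()
        # Stands in for the lease pool node and the watch set on it.
        self._pool = threading.Condition()
        # Stands in for the ephemeral lease children.
        self._holders = set()
        # Highest number of simultaneous lease holders seen (demo only).
        self.peak = 0

    def acquire(self, client_id):
        # Only the lock holder is allowed to add a child to the lease pool.
        with self._lock:
            with self._pool:
                # Pool is full: wait on the pool as a whole, not on one
                # specific lease holder, until any lease child goes away.
                while len(self._holders) >= self.max_leases:
                    self._pool.wait()
                self._holders.add(client_id)
                self.peak = max(self.peak, len(self._holders))

    def release(self, client_id):
        # Lease holders may drop their lease at any time, without the lock.
        with self._pool:
            self._holders.discard(client_id)
            self._pool.notify_all()


def run_demo(num_clients=5, max_leases=2):
    """Run more clients than leases and report who finished and the peak."""
    sem = LeasePoolSemaphore(max_leases)
    done = []

    def worker(client_id):
        sem.acquire(client_id)
        time.sleep(0.01)  # pretend to do work while holding the lease
        sem.release(client_id)
        done.append(client_id)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=5)
    return sorted(done), sem.peak


if __name__ == "__main__":
    finished, peak = run_demo()
    print("finished:", finished, "peak holders:", peak)
```

With more clients than leases, every client should still finish, because whoever holds the lock is woken by any release rather than by one particular holder going away.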
