Re: FYI - interesting/important Bug

Jordan Zimmerman Fri, 14 Dec 2018 10:22:20 -0800

Wow - 3 successful runs in a row on Jenkins. That's something we haven't seen 
with Curator for years. This is looking good.


> On Dec 14, 2018, at 7:52 AM, Jordan Zimmerman <[email protected]> 
> wrote:
> 
> Here's the PR for this:
> 
>       https://github.com/apache/curator/pull/297
> 
> I've run the tests a few times and it seems good. Another pair of eyes or two 
> would be appreciated.
> 
> -Jordan
> 
>> On Dec 13, 2018, at 2:09 PM, Jordan Zimmerman <[email protected]> 
>> wrote:
>> 
>> FYI
>> 
>> I was trying to figure out why TestInterProcessSemaphoreMutex is so flakey. 
>> In debugging the problem I've come across a fairly serious edge case in how 
>> several of our recipes work. You can see the issue in 
>> InterProcessSemaphoreV2 (which is what InterProcessSemaphoreMutex uses 
>> internally). Have a look here:
>> 
>>      
>> https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/locks/InterProcessSemaphoreV2.java#L373
>> 
>> The code synchronizes and then does "client.getChildren()...". This is where 
>> the problem is. If there are connection problems inside of getChildren() the 
>> retry policy will do configured sleeping, retries, etc. Importantly, this is 
>> all done while the thread doing the retires holds InterProcessSemaphoreV2's 
>> monitor. If the ZK connection is repaired past the session timeout, ZK will 
>> eventually call InterProcessSemaphoreV2's watcher with an Expired message. 
>> InterProcessSemaphoreV2's watcher calls this method:
>> 
>>      private synchronized void notifyFromWatcher()
>>      {
>>          notifyAll();
>>      }
>> 
>> You can see that this is a race. The thread doing "getChildren" is holding 
>> the monitor and is in a retry loop waiting for the connection to be 
>> repaired. However, ZK's event loop is trying to obtain that same monitor as 
>> a result of trying to call the synchronized notifyFromWatcher(). This means 
>> that the retry policy will always fail because ZK's event loop is tied up 
>> until that thread exists. Worse still, if someone were to use a retry policy 
>> of "RetryForever" they'd have a deadlock. 
>> 
>> This pattern is in about 10 files or so. I'm trying to think of a 
>> workaround. One possibility is to use a separate thread for this type of 
>> notification. i.e. notifyFromWatcher() would just signal another thread that 
>> the notifyAll() needs to be called. This would unblock ZK's event thread so 
>> that things can progress. I'll play around with this.
>> 
>> If anyone has a better idea or thoughts I'd appreciate it.
>> 
>> -Jordan
>> 
>

Re: FYI - interesting/important Bug

Reply via email to