Stephen Ingram created CURATOR-205:
--------------------------------------
Summary: Repeated InterruptedExceptions during mutex acquire lead
to LeaderSelector deadlock
Key: CURATOR-205
URL: https://issues.apache.org/jira/browse/CURATOR-205
Project: Apache Curator
Issue Type: Bug
Components: Recipes
Affects Versions: 2.7.2
Reporter: Stephen Ingram
When an InterruptedException is thrown inside internalLockLoop, which is
called during mutex.acquire, internalLockLoop sets a "doDelete" flag. That
flag tells the finally clause to delete the lock path that we are trying to
create.
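
The cleanup pattern described above can be sketched roughly as follows. This is a simplified model, not the Curator source: the class LockLoopSketch and the Waiter and PathDeleter interfaces are hypothetical stand-ins for the wait on the lock node and the ZooKeeper delete call.

```java
public class LockLoopSketch {
    /** Hypothetical stand-in for waiting until our node holds the lock. */
    interface Waiter { void await() throws InterruptedException; }

    /** Hypothetical stand-in for the ZooKeeper delete of our lock node. */
    interface PathDeleter { void delete(String path) throws Exception; }

    static boolean lockLoop(String ourPath, Waiter waiter, PathDeleter deleter)
            throws Exception {
        boolean haveTheLock = false;
        boolean doDelete = false;
        try {
            waiter.await();          // may be interrupted while waiting
            haveTheLock = true;
        } catch (Exception e) {
            doDelete = true;         // remember that our node must be cleaned up
            throw e;
        } finally {
            if (doDelete) {
                // If this call throws as well, the node is left behind.
                deleter.delete(ourPath);
            }
        }
        return haveTheLock;
    }
}
```

In this model the lock node is removed only if the delete inside the finally clause completes normally.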
However, in the pathInForeground function of DeleteBuilderImpl, a _second_
InterruptedException may occur before ZooKeeper can delete the specified path.
The RetryLoop machinery in that function retries only when the exception is
retryable, a category that does not include InterruptedException.
The second InterruptedException then causes pathInForeground to exit without
deleting the path, leading to a deadlock in which no one can acquire the
mutex.
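
To illustrate why the delete is abandoned, here is a minimal sketch of a retry loop that retries only on a designated retryable exception type. It is not Curator's RetryLoop: the names RetrySketch, RetryableException, and callWithRetry(op, maxRetries) are assumptions made for this example.

```java
import java.util.concurrent.Callable;

public class RetrySketch {
    /** Hypothetical marker for errors the loop treats as retryable. */
    static class RetryableException extends Exception {
        RetryableException(String message) { super(message); }
    }

    static <T> T callWithRetry(Callable<T> op, int maxRetries) throws Exception {
        int retries = 0;
        while (true) {
            try {
                return op.call();
            } catch (RetryableException e) {
                // Only this exception type leads to another attempt.
                if (++retries > maxRetries) {
                    throw e;
                }
            }
            // Anything else, including InterruptedException, is not caught
            // here and leaves the loop after a single attempt.
        }
    }
}
```

With a loop of this shape, an operation that throws InterruptedException on its first attempt is never tried again.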
In my test, I am certain that both of these InterruptedExceptions are due to
repeated fluctuation in the ConnectionStateManager's connection state. When
the state ceases to fluctuate, no leader can be selected due to the persistence
of the node we failed to delete.
I was able to address this bug with a solution similar to CURATOR-45: if
pathInForeground is interrupted with an InterruptedException, I schedule a
BackgroundCallback that attempts pathInForeground again. Once the connection
is stable, this task is able to delete the path, and the mutex is then
acquired by the new leader.
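
The general shape of that workaround can be sketched with plain java.util.concurrent types. This is an illustration under assumptions, not the actual patch: BackgroundDeleteSketch, PathDeleter, and deleteOrDefer are hypothetical names, and an ExecutorService stands in for Curator's background machinery.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

public class BackgroundDeleteSketch {
    /** Hypothetical stand-in for the foreground ZooKeeper delete. */
    interface PathDeleter { void delete(String path) throws Exception; }

    static Future<?> deleteOrDefer(String path, PathDeleter deleter,
                                   ExecutorService background) throws Exception {
        try {
            deleter.delete(path);                 // normal foreground attempt
            return CompletableFuture.completedFuture(null);
        } catch (InterruptedException e) {
            // Hand the delete to a background thread instead of giving up.
            Future<?> pending = background.submit(() -> {
                deleter.delete(path);
                return null;
            });
            // Restore the interrupt status for the calling thread.
            Thread.currentThread().interrupt();
            return pending;
        }
    }
}
```

This sketch makes only one background attempt. A fuller version would probably keep retrying until the delete succeeds or the node is confirmed gone.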