[ 
https://issues.apache.org/jira/browse/CURATOR-205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen Ingram updated CURATOR-205:
-----------------------------------
    Description: 
When an InterruptedException is thrown during the internalLockLoop that is 
called during mutex.acquire, internalLockLoop sets a flag "doDelete", which 
tells its finally clause to delete the lock path that we were trying to 
create.

However, in the pathInForeground function of DeleteBuilderImpl, a _second_ 
InterruptedException may occur before ZooKeeper can delete the specified path.  
The RetryLoop machinery in that function retries only exceptions it treats as 
retryable, a class which does not include InterruptedException.  

The second InterruptedException then causes pathInForeground to exit without 
deleting the path, leading to a deadlock where no one can acquire the mutex.
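
The sequence above can be sketched in isolation.  This is a minimal model, not 
Curator's code: the in-memory "nodes" set stands in for the ZooKeeper tree, and 
acquire, deleteInForeground, leaksNode and the path "/lock/n-0001" are 
hypothetical names chosen for illustration.  It only shows how a second 
interrupt inside the finally clause can leave the lock node behind.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class InterruptedDeleteDemo {
    // Hypothetical in-memory stand-in for the ZooKeeper node tree.
    static final Set<String> nodes = ConcurrentHashMap.newKeySet();

    // Stand-in for a foreground delete: fails if the calling thread has been
    // interrupted, otherwise removes the node.
    static void deleteInForeground(String path) throws InterruptedException {
        if (Thread.interrupted()) {
            throw new InterruptedException("interrupted before delete of " + path);
        }
        nodes.remove(path);
    }

    // Mirrors the shape described above: create the lock node, get interrupted
    // while waiting, then try to clean up in a finally clause.
    static void acquire(String path, boolean interruptDuringCleanup)
            throws InterruptedException {
        boolean doDelete = false;
        nodes.add(path);
        try {
            throw new InterruptedException("first interrupt, while waiting for the lock");
        } catch (InterruptedException e) {
            doDelete = true;
            throw e;
        } finally {
            if (doDelete) {
                if (interruptDuringCleanup) {
                    // Simulate the second interrupt arriving before the delete.
                    Thread.currentThread().interrupt();
                }
                // May throw and leave the node behind; nothing retries it.
                deleteInForeground(path);
            }
        }
    }

    // Returns true if the lock node is still present after a failed acquire.
    static boolean leaksNode(boolean interruptDuringCleanup) {
        nodes.clear();
        try {
            acquire("/lock/n-0001", interruptDuringCleanup);
        } catch (InterruptedException expected) {
            // The caller sees an InterruptedException in both cases.
        }
        return nodes.contains("/lock/n-0001");
    }

    public static void main(String[] args) {
        System.out.println("one interrupt leaks node:  " + leaksNode(false));
        System.out.println("two interrupts leak node:  " + leaksNode(true));
    }
}
```

In this model the caller cannot tell the two cases apart from the exception 
alone; only the leftover node differs.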

In my test, I am certain that both of these InterruptedExceptions are due to 
repeated fluctuation in the ConnectionStateManager's connection state.  When 
the state ceases to fluctuate, no leader can be selected due to the persistence 
of the node we failed to delete.

I was able to address this bug with a solution similar to CURATOR-45:  if the 
pathInForeground function is interrupted with an InterruptedException, I 
schedule a BackgroundCallback to attempt pathInForeground again.  Once the 
connection is stable, this task is able to delete the path, and the mutex is 
acquired by the new leader.
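
A rough sketch of that retry idea, under stated assumptions: it is not the 
actual patch, and it uses a plain ScheduledExecutorService as a stand-in for 
Curator's background machinery.  The "nodes" set, deleteWithBackgroundRetry, 
retry, demo and the 10 ms delay are all hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class BackgroundDeleteRetry {
    // Hypothetical in-memory stand-in for the ZooKeeper node tree.
    static final Set<String> nodes = ConcurrentHashMap.newKeySet();

    // Stand-in for a foreground delete that fails when the thread is interrupted.
    static void deleteInForeground(String path) throws InterruptedException {
        if (Thread.interrupted()) {
            throw new InterruptedException("interrupted before delete of " + path);
        }
        nodes.remove(path);
    }

    // Try the delete on the calling thread; if it is interrupted, hand the
    // delete to a background thread and restore the caller's interrupt flag.
    static void deleteWithBackgroundRetry(String path, ScheduledExecutorService background) {
        try {
            deleteInForeground(path);
        } catch (InterruptedException e) {
            background.schedule(() -> retry(path, background), 10, TimeUnit.MILLISECONDS);
            Thread.currentThread().interrupt();
        }
    }

    // Background attempt; reschedules itself if it is interrupted as well.
    static void retry(String path, ScheduledExecutorService background) {
        try {
            deleteInForeground(path);
        } catch (InterruptedException e) {
            background.schedule(() -> retry(path, background), 10, TimeUnit.MILLISECONDS);
        }
    }

    // Returns true if the node was eventually deleted despite the interrupt.
    static boolean demo() {
        ScheduledExecutorService background = Executors.newSingleThreadScheduledExecutor();
        String path = "/lock/n-0001";
        nodes.add(path);
        Thread.currentThread().interrupt();   // simulate the second interrupt
        deleteWithBackgroundRetry(path, background);
        Thread.interrupted();                 // clear the flag so the demo can wait
        background.shutdown();                // already-scheduled delayed task still runs
        try {
            background.awaitTermination(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            return false;
        }
        return !nodes.contains(path);
    }

    public static void main(String[] args) {
        System.out.println("node deleted by background retry: " + demo());
    }
}
```

The design point is that the cleanup no longer depends on the interrupted 
thread: the delete is attempted again from a thread that is not subject to 
the same interrupt.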

I have a repro and a fix if needed.


> Repeated InterruptedExceptions during mutex acquire leads to LeaderSelector 
> deadlock
> ------------------------------------------------------------------------------------
>
>                 Key: CURATOR-205
>                 URL: https://issues.apache.org/jira/browse/CURATOR-205
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 2.7.2
>            Reporter: Stephen Ingram



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
