[ 
https://issues.apache.org/jira/browse/ACCUMULO-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915862#comment-13915862
 ] 

Bill Havanki commented on ACCUMULO-2422:
----------------------------------------

What prevents it is that the master "gets" the lock if it has the lock node 
with the lowest sequential number, as assigned by ZooKeeper. So, extending my 
example above, the first master originally had the lock with node 206. The 
second master got 207, but noticed that 206 existed already so it set up a 
watch on it. So far, so good.
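
To make the rule concrete, here is a minimal sketch of the decision each master makes when it looks at the lock nodes. It is not the Accumulo code: the node names (`zlock-0000000206`) and the function names are made up for illustration, and it assumes each waiter watches the node just below its own, as in the usual ZooKeeper lock recipe.

```python
def seq(node_name):
    """Extract the trailing sequence number from a (hypothetical) lock node name."""
    return int(node_name.rsplit("-", 1)[1])


def lock_decision(children, mine):
    """Decide what the owner of node `mine` should do.

    Returns ("acquired", None) if `mine` has the lowest sequence number,
    otherwise ("watch", predecessor), where predecessor is the node with the
    next-lower sequence number (an assumption of this sketch).
    """
    ordered = sorted(children, key=seq)
    idx = ordered.index(mine)
    if idx == 0:
        return ("acquired", None)
    return ("watch", ordered[idx - 1])


children = ["zlock-0000000207", "zlock-0000000206"]
print(lock_decision(children, "zlock-0000000206"))  # ('acquired', None)
print(lock_decision(children, "zlock-0000000207"))  # ('watch', 'zlock-0000000206')
```

With nodes 206 and 207 present, the owner of 206 holds the lock and the owner of 207 sets a watch on 206, which matches the example above.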

Normally, when the first master exits, the second one gets the deletion event 
and gets the lock. But in this scenario, the second master gets a node-change 
event instead. ZooKeeper watches fire only once, so that event uses up the 
watch; since the watch is not set again, the second master will never be 
notified again. Now, the first master exits, so all that is left is node 207. 
The second master doesn't get the lock; it just waits and waits forever.

The first master restarts and gets node 208. It sees that the second master has 
207, so it sets up a watch on it, assuming that the second master has the lock. 
So, it doesn't get the lock either. It waits and waits forever.
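
The whole sequence can be replayed with a small in-memory model. This is only a sketch under stated assumptions: `FakeZk` and `Master` are hypothetical stand-ins, not ZooKeeper or Accumulo classes, and the `rewatch_on_change` flag shows one possible remedy (setting the watch again after a non-deletion event), not necessarily the fix that will be applied.

```python
class FakeZk:
    """Tiny in-memory stand-in for ZooKeeper; a watch fires once and is gone."""

    def __init__(self):
        self.nodes = set()   # sequence numbers of existing lock nodes
        self.watches = {}    # sequence number -> list of callbacks

    def create(self, seq):
        self.nodes.add(seq)

    def watch(self, seq, callback):
        self.watches.setdefault(seq, []).append(callback)

    def _fire(self, seq, event):
        # One-shot semantics: drop the watches before delivering the event.
        for callback in self.watches.pop(seq, []):
            callback(event, seq)

    def set_data(self, seq):
        self._fire(seq, "changed")

    def delete(self, seq):
        self.nodes.discard(seq)
        self._fire(seq, "deleted")


class Master:
    def __init__(self, zk, seq, rewatch_on_change):
        self.zk, self.seq, self.rewatch = zk, seq, rewatch_on_change
        self.has_lock = False
        zk.create(seq)
        self.try_lock()

    def try_lock(self):
        lower = [n for n in self.zk.nodes if n < self.seq]
        if not lower:
            self.has_lock = True                       # lowest node: lock acquired
        else:
            self.zk.watch(max(lower), self.on_event)   # wait on the predecessor

    def on_event(self, event, seq):
        if event == "deleted" or self.rewatch:
            self.try_lock()
        # Otherwise the watch was consumed and is never set again.


def run(rewatch_on_change):
    zk = FakeZk()
    Master(zk, 206, rewatch_on_change)               # first master, holds the lock
    second = Master(zk, 207, rewatch_on_change)      # watches 206
    zk.set_data(206)                                 # node-change event uses up the watch
    zk.delete(206)                                   # first master exits
    restarted = Master(zk, 208, rewatch_on_change)   # watches 207
    return second.has_lock, restarted.has_lock


print(run(False))  # (False, False): nobody holds the lock
print(run(True))   # (True, False): second master takes over, 208 waits its turn
```

In the first run neither master ever acquires the lock, which is the hang described above. In the second run the master with node 207 acquires it, and the restarted master is correctly left waiting behind it.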

> Backup master can miss acquiring lock when primary exits
> --------------------------------------------------------
>
>                 Key: ACCUMULO-2422
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2422
>             Project: Accumulo
>          Issue Type: Bug
>          Components: fate, master
>    Affects Versions: 1.5.0
>            Reporter: Bill Havanki
>            Assignee: Bill Havanki
>            Priority: Critical
>              Labels: failover, locking
>
> While running randomwalk tests with agitation for the 1.5.1 release, I've 
> seen situations where a backup master that is eligible to grab the master 
> lock continues to wait. When this condition arises and the other master 
> restarts, both wait for the lock without success.
> I cannot reproduce the problem reliably, and I think more investigation is 
> needed to see what circumstances could be causing the problem.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
