[
https://issues.apache.org/jira/browse/ACCUMULO-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915846#comment-13915846
]
Bill Havanki commented on ACCUMULO-2422:
----------------------------------------
I might have figured it out, though I still need to prove it.
The "losing" master server sets a ZK watch on the "winning" server's lock node,
so that when it disappears it can grab the lock. However, ZK watches are only
good for _one event_
([reference|http://zookeeper.apache.org/doc/r3.2.1/zookeeperProgrammers.html#ch_zkWatches]).
If something else happens to the node before it is deleted, the watch fires for
that earlier event and is consumed, so no event is ever delivered for the deletion.
Once a master gets a lock, it replaces its lock node's data when it determines
its port (see ACCUMULO-1664 and ACCUMULO-1999). This triggers a NodeDataChanged
event. Example:
{noformat}
2014-02-27 18:43:15,141 [zookeeper.ZooLock] DEBUG: - type NodeDataChanged
2014-02-27 18:43:15,141 [zookeeper.ZooLock] DEBUG: - path
/accumulo/cdeab4df-78e3-4c7f-897b-92f4d98f9602/masters/lock/zlock-0000000206
2014-02-27 18:43:15,141 [zookeeper.ZooLock] DEBUG: - state SyncConnected
{noformat}
This event is sent to the other master's watcher, which does nothing with it,
and with that the one-shot watch is spent. So, the losing master never gets a
NodeDeleted event later to let it grab the lock. The way to fix this is to set
a new watch after handling any event other than the deletion.
This scenario is difficult to reproduce because both masters need to be started
almost simultaneously, and the losing master must set its watch between when the
winning master creates its lock node and when it replaces the node's data. I'm
going to try to trigger it by making the winning master delay the replacement.
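The one-shot watch behavior and the proposed fix can be sketched with a toy
simulation (hypothetical class and method names; this is not the real ZooKeeper
API, and the lock path is illustrative):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

public class WatchDemo {
    enum EventType { NodeDataChanged, NodeDeleted }

    /** Toy stand-in for ZooKeeper's one-shot watch table (not the real API). */
    static class OneShotWatches {
        private final Map<String, Consumer<EventType>> watches = new HashMap<>();
        void setWatch(String path, Consumer<EventType> w) { watches.put(path, w); }
        void fire(String path, EventType type) {
            Consumer<EventType> w = watches.remove(path); // one-shot: gone after delivery
            if (w != null) w.accept(type);
        }
    }

    /**
     * Simulates the sequence from the comment: the winner's lock node first has
     * its data replaced (NodeDataChanged), then is deleted (NodeDeleted).
     * Returns whether the losing master's watcher observed the deletion.
     */
    @SuppressWarnings("unchecked")
    static boolean simulate(boolean reRegisterAfterOtherEvents) {
        OneShotWatches zk = new OneShotWatches();
        String lockNode = "/masters/lock/zlock-0000000206"; // illustrative path
        boolean[] sawDelete = {false};

        Consumer<EventType>[] watcher = new Consumer[1];
        watcher[0] = ev -> {
            if (ev == EventType.NodeDeleted) {
                sawDelete[0] = true;               // would now try to grab the lock
            } else if (reRegisterAfterOtherEvents) {
                zk.setWatch(lockNode, watcher[0]); // the fix: set a new watch
            }
        };
        zk.setWatch(lockNode, watcher[0]);

        zk.fire(lockNode, EventType.NodeDataChanged); // port written into node data
        zk.fire(lockNode, EventType.NodeDeleted);     // winning master goes away
        return sawDelete[0];
    }

    public static void main(String[] args) {
        System.out.println("without re-register, saw delete: " + simulate(false));
        System.out.println("with re-register, saw delete: " + simulate(true));
    }
}
```

Without re-registering, the NodeDataChanged event consumes the only watch and
the deletion goes unobserved; re-setting the watch on every non-delete event
restores the failover.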
> Backup master can miss acquiring lock when primary exits
> --------------------------------------------------------
>
> Key: ACCUMULO-2422
> URL: https://issues.apache.org/jira/browse/ACCUMULO-2422
> Project: Accumulo
> Issue Type: Bug
> Components: fate, master
> Affects Versions: 1.5.0
> Reporter: Bill Havanki
> Assignee: Bill Havanki
> Priority: Critical
> Labels: failover, locking
>
> While running randomwalk tests with agitation for the 1.5.1 release, I've
> seen situations where a backup master that is eligible to grab the master
> lock continues to wait. When this condition arises and the other master
> restarts, both wait for the lock without success.
> I cannot reproduce the problem reliably, and I think more investigation is
> needed to see what circumstances could be causing the problem.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)