[
https://issues.apache.org/jira/browse/ACCUMULO-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Keith Turner resolved ACCUMULO-954.
-----------------------------------
Resolution: Fixed
> ZooLock watcher can stop watching
> ---------------------------------
>
> Key: ACCUMULO-954
> URL: https://issues.apache.org/jira/browse/ACCUMULO-954
> Project: Accumulo
> Issue Type: Bug
> Components: tserver
> Affects Versions: 1.4.2
> Reporter: Adam Fuchs
> Assignee: Keith Turner
> Priority: Minor
> Fix For: 1.5.0, 1.4.3
>
>
> When the ZooLock watcher stops watching, tablet servers fail to recognize
> that they have lost their locks. I think the worst that can happen is that a
> tablet server fails to die after it loses its lock, which could bog down
> clients and create a lot of noise in the cluster. I believe it could also
> generate useless files that would never be garbage collected. !METADATA table
> write protections and logger write protections should prevent any permanent
> damage or data loss. We have seen this produce warnings and errors that look
> like multiple hosting of tablets.
> {code}
> 2013-01-09 19:59:27,742 [tabletserver.TabletServer] INFO : port = 9997
> 2013-01-09 19:59:27,926 [zookeeper.ZooLock] DEBUG: event /accumulo/655f93d8-20fc-451f-a457-458b5717a11e/tservers/172.16.2.25:9997 NodeDeleted SyncConnected
> 2013-01-09 19:59:27,931 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2013-01-09 19:59:32,943 [tabletserver.TabletServer] DEBUG: Obtained tablet server lock /accumulo/655f93d8-20fc-451f-a457-458b5717a11e/tservers/172.16.2.25:9997/zlock-0000000000
> 2013-01-09 19:59:36,703 [tabletserver.TabletServer] DEBUG: Got loadTablet message from user: !SYSTEM
> {code}
> Here's what happened:
> 1. Tablet server fails to get lock, triggering the watcher on the parent node.
> 2. Watcher doesn't get reset, and doesn't take any action.
> 3. Loop in TabletServer:~2659 retries, but uses the same ZooLock object.
> 4. TabletServer loses its lock, but receives a connection loss message before
> the NodeDeleted message.
> 5. TabletServer continues to try to do work instead of killing itself.
> We could probably patch this for 1.4 by creating the ZooLock inside the
> announceExistence loop instead of reusing the same one. Eventually, we ought
> to have an else branch in both of the Watchers that either resets the watch
> (resilient against ZooKeeper connection hiccups) or just kills the server to
> be safe.
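
The proposed else branch can be illustrated with a small, self-contained sketch. It does not use the real ZooKeeper or Accumulo classes: FakeZk, LockHolder, the OTHER event type, and the lock path are all hypothetical stand-ins. The only assumption it models is that a watch is one-shot, so a watcher that ignores an event without re-registering stops watching.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

/** Illustrative model only; none of these types are the real ZooKeeper or Accumulo API. */
final class ZooLockWatchSketch {

    /** OTHER stands for any event the lock watcher does not act on. */
    enum EventType { OTHER, NODE_DELETED }

    /** Hypothetical stand-in for a ZooKeeper session with one-shot watches. */
    static final class FakeZk {
        private final Map<String, Consumer<EventType>> watches = new HashMap<>();

        void watch(String path, Consumer<EventType> watcher) {
            watches.put(path, watcher);
        }

        /** Delivers the event to the registered watcher, consuming the watch. */
        void fire(String path, EventType type) {
            Consumer<EventType> watcher = watches.remove(path);
            if (watcher != null) {
                watcher.accept(type);
            }
        }
    }

    /** Hypothetical lock holder whose watcher optionally has the else branch. */
    static final class LockHolder {
        private final FakeZk zk;
        private final String lockPath;
        private final boolean rearmOnOtherEvents;
        boolean lockLost = false;

        LockHolder(FakeZk zk, String lockPath, boolean rearmOnOtherEvents) {
            this.zk = zk;
            this.lockPath = lockPath;
            this.rearmOnOtherEvents = rearmOnOtherEvents;
            zk.watch(lockPath, this::process);
        }

        void process(EventType type) {
            if (type == EventType.NODE_DELETED) {
                lockLost = true; // a real server would shut itself down here
            } else if (rearmOnOtherEvents) {
                zk.watch(lockPath, this::process); // the else branch: reset the watch
            }
            // Without the else branch the watch is gone and nothing is watching.
        }
    }

    /** Returns whether the holder notices the deletion after an unrelated event. */
    static boolean detectsLockLoss(boolean rearmOnOtherEvents) {
        FakeZk zk = new FakeZk();
        String path = "/tservers/example:9997/zlock-0000000000"; // hypothetical path
        LockHolder holder = new LockHolder(zk, path, rearmOnOtherEvents);
        zk.fire(path, EventType.OTHER);        // unrelated event consumes the watch
        zk.fire(path, EventType.NODE_DELETED); // lock node goes away
        return holder.lockLost;
    }

    public static void main(String[] args) {
        System.out.println("without else branch: " + detectsLockLoss(false));
        System.out.println("with else branch:    " + detectsLockLoss(true));
    }
}
```

In this model the watcher without the else branch never sees the deletion, while the one that resets the watch does. Killing the server in the else branch, the other option mentioned above, would replace the re-registration with a shutdown call.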