[
https://issues.apache.org/jira/browse/HBASE-12402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192744#comment-14192744
]
Enis Soztutar commented on HBASE-12402:
---------------------------------------
Thanks to [~jeffreyz] who helped find the issue.
ZKPermissionWatcher.start() is called from AccessController as part of region
initialization, and since the RS has just started up, it loads TableAuthManager.
{code}
public void start() throws KeeperException {
  watcher.registerListener(this);
  if (ZKUtil.watchAndCheckExists(watcher, aclZNode)) {
    List<ZKUtil.NodeAndData> existing =
        ZKUtil.getChildDataAndWatchForNewChildren(watcher, aclZNode);
    if (existing != null) {
      refreshNodes(existing);
    }
  }
  initialized.countDown();
}
{code}
Notice that we register the watcher on the parent znode first, and only do
the refresh afterwards. In this case, the watcher fired
(nodeChildrenChanged()) before refreshNodes() was called from the start()
thread. There seems to be no guard between the start() thread and the zk
event thread once the watchers are registered.
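To make the interleaving concrete, here is a minimal, self-contained sketch
that replays one ordering consistent with the logs on a single thread. It is
not the HBase code and not the committed fix: StaleRefreshSketch,
UnguardedCache, GuardedCache, and the integer version argument are all
hypothetical names made up for illustration. The guarded variant assumes each
update carries a monotonically increasing version and drops any update older
than what is already cached.

```java
import java.util.HashMap;
import java.util.Map;

public class StaleRefreshSketch {

    /** Unguarded cache: the last writer wins, even if its data is older. */
    static final class UnguardedCache {
        private final Map<String, String> perms = new HashMap<>();

        synchronized void update(String node, String data) {
            perms.put(node, data);
        }

        synchronized String get(String node) {
            return perms.get(node);
        }
    }

    /** Guarded cache (hypothetical): drops updates older than the cached one. */
    static final class GuardedCache {
        private final Map<String, String> perms = new HashMap<>();
        private final Map<String, Integer> versions = new HashMap<>();

        synchronized void update(String node, int version, String data) {
            Integer cached = versions.get(node);
            if (cached != null && cached > version) {
                return; // stale snapshot: keep the newer entry
            }
            versions.put(node, version);
            perms.put(node, data);
        }

        synchronized String get(String node) {
            return perms.get(node);
        }
    }

    /** Replays the racy ordering against the unguarded cache. */
    static String replayUnguarded() {
        UnguardedCache cache = new UnguardedCache();
        // 1. start() thread reads the children: loadtest_d1 has no ACL data yet.
        String snapshot = "";
        // 2. zk event thread delivers the new permissions.
        cache.update("loadtest_d1", "hrt_qa: RWXCA");
        // 3. start() thread applies its older snapshot and clobbers step 2.
        cache.update("loadtest_d1", snapshot);
        return cache.get("loadtest_d1");
    }

    /** Replays the same ordering against the version-guarded cache. */
    static String replayGuarded() {
        GuardedCache cache = new GuardedCache();
        String snapshot = "";   // read by the start() thread at version 0
        cache.update("loadtest_d1", 1, "hrt_qa: RWXCA"); // event thread, version 1
        cache.update("loadtest_d1", 0, snapshot);        // stale, ignored
        return cache.get("loadtest_d1");
    }

    public static void main(String[] args) {
        System.out.println("unguarded: '" + replayUnguarded() + "'");
        System.out.println("guarded:   '" + replayGuarded() + "'");
    }
}
```

In the unguarded replay the entry for loadtest_d1 ends up empty, which matches
the "with data: PBUF" line in the logs; in the guarded replay the RWXCA entry
survives. Serializing the read-and-apply step in start() against the event
thread would be another way to close the window.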
> ZKPermissionWatcher race condition in refreshing the cache leaving stale ACLs
> and causing AccessDenied
> ------------------------------------------------------------------------------------------------------
>
> Key: HBASE-12402
> URL: https://issues.apache.org/jira/browse/HBASE-12402
> Project: HBase
> Issue Type: Bug
> Reporter: Enis Soztutar
> Assignee: Enis Soztutar
> Fix For: 2.0.0, 0.98.8, 0.99.2
>
>
> In testing, we have seen an issue where a region in a newly created table
> will throw AccessDeniedException.
> There seems to be a race condition in the ZKPermissionWatcher when it is just
> starting up, and a new table is created around the same time.
> The master has just created the table and adds permissions to the acl table:
> {code}
> 2014-10-30 19:21:26,494 DEBUG
> [MASTER_TABLE_OPERATIONS-ip-172-31-32-87:60000-0] access.AccessControlLists:
> Writing permission with rowKey loadtest_d1 hrt_qa: RWXCA
> {code}
> One of the region servers is just starting:
> {code}
> Thu Oct 30 19:21:11 UTC 2014 Starting regionserver on ip-172-31-32-90
> 2014-10-30 19:21:13,915 INFO [main] util.VersionInfo: HBase
> 0.98.4.2.2.0.0-1194-hadoop2
> {code}
> The node creation event is received:
> {code}
> 2014-10-30 19:21:26,764 DEBUG [regionserver60020-EventThread]
> access.ZKPermissionWatcher: Updating permissions cache from node loadtest_d1
> with data:
> PBUF\x0A0\x0A\x06hrt_qa\x12&\x08\x03""\x0A\x16\x0A\x07default\x12\x0Bloadtest_d1
> \x00 \x01 \x02 \x03 \x04
> {code}
> which put the right data into the cache, only to be invalidated shortly afterwards:
> {code}
> ...
> 2014-10-30 19:21:26,855 DEBUG [RS_OPEN_REGION-ip-172-31-32-90:60020-1]
> access.ZKPermissionWatcher: Updating permissions cache from node
> tabletwo_copytable_cell_versions_two with data:
> PBUF\x0AI\x0A\x06hrt_qa\x12?\x08\x03";\x0A/\x0A\x07default\x12$tabletwo_copytable_cell_versions_two
> \x00 \x01 \x02 \x03 \x04
> 2014-10-30 19:21:26,856 DEBUG [RS_OPEN_REGION-ip-172-31-32-90:60020-1]
> access.ZKPermissionWatcher: Updating permissions cache from node loadtest_d1
> with data: PBUF
> 2014-10-30 19:21:26,856 DEBUG [RS_OPEN_REGION-ip-172-31-32-90:60020-1]
> access.ZKPermissionWatcher: Updating permissions cache from node
> tablefour_cell_version_snapshots_copy with data: PBUF
> ...
> {code}
> Notice that the threads are different. The first one is the zk event
> notification thread, while the other is the thread from OpenRegionHandler.