[ 
https://issues.apache.org/jira/browse/HBASE-14370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735217#comment-14735217
 ] 

Enis Soztutar commented on HBASE-14370:
---------------------------------------

It is standard practice to fork off a thread to do the work in the zk client 
since the zk notification is handled by a single thread. If a zk watcher runs 
for a very long time, other zk notifications are just blocked waiting for the 
event notification thread. We have to be careful though because the ordering of 
the execution for these should follow the zk notification order. 

In the case Ted mentioned, even with HBASE-12635 fixed, the 
{{ZKPermissionWatcher#refreshNodes()}} does a getData() for every 2000+ child 
znode which is quite costly. The other notifications (like assignment 
notifications) are waiting, and causing assignment to take a very long time. 

> Use separate thread for calling ZKPermissionWatcher#refreshNodes()
> ------------------------------------------------------------------
>
>                 Key: HBASE-14370
>                 URL: https://issues.apache.org/jira/browse/HBASE-14370
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.98.0
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>         Attachments: 14370-v1.txt
>
>
> I came off a support case (0.98.0) where main zk thread was seen doing the 
> following:
> {code}
>   at 
> org.apache.hadoop.hbase.security.access.ZKPermissionWatcher.refreshAuthManager(ZKPermissionWatcher.java:152)
>   at 
> org.apache.hadoop.hbase.security.access.ZKPermissionWatcher.refreshNodes(ZKPermissionWatcher.java:135)
>   at 
> org.apache.hadoop.hbase.security.access.ZKPermissionWatcher.nodeChildrenChanged(ZKPermissionWatcher.java:121)
>   at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:348)
>   at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
>   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
> {code}
> There were 62000 nodes under /acl due to lack of fix from HBASE-12635, 
> leading to slowness in table creation because zk notification for region 
> offline was blocked by the above.
> The attached patch separates refreshNodes() call into its own thread.
> Thanks to Enis and Devaraj for offline discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to