[
https://issues.apache.org/jira/browse/HBASE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008613#comment-13008613
]
Jean-Daniel Cryans commented on HBASE-3668:
-------------------------------------------
bq. The master eventually tries to assign the region somewhere else where it
may work, but then the balancer will screw everything up again.
What I meant by that is that the cluster was able to reach a "stable" state
once the defective region servers were finally stripped of every region they
had been assigned, but then the balancer tries to even out the distribution,
so the cycle repeats. In essence, the balancer isn't doing anything wrong... it
couldn't know that those RS were in a bad state.
> CatalogTracker.waitForMeta can wait forever and totally stall a RS
> ------------------------------------------------------------------
>
> Key: HBASE-3668
> URL: https://issues.apache.org/jira/browse/HBASE-3668
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.90.1
> Reporter: Jean-Daniel Cryans
> Assignee: Jean-Daniel Cryans
> Priority: Blocker
> Fix For: 0.90.2
>
>
> We got into a weird situation after hitting HBASE-3664 that left almost half
> of the region servers unable to open regions (we didn't kill the master, but
> all RS were restarted).
> Here's the relevant jstack from the region servers:
> {code}
> "regionserver60020-EventThread" daemon prio=10 tid=0x0000000041297800
> nid=0x5c5 in Object.wait() [0x00007fbd2b7f6000]
> java.lang.Thread.State: TIMED_WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> - waiting on <0x00007fbd4e1ac6d0> (a
> java.util.concurrent.atomic.AtomicBoolean)
> at
> org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:328)
> - locked <0x00007fbd4e1ac6d0> (a
> java.util.concurrent.atomic.AtomicBoolean)
> at
> org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:363)
> at
> org.apache.hadoop.hbase.zookeeper.MetaNodeTracker.nodeDeleted(MetaNodeTracker.java:64)
> at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.nodeCreated(ZooKeeperNodeTracker.java:154)
> - locked <0x00007fbd4d7ab060> (a
> org.apache.hadoop.hbase.zookeeper.MetaNodeTracker)
> at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.nodeDataChanged(ZooKeeperNodeTracker.java:179)
> - locked <0x00007fbd4d7ab060> (a
> org.apache.hadoop.hbase.zookeeper.MetaNodeTracker)
> at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:268)
> at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
> {code}
> Every server in that state can no longer process any ZK event. Reading the
> stack from the bottom up: we first handle a nodeDataChanged, which calls into
> nodeCreated, which in turn ends up in nodeDeleted. This correlates with the
> logs:
> {code}
> 2011-03-17 17:57:03,636 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: regionserver:60020-0x12d627b723e081f Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, path=/hbase/unassigned/1028785192
> 2011-03-17 17:57:03,636 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: regionserver:60020-0x12d627b723e081f Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-03-17 17:57:03,637 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: regionserver:60020-0x12d627b723e081f Set watcher on existing znode /hbase/unassigned/1028785192
> 2011-03-17 17:57:03,637 INFO org.apache.hadoop.hbase.zookeeper.MetaNodeTracker: Detected completed assignment of META, notifying catalog tracker
> 2011-03-17 17:57:06,988 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Lookedup root region location, connection=org.
> {code}
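> What the middle two log lines show is a check-then-act race against ZooKeeper:
> the znode is deleted between the getData call and the follow-up exists call,
> then recreated. A minimal sketch of those two steps (hypothetical code, not the
> actual ZKUtil source; it assumes an already-connected client and reuses the
> path from the log):
> {code}
> import org.apache.zookeeper.KeeperException;
> import org.apache.zookeeper.ZooKeeper;
> import org.apache.zookeeper.data.Stat;
>
> public class ZnodeRace {
>   // Sketch of the two ZKUtil steps visible in the log above, not the real
>   // ZKUtil source. "zk" is an already-connected client.
>   static byte[] readUnassignedNode(ZooKeeper zk)
>       throws KeeperException, InterruptedException {
>     final String path = "/hbase/unassigned/1028785192";
>     try {
>       // Fails: the znode was deleted between the event firing and this read.
>       return zk.getData(path, true, new Stat());
>     } catch (KeeperException.NoNodeException e) {
>       // "Unable to get data of znode ... because node does not exist (not an error)"
>     }
>     // Succeeds: the znode was recreated in the meantime, so the watch is re-set.
>     // "Set watcher on existing znode ..."
>     zk.exists(path, true);
>     return null;
>   }
> }
> {code}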
> So the node is updated, then goes missing, then comes back! The reason we're
> stuck, though, is that CT.waitForMeta will wait forever:
> {code}
>     if (getMetaServerConnection(true) != null) {
>       return metaLocation;
>     }
>     while (!stopped && !metaAvailable.get() &&
>         (timeout == 0 || System.currentTimeMillis() < stop)) {
>       metaAvailable.wait(timeout);
>     }
> {code}
> So the first call to getMetaServerConnection didn't work, and since the
> timeout is 0, the wait(0) blocks forever. Even if it were looping,
> metaAvailable would never become true: we're already inside the event thread,
> so the very notification that would set it can never be delivered!
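> To make the failure mode concrete, here is a minimal, self-contained
> reproduction of the pattern (hypothetical code, not HBase's): ZooKeeper
> delivers every watcher callback from a single EventThread, so a callback that
> blocks on a condition that only a later callback can set stalls every event
> behind it.
> {code}
> import java.util.concurrent.BlockingQueue;
> import java.util.concurrent.LinkedBlockingQueue;
> import java.util.concurrent.atomic.AtomicBoolean;
>
> public class EventThreadDeadlock {
>   static final AtomicBoolean metaAvailable = new AtomicBoolean(false);
>   static final BlockingQueue<Runnable> events = new LinkedBlockingQueue<Runnable>();
>
>   public static void main(String[] args) throws Exception {
>     // Stand-in for ClientCnxn$EventThread: delivers callbacks one at a time.
>     new Thread(new Runnable() {
>       public void run() {
>         try {
>           while (true) {
>             events.take().run();  // the next event can't run until this returns
>           }
>         } catch (InterruptedException e) {
>           // exit
>         }
>       }
>     }, "EventThread").start();
>
>     // Event 1: like MetaNodeTracker.nodeDeleted -> CatalogTracker.waitForMeta.
>     events.put(new Runnable() {
>       public void run() {
>         synchronized (metaAvailable) {
>           while (!metaAvailable.get()) {
>             try {
>               metaAvailable.wait(0);  // wait(0) means "wait forever"
>             } catch (InterruptedException e) {
>               return;
>             }
>           }
>         }
>       }
>     });
>
>     // Event 2: would make META available, but it never runs -- the single
>     // EventThread is still parked inside event 1.
>     events.put(new Runnable() {
>       public void run() {
>         synchronized (metaAvailable) {
>           metaAvailable.set(true);
>           metaAvailable.notifyAll();
>         }
>       }
>     });
>   }
> }
> {code}
> Run it and the process hangs exactly like the jstack above: the EventThread is
> parked in wait(0) while the event that would wake it sits undelivered in the
> queue.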
> This is what happens when the RS then tries to update .META. after opening a
> region: the PostOpenDeployTasks thread blocks in the same waitForMeta until
> OpenRegionHandler gives up and interrupts it:
> {code}
> 2011-03-17 17:59:06,557 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Interrupting thread Thread[PostOpenDeployTasks:f06b630822a161d0a8fe37481d851d05,5,main]
> 2011-03-17 17:59:06,557 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Exception running postOpenDeployTasks; region=f06b630822a161d0a8fe37481d851d05
> org.apache.hadoop.hbase.NotAllMetaRegionsOnlineException: Interrupted
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:365)
>         at org.apache.hadoop.hbase.catalog.MetaEditor.updateRegionLocation(MetaEditor.java:142)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.postOpenDeployTasks(HRegionServer.java:1359)
> {code}
> The master eventually tries to assign the region somewhere else where it may
> work, but then the balancer will screw everything up again.
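> A general way to avoid this class of stall (a sketch of the pattern under the
> assumption that the re-check can be deferred; this is not the committed
> HBASE-3668 patch) is to never block on cluster state from inside a watcher
> callback: hand the wait off to a worker thread, and make the wait bounded so
> the condition is rechecked.
> {code}
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.atomic.AtomicBoolean;
>
> public class NonBlockingWatcherSketch {
>   static final AtomicBoolean metaAvailable = new AtomicBoolean(false);
>   // Worker that absorbs the blocking wait so the ZK EventThread stays free.
>   static final ExecutorService worker = Executors.newSingleThreadExecutor();
>
>   // Called on the ZK EventThread; returns immediately.
>   static void nodeDeleted(String path) {
>     worker.execute(new Runnable() {
>       public void run() {
>         waitForMeta();
>       }
>     });
>   }
>
>   // Also called on the ZK EventThread; can actually be delivered now,
>   // which ends the wait on the worker.
>   static void nodeCreated(String path) {
>     synchronized (metaAvailable) {
>       metaAvailable.set(true);
>       metaAvailable.notifyAll();
>     }
>   }
>
>   static void waitForMeta() {
>     synchronized (metaAvailable) {
>       while (!metaAvailable.get()) {
>         try {
>           metaAvailable.wait(1000);  // bounded wait, then recheck the condition
>         } catch (InterruptedException e) {
>           Thread.currentThread().interrupt();
>           return;
>         }
>       }
>     }
>   }
> }
> {code}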
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira