[
https://issues.apache.org/jira/browse/HBASE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008518#comment-13008518
]
Ted Yu commented on HBASE-3668:
-------------------------------
@J-D: Can you post more of the log after this line?
{code}
2011-03-17 17:57:06,988 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Lookedup root region location, connection=org.
{code}
I also noticed that the refresh flag passed to getMetaServerConnection() is
always true. We may be able to remove the refresh parameter.
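For illustration only, a rough sketch of that cleanup (this is not the actual
CatalogTracker source; the names come from the stack traces in this issue and
the body is a placeholder):
{code}
// Sketch of the proposed cleanup, not the real CatalogTracker code.
// Every caller passes refresh=true today, so the boolean can be dropped
// and only the "always refresh" path kept.
class CatalogTrackerSketch {
  private volatile Object metaLocation;   // stand-in for the cached .META. location

  // before: getMetaServerConnection(boolean refresh), always invoked with true
  // after:  no parameter, always re-verifies the cached location
  Object getMetaServerConnection() {
    Object cached = metaLocation;
    // ... verify the cached location against the server, refresh it if stale ...
    return cached;
  }
}
{code}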
> CatalogTracker.waitForMeta can wait forever and totally stall a RS
> ------------------------------------------------------------------
>
> Key: HBASE-3668
> URL: https://issues.apache.org/jira/browse/HBASE-3668
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.90.1
> Reporter: Jean-Daniel Cryans
> Assignee: Jean-Daniel Cryans
> Priority: Blocker
> Fix For: 0.90.2
>
>
> We got into a weird situation after hitting HBASE-3664 that left almost half
> the region servers unable to open regions (we didn't kill the master, but all
> the RSs were restarted).
> Here's the relevant jstack from the region servers:
> {code}
> "regionserver60020-EventThread" daemon prio=10 tid=0x0000000041297800
> nid=0x5c5 in Object.wait() [0x00007fbd2b7f6000]
> java.lang.Thread.State: TIMED_WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> - waiting on <0x00007fbd4e1ac6d0> (a
> java.util.concurrent.atomic.AtomicBoolean)
> at
> org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:328)
> - locked <0x00007fbd4e1ac6d0> (a
> java.util.concurrent.atomic.AtomicBoolean)
> at
> org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:363)
> at
> org.apache.hadoop.hbase.zookeeper.MetaNodeTracker.nodeDeleted(MetaNodeTracker.java:64)
> at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.nodeCreated(ZooKeeperNodeTracker.java:154)
> - locked <0x00007fbd4d7ab060> (a
> org.apache.hadoop.hbase.zookeeper.MetaNodeTracker)
> at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.nodeDataChanged(ZooKeeperNodeTracker.java:179)
> - locked <0x00007fbd4d7ab060> (a
> org.apache.hadoop.hbase.zookeeper.MetaNodeTracker)
> at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:268)
> at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
> {code}
> Every server in that state cannot receive any further ZK event. Note how we
> first go through nodeDataChanged, then nodeCreated, then nodeDeleted. This
> correlates with the logs:
> {code}
> 2011-03-17 17:57:03,636 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: regionserver:60020-0x12d627b723e081f Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, path=/hbase/unassigned/1028785192
> 2011-03-17 17:57:03,636 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: regionserver:60020-0x12d627b723e081f Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-03-17 17:57:03,637 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: regionserver:60020-0x12d627b723e081f Set watcher on existing znode /hbase/unassigned/1028785192
> 2011-03-17 17:57:03,637 INFO org.apache.hadoop.hbase.zookeeper.MetaNodeTracker: Detected completed assignment of META, notifying catalog tracker
> 2011-03-17 17:57:06,988 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Lookedup root region location, connection=org.
> {code}
> The node is updated, then it's missing, then it comes back! The reason we're
> stuck is that CT.waitForMeta will wait forever:
> {code}
>     if (getMetaServerConnection(true) != null) {
>       return metaLocation;
>     }
>     while (!stopped && !metaAvailable.get() &&
>         (timeout == 0 || System.currentTimeMillis() < stop)) {
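>       // note: timeout is 0 on this code path, and Object.wait(0) blocks with no timeout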
>       metaAvailable.wait(timeout);
>     }
> {code}
> So the first call to getMetaServerConnection didn't work, and since the
> timeout is 0 it then waits indefinitely. Even if it were looping,
> metaAvailable would never become true because we're already inside the event
> thread, which blocks every other event!
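> The pattern can be shown outside HBase. Below is a minimal, self-contained
> sketch (plain Java, not HBase code, all names invented for illustration) of a
> handler that waits on the single event thread for a flag that only a later
> event on that same thread could ever set:
> {code}
> // Minimal illustration: a single event thread processes queued events one at
> // a time.  The first handler blocks in wait() on a flag that only the second
> // event would set, so the second event is never processed and the handler
> // never wakes up (a bounded wait is used here only so the demo terminates).
> import java.util.concurrent.BlockingQueue;
> import java.util.concurrent.LinkedBlockingQueue;
> import java.util.concurrent.atomic.AtomicBoolean;
>
> public class EventThreadSelfDeadlock {
>   static final AtomicBoolean metaAvailable = new AtomicBoolean(false);
>   static final BlockingQueue<Runnable> events = new LinkedBlockingQueue<>();
>
>   public static void main(String[] args) throws Exception {
>     Thread eventThread = new Thread(() -> {
>       try {
>         while (true) {
>           events.take().run();          // one event at a time, in order
>         }
>       } catch (InterruptedException ignored) { }
>     }, "eventThread");
>     eventThread.start();
>
>     // Event 1: the "meta node deleted" handler waits for meta to come back.
>     events.put(() -> {
>       synchronized (metaAvailable) {
>         try {
>           // With timeout == 0 this would be wait(0), i.e. wait forever.
>           metaAvailable.wait(3000);
>         } catch (InterruptedException ignored) { }
>         System.out.println("handler done, metaAvailable=" + metaAvailable.get());
>       }
>     });
>
>     // Event 2: the notification that would set the flag.  It is stuck in the
>     // queue behind event 1, so the notify never happens.
>     events.put(() -> {
>       synchronized (metaAvailable) {
>         metaAvailable.set(true);
>         metaAvailable.notifyAll();
>       }
>     });
>
>     eventThread.join(5000);
>     eventThread.interrupt();            // stop the demo thread
>   }
> }
> {code}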
> This is what happens when the RS then tries to update .META. after opening a
> region:
> {code}
> 2011-03-17 17:59:06,557 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Interrupting thread Thread[PostOpenDeployTasks:f06b630822a161d0a8fe37481d851d05,5,main]
> 2011-03-17 17:59:06,557 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Exception running postOpenDeployTasks; region=f06b630822a161d0a8fe37481d851d05
> org.apache.hadoop.hbase.NotAllMetaRegionsOnlineException: Interrupted
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:365)
>         at org.apache.hadoop.hbase.catalog.MetaEditor.updateRegionLocation(MetaEditor.java:142)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.postOpenDeployTasks(HRegionServer.java:1359)
> {code}
> The master eventually tries to assign the region somewhere else where it may
> work, but then the balancer will screw everything up again.