[
https://issues.apache.org/jira/browse/HBASE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008518#comment-13008518
]
Ted Yu commented on HBASE-3668:
-------------------------------
@J-D: Can you post more of the log after this line?
{code}
2011-03-17 17:57:06,988 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Lookedup root region location, connection=org.
{code}
I also noticed that the refresh flag passed to getMetaServerConnection() is
always true. We may be able to remove the refresh parameter.
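For illustration only, a rough sketch of that cleanup (this is not the actual
CatalogTracker source; the names come from the stack traces in this issue and
the body is a placeholder):
{code}
// Sketch of the proposed cleanup, not the real CatalogTracker code.
// Every caller passes refresh=true today, so the boolean can be dropped
// and only the "always refresh" path kept.
class CatalogTrackerSketch {
  private volatile Object metaLocation;   // stand-in for the cached .META. location

  // before: getMetaServerConnection(boolean refresh), always invoked with true
  // after:  no parameter, always re-verifies the cached location
  Object getMetaServerConnection() {
    Object cached = metaLocation;
    // ... verify the cached location against the server, refresh it if stale ...
    return cached;
  }
}
{code}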
> CatalogTracker.waitForMeta can wait forever and totally stall a RS
> ------------------------------------------------------------------
>
> Key: HBASE-3668
> URL: https://issues.apache.org/jira/browse/HBASE-3668
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.90.1
> Reporter: Jean-Daniel Cryans
> Assignee: Jean-Daniel Cryans
> Priority: Blocker
> Fix For: 0.90.2
>
>
> We got into a weird situation after hitting HBASE-3664 that left almost half
> the region servers unable to open regions (we didn't kill the master, but all
> the RSs were restarted).
> Here's the relevant jstack from the region servers:
> {code}
> "regionserver60020-EventThread" daemon prio=10 tid=0x0000000041297800
> nid=0x5c5 in Object.wait() [0x00007fbd2b7f6000]
> java.lang.Thread.State: TIMED_WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> - waiting on <0x00007fbd4e1ac6d0> (a
> java.util.concurrent.atomic.AtomicBoolean)
> at
> org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:328)
> - locked <0x00007fbd4e1ac6d0> (a
> java.util.concurrent.atomic.AtomicBoolean)
> at
> org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:363)
> at
> org.apache.hadoop.hbase.zookeeper.MetaNodeTracker.nodeDeleted(MetaNodeTracker.java:64)
> at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.nodeCreated(ZooKeeperNodeTracker.java:154)
> - locked <0x00007fbd4d7ab060> (a
> org.apache.hadoop.hbase.zookeeper.MetaNodeTracker)
> at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.nodeDataChanged(ZooKeeperNodeTracker.java:179)
> - locked <0x00007fbd4d7ab060> (a
> org.apache.hadoop.hbase.zookeeper.MetaNodeTracker)
> at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:268)
> at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
> {code}
> Every server in that state cannot receive any further ZK event. Note how we
> first go through nodeDataChanged, then nodeCreated, then nodeDeleted. This
> correlates with the logs:
> {code}
> 2011-03-17 17:57:03,636 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: regionserver:60020-0x12d627b723e081f Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, path=/hbase/unassigned/1028785192
> 2011-03-17 17:57:03,636 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: regionserver:60020-0x12d627b723e081f Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-03-17 17:57:03,637 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: regionserver:60020-0x12d627b723e081f Set watcher on existing znode /hbase/unassigned/1028785192
> 2011-03-17 17:57:03,637 INFO org.apache.hadoop.hbase.zookeeper.MetaNodeTracker: Detected completed assignment of META, notifying catalog tracker
> 2011-03-17 17:57:06,988 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Lookedup root region location, connection=org.
> {code}
> The node is updated, then it's missing, then it comes back! The reason we're
> stuck is that CT.waitForMeta will wait forever:
> {code}
>     if (getMetaServerConnection(true) != null) {
>       return metaLocation;
>     }
>     while (!stopped && !metaAvailable.get() &&
>         (timeout == 0 || System.currentTimeMillis() < stop)) {
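>       // note: timeout is 0 on this code path, and Object.wait(0) blocks with no timeout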
>       metaAvailable.wait(timeout);
>     }
> {code}
> So the first call to getMetaServerConnection didn't work, and since the
> timeout is 0 it then waits indefinitely. Even if it were looping,
> metaAvailable would never become true because we're already inside the event
> thread, which blocks every other event!
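> The pattern can be shown outside HBase. Below is a minimal, self-contained
> sketch (plain Java, not HBase code, all names invented for illustration) of a
> handler that waits on the single event thread for a flag that only a later
> event on that same thread could ever set:
> {code}
> // Minimal illustration: a single event thread processes queued events one at
> // a time.  The first handler blocks in wait() on a flag that only the second
> // event would set, so the second event is never processed and the handler
> // never wakes up (a bounded wait is used here only so the demo terminates).
> import java.util.concurrent.BlockingQueue;
> import java.util.concurrent.LinkedBlockingQueue;
> import java.util.concurrent.atomic.AtomicBoolean;
>
> public class EventThreadSelfDeadlock {
>   static final AtomicBoolean metaAvailable = new AtomicBoolean(false);
>   static final BlockingQueue<Runnable> events = new LinkedBlockingQueue<>();
>
>   public static void main(String[] args) throws Exception {
>     Thread eventThread = new Thread(() -> {
>       try {
>         while (true) {
>           events.take().run();          // one event at a time, in order
>         }
>       } catch (InterruptedException ignored) { }
>     }, "eventThread");
>     eventThread.start();
>
>     // Event 1: the "meta node deleted" handler waits for meta to come back.
>     events.put(() -> {
>       synchronized (metaAvailable) {
>         try {
>           // With timeout == 0 this would be wait(0), i.e. wait forever.
>           metaAvailable.wait(3000);
>         } catch (InterruptedException ignored) { }
>         System.out.println("handler done, metaAvailable=" + metaAvailable.get());
>       }
>     });
>
>     // Event 2: the notification that would set the flag.  It is stuck in the
>     // queue behind event 1, so the notify never happens.
>     events.put(() -> {
>       synchronized (metaAvailable) {
>         metaAvailable.set(true);
>         metaAvailable.notifyAll();
>       }
>     });
>
>     eventThread.join(5000);
>     eventThread.interrupt();            // stop the demo thread
>   }
> }
> {code}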
> This is what happens when the RS then tries to update .META. after opening a
> region:
> {code}
> 2011-03-17 17:59:06,557 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Interrupting thread Thread[PostOpenDeployTasks:f06b630822a161d0a8fe37481d851d05,5,main]
> 2011-03-17 17:59:06,557 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Exception running postOpenDeployTasks; region=f06b630822a161d0a8fe37481d851d05
> org.apache.hadoop.hbase.NotAllMetaRegionsOnlineException: Interrupted
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:365)
>         at org.apache.hadoop.hbase.catalog.MetaEditor.updateRegionLocation(MetaEditor.java:142)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.postOpenDeployTasks(HRegionServer.java:1359)
> {code}
> The master eventually tries to assign the region somewhere else where it may
> work, but then the balancer will screw everything up again.