Hey guys,

Ran into some issues while testing and wanted to understand better what happened. I got the following exception when I went to the web UI:
Trying to contact region server 10.129.68.204:60020 for region .META.,,1, row '', but failed after 3 attempts.
Exceptions:
org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: .META.,,1
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2254)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1837)
    at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)

From a program that reads from an HBase table:

java.lang.reflect.UndeclaredThrowableException
    at $Proxy1.getRegionInfo(Unknown Source)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:985)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:625)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:601)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:675)
    <snip>
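(For reference, a minimal reader along these lines exercises the same locateRootRegion/locateRegionInMeta path as the trace above - this is just a sketch, not the actual program, and the table name 'test1' is only assumed from the createTable call in the master log further down.)

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class TestReader {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    // Locating the table forces the client through locateRootRegion /
    // locateRegionInMeta, which is where the UndeclaredThrowableException
    // above surfaces when .META. is not being served.
    HTable table = new HTable(conf, "test1");   // table name assumed
    ResultScanner scanner = table.getScanner(new Scan());
    try {
      for (Result row : scanner) {
        System.out.println(row);
      }
    } finally {
      scanner.close();
    }
  }
}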
Following up in the HMaster's log:

2010-01-28 11:21:16,148 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 1 row(s) of meta region {server: 10.129.68.204:60020, regionname: .META.,,1, startKey: <>} complete
2010-01-28 11:21:16,148 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
2010-01-28 11:21:34,539 DEBUG org.apache.hadoop.hbase.master.ServerManager: Received report from unknown server -- telling it to MSG_CALL_SERVER_STARTUP: 10.129.68.203,60020,1263605543210
2010-01-28 11:21:35,622 INFO org.apache.hadoop.hbase.master.ServerManager: Received start message from: hbasetest004.ash1.facebook.com,60020,1264706494600
2010-01-28 11:21:36,649 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Updated ZNode /hbase/rs/1264706494600 with data 10.129.68.203:60020
2010-01-28 11:21:40,704 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 39 on 60000, call createTable({NAME => 'test1', FAMILIES => [{NAME => 'cf1', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}) from 10.131.29.183:63308: error: org.apache.hadoop.hbase.TableExistsException: test1
org.apache.hadoop.hbase.TableExistsException: test1
    at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:792)
    at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:756)
    at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)

From an HRegionServer's logs:

2010-01-28 11:20:22,589 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=19.661453MB (20616528), Free=2377.0137MB (2492479408), Max=2396.675MB (2513095936), Counts: Blocks=0, Access=0, Hit=0, Miss=0, Evictions=0, Evicted=0, Ratios: Hit Ratio=NaN%, Miss Ratio=NaN%, Evicted/Run=NaN
2010-01-28 11:21:22,588 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=19.661453MB (20616528), Free=2377.0137MB (2492479408), Max=2396.675MB (2513095936), Counts: Blocks=0, Access=0, Hit=0, Miss=0, Evictions=0, Evicted=0, Ratios: Hit Ratio=NaN%, Miss Ratio=NaN%, Evicted/Run=NaN
2010-01-28 11:22:18,794 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_CALL_SERVER_STARTUP

The code says the following:

      case MSG_CALL_SERVER_STARTUP:
        // We the MSG_CALL_SERVER_STARTUP on startup but we can also
        // get it when the master is panicking because for instance
        // the HDFS has been yanked out from under it. Be wary of
        // this message.

Any ideas on what is going on? The best I can come up with is flaky DNS - would that explain this? It happened on three of our test clusters at almost the same time.

Also, what is the most graceful/simplest way to recover from this?

Thanks,
Karthik
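P.S. On the DNS theory - a quick way to see whether forward and reverse lookups agree across the hosts involved is something like the sketch below (plain JDK; the hostnames are just the ones that show up in the logs above):

import java.net.InetAddress;

public class DnsCheck {
  public static void main(String[] args) throws Exception {
    // Hosts taken from the logs above; run this from the master and from
    // each region server and compare the output across machines.
    String[] hosts = { "hbasetest004.ash1.facebook.com", "10.129.68.203", "10.129.68.204" };
    for (String h : hosts) {
      InetAddress addr = InetAddress.getByName(h);        // forward lookup
      System.out.println(h + " -> " + addr.getHostAddress()
          + " -> " + addr.getCanonicalHostName());        // reverse lookup
    }
  }
}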