Sorry, I'm stuck in an airport about to get on planes for 24+ hours, so I have limited time.

Full logs available at:  
https://tm-files.s3.amazonaws.com/20100618-test0-webtable-crash.tar.bz2

I witnessed a problem while testing the latest build of our TM-2 HBase+Hadoop 
internal distribution last night. TM-2 is HBase 0.20.5-dev (the latest RC) with 
no HBase patches. Where we diverge from upstream for this build is HDFS: we use 
Hadoop 0.20.2 plus 0.20-append branch changesets plus a couple of other HDFS 
patches. 

The divergence between our branch and upstream may ONLY be material with regard 
to why a region server crashed in the first place. (That is a separate issue I 
am looking into.) What I am reporting now is a possible problem with the master. 

As far as I can see, in this test scenario the master never reassigns regions 
away from a crashed RS. 
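
For reference, one way to double-check the assignments is to scan .META. and 
look for rows whose info:server column still points at the dead RS. A minimal 
sketch, assuming the stock 0.20 client API (the address is the crashed RS from 
the logs below):

// Minimal diagnostic sketch, assuming the stock 0.20 client API.
// Scans .META. and prints any region whose info:server column still
// points at the crashed region server.
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class FindStaleAssignments {
  public static void main(String[] args) throws IOException {
    String deadServer = "10.222.238.255:60020";  // crashed RS from the logs below
    HTable meta = new HTable(new HBaseConfiguration(), HConstants.META_TABLE_NAME);
    Scan scan = new Scan();
    scan.addFamily(HConstants.CATALOG_FAMILY);   // the "info" family
    ResultScanner scanner = meta.getScanner(scan);
    try {
      for (Result r : scanner) {
        byte[] server = r.getValue(HConstants.CATALOG_FAMILY, HConstants.SERVER_QUALIFIER);
        if (server != null && deadServer.equals(Bytes.toString(server))) {
          System.out.println(Bytes.toString(r.getRow()) + " still assigned to " + deadServer);
        }
      }
    } finally {
      scanner.close();
    }
  }
}

If the master had reassigned everything, nothing should come back for the dead 
address.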


***
CLIENT

The client sees the RS hosting ROOT and META die.

10/06/19 07:59:38 DEBUG client.HConnectionManager$TableServers: 
locateRegionInMeta attempt 0 of 10 failed; retrying after sleep of 1000 
because: Call to /10.222.238.255:60020 failed on local exception: 
java.io.EOFException
10/06/19 07:59:38 DEBUG client.HConnectionManager$TableServers: Removed 
.META.,,1 for tableName=.META. from cache because of 
TestTable,9f9a5c1e-f4c3-b955-85b2-f60636a9cb25,99999999999999
10/06/19 07:59:38 DEBUG client.HConnectionManager$TableServers: 
locateRegionInMeta attempt 0 of 10 failed; retrying after sleep of 1000 
because: Connection refused
10/06/19 07:59:38 DEBUG zookeeper.ZooKeeperWrapper: Read ZNode 
/hbase/root-region-server got 10.222.238.255:60020
10/06/19 07:59:38 DEBUG client.HConnectionManager$TableServers: Root region 
location changed. Sleeping.
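
Aside: the "attempt 0 of 10" and "sleep of 1000" above are just the stock client 
retry settings. A minimal sketch, with example values, of how those knobs can be 
widened on a test client so it rides out a longer outage:

// Sketch only, example values: the client retry knobs behind the
// "attempt N of 10" / "sleep of 1000" lines above.
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

public class ClientRetrySettings {
  public static void main(String[] args) throws IOException {
    HBaseConfiguration conf = new HBaseConfiguration();
    conf.setInt("hbase.client.retries.number", 20);  // default is 10
    conf.setLong("hbase.client.pause", 2000);        // default is 1000 ms, grows with each retry
    HTable table = new HTable(conf, "TestTable");
    // ... use the table as usual; region lookups now retry longer before giving up
  }
}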

***
MASTER

HMaster also sees the RS hosting ROOT and META die.

2010-06-19 07:59:35,424 INFO org.apache.hadoop.hbase.master.ServerManager: 
Processing MSG_REPORT_OPEN: 
TestTable,8b241b8b-4ab0-f252-8d18-0949c7a918bf,1276948774514 from 
domU-12-31-38-04-ED-D1.compute-1.internal,60020,1276920183192; 1 of 1
2010-06-19 07:59:35,625 INFO 
org.apache.hadoop.hbase.master.RegionServerOperation: Updated row 
TestTable,8b35278f-56fa-4955-93fe-2ff7c4163dc3,1276948774514 in region 
.META.,,1 with startcode=1276920183192, server=10.220.238.31:60020
2010-06-19 07:59:35,625 INFO 
org.apache.hadoop.hbase.master.RegionServerOperation: 
TestTable,8b241b8b-4ab0-f252-8d18-0949c7a918bf,1276948774514 open on 
10.220.238.31:60020
2010-06-19 07:59:35,704 INFO 
org.apache.hadoop.hbase.master.RegionServerOperation: Updated row 
TestTable,8b241b8b-4ab0-f252-8d18-0949c7a918bf,1276948774514 in region 
.META.,,1 with startcode=1276920183192, server=10.220.238.31:60020
2010-06-19 08:00:04,370 INFO org.apache.hadoop.hbase.master.BaseScanner: 
RegionManager.rootScanner scanning meta region {server: 10.222.238.255:60020, 
regionname: -ROOT-,,0, startKey: <>}
2010-06-19 08:00:04,373 WARN org.apache.hadoop.hbase.master.BaseScanner: Scan 
ROOT region
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
        at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
        at 
org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:308)
        at 
org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:843)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:715)
        at 
org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
        at $Proxy1.openScanner(Unknown Source)
        at 
org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:177)
        at 
org.apache.hadoop.hbase.master.RootScanner.scanRoot(RootScanner.java:54)
        at 
org.apache.hadoop.hbase.master.RootScanner.maintenanceScan(RootScanner.java:79)
        at 
org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:153)
        at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
2010-06-19 08:00:04,532 INFO org.apache.hadoop.hbase.master.ServerManager: 5 
region servers, 0 dead, average load 456.4
2010-06-19 08:00:07,399 INFO org.apache.hadoop.hbase.master.BaseScanner: 
RegionManager.metaScanner scanning meta region {server: 10.222.238.255:60020, 
regionname: .META.,,1, startKey: <>}
2010-06-19 08:00:07,400 WARN org.apache.hadoop.hbase.master.BaseScanner: Scan 
one META region: {server: 10.222.238.255:60020, regionname: .META.,,1, 
startKey: <>}
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
        at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
        at 
org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:308)
        at 
org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:843)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:715)
        at 
org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
        at $Proxy1.openScanner(Unknown Source)
        at 
org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:177)
        at 
org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
        at 
org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
        at 
org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:153)
        at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
2010-06-19 08:00:07,407 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 
.META. region(s) scanned
2010-06-19 08:00:29,708 INFO org.apache.hadoop.hbase.master.ServerManager: 
domU-12-31-38-06-EC-F1.compute-1.internal,60020,1276920180076 znode expired
2010-06-19 08:00:29,710 INFO org.apache.hadoop.hbase.master.RegionManager: 
-ROOT- region unset (but not set to be reassigned)
2010-06-19 08:00:29,710 INFO org.apache.hadoop.hbase.master.RegionManager: META 
region removed from onlineMetaRegions
2010-06-19 08:00:29,711 INFO 
org.apache.hadoop.hbase.master.RegionServerOperation: process shutdown of 
server domU-12-31-38-06-EC-F1.compute-1.internal,60020,1276920180076: logSplit: 
false, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 0
2010-06-19 08:00:29,756 INFO org.apache.hadoop.hbase.regionserver.HLog: 
Splitting 38 hlog(s) in 
hdfs://domU-12-31-38-06-EC-81.compute-1.internal:8020/hbase/.logs/domU-12-31-38-06-EC-F1.compute-1.internal,60020,1276920180076
2010-06-19 08:00:34,979 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:00:38,418 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:00:42,021 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:00:47,539 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:00:49,649 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:00:51,761 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:00:57,992 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:01:00,297 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:01:02,206 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:01:04,827 INFO org.apache.hadoop.hbase.master.ServerManager: 4 
region servers, 1 dead, average load 
456.25[domU-12-31-38-06-EC-F1.compute-1.internal,60020,1276920180076]
2010-06-19 08:01:07,690 INFO org.apache.hadoop.hbase.master.BaseScanner: All 0 
.META. region(s) scanned
2010-06-19 08:01:08,042 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:01:09,670 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:01:11,710 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:01:16,659 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:01:19,058 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:01:20,908 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:01:25,805 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:01:27,674 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:01:29,533 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:01:33,806 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:01:36,320 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:01:38,512 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:01:44,046 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:01:46,109 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:01:48,316 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:01:54,333 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:01:56,089 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:01:58,738 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:02:03,269 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:02:04,829 INFO org.apache.hadoop.hbase.master.ServerManager: 4 
region servers, 1 dead, average load 
456.25[domU-12-31-38-06-EC-F1.compute-1.internal,60020,1276920180076]
2010-06-19 08:02:04,908 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:02:07,420 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:02:07,692 INFO org.apache.hadoop.hbase.master.BaseScanner: All 0 
.META. region(s) scanned
2010-06-19 08:02:12,709 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:02:15,076 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:02:18,160 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:02:23,213 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:02:25,326 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:02:27,388 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:02:31,922 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:02:32,584 INFO org.apache.hadoop.hbase.regionserver.HLog: Failed 
open for append, waiting on lease recovery: 
hdfs://domU-12-31-38-06-EC-81.compute-1.internal/hbase/.logs/domU-12-31-38-06-EC-F1.compute-1.internal,60020,1276920180076/hlog.dat.1276948769573
org.apache.hadoop.ipc.RemoteException: 
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create 
file 
/hbase/.logs/domU-12-31-38-06-EC-F1.compute-1.internal,60020,1276920180076/hlog.dat.1276948769573
 for DFSClient_1937332980 on client 10.222.239.111, because this file is 
already being created by NN_Recovery on 10.222.238.255
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1166)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1258)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:396)
        at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1044)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1040)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1038)

        at org.apache.hadoop.ipc.Client.call(Client.java:740)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at $Proxy0.append(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at $Proxy0.append(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:521)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:186)
        at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:530)
        at org.apache.hadoop.hbase.regionserver.HLog.recoverLog(HLog.java:1203)
        at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:959)
        at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:872)
        at 
org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:298)
        at 
org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:497)
        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:432)
2010-06-19 08:02:33,591 INFO org.apache.hadoop.hbase.regionserver.HLog: Failed 
open for append, waiting on lease recovery: 
hdfs://domU-12-31-38-06-EC-81.compute-1.internal/hbase/.logs/domU-12-31-38-06-EC-F1.compute-1.internal,60020,1276920180076/hlog.dat.1276948769573
org.apache.hadoop.ipc.RemoteException: 
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create 
file 
/hbase/.logs/domU-12-31-38-06-EC-F1.compute-1.internal,60020,1276920180076/hlog.dat.1276948769573
 for DFSClient_1937332980 on client 10.222.239.111, because this file is 
already being created by NN_Recovery on 10.222.238.255
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1166)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1258)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:396)
        at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1044)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1040)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1038)

        at org.apache.hadoop.ipc.Client.call(Client.java:740)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at $Proxy0.append(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at $Proxy0.append(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:521)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:186)
        at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:530)
        at org.apache.hadoop.hbase.regionserver.HLog.recoverLog(HLog.java:1203)
        at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:959)
        at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:872)
        at 
org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:298)
        at 
org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:497)
        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:432)
2010-06-19 08:02:34,597 INFO org.apache.hadoop.hbase.regionserver.HLog: Failed 
open for append, waiting on lease recovery: 
hdfs://domU-12-31-38-06-EC-81.compute-1.internal/hbase/.logs/domU-12-31-38-06-EC-F1.compute-1.internal,60020,1276920180076/hlog.dat.1276948769573
org.apache.hadoop.ipc.RemoteException: 
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create 
file 
/hbase/.logs/domU-12-31-38-06-EC-F1.compute-1.internal,60020,1276920180076/hlog.dat.1276948769573
 for DFSClient_1937332980 on client 10.222.239.111, because this file is 
already being created by NN_Recovery on 10.222.238.255
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1166)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1258)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:396)
        at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1044)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1040)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1038)

        at org.apache.hadoop.ipc.Client.call(Client.java:740)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at $Proxy0.append(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at $Proxy0.append(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:521)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:186)
        at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:530)
        at org.apache.hadoop.hbase.regionserver.HLog.recoverLog(HLog.java:1203)
        at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:959)
        at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:872)
        at 
org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:298)
        at 
org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:497)
        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:432)
2010-06-19 08:02:36,330 INFO org.apache.hadoop.hbase.regionserver.HLog: Past 
out lease recovery
2010-06-19 08:02:49,498 INFO org.apache.hadoop.hbase.regionserver.HLog: hlog 
file splitting completed in 139783 millis for 
hdfs://domU-12-31-38-06-EC-81.compute-1.internal:8020/hbase/.logs/domU-12-31-38-06-EC-F1.compute-1.internal,60020,1276920180076
2010-06-19 08:02:49,498 INFO 
org.apache.hadoop.hbase.master.RegionServerOperation: Log split complete, meta 
reassignment and scanning:

*** Log recovery complete.
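
Aside on the noise above: "Failed open for append, waiting on lease recovery" 
plus the AlreadyBeingCreatedException is what I'd expect while the NameNode 
still holds the lease on the hlog the dead RS had open. For anyone unfamiliar, 
a minimal sketch of the append-and-retry idea (not the actual HLog.recoverLog 
code, just the general pattern implied by the log):

// Minimal sketch of the append-and-retry pattern implied by the log above;
// NOT the actual HLog.recoverLog code. Re-open the dead RS's hlog for append
// so the NameNode recovers the old writer's lease, retrying while the file
// is still held by NN_Recovery.
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException;
import org.apache.hadoop.ipc.RemoteException;

public class LeaseRecoverySketch {
  static void recoverLease(FileSystem fs, Path hlog)
      throws IOException, InterruptedException {
    while (true) {
      try {
        fs.append(hlog).close();  // forces lease recovery on the file
        return;
      } catch (RemoteException re) {
        IOException unwrapped = re.unwrapRemoteException(AlreadyBeingCreatedException.class);
        if (!(unwrapped instanceof AlreadyBeingCreatedException)) {
          throw unwrapped;
        }
        Thread.sleep(1000);  // previous holder's lease not expired yet; wait and retry
      }
    }
  }
}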

2010-06-19 08:02:49,498 INFO 
org.apache.hadoop.hbase.master.RegionServerOperation: ProcessServerShutdown 
reassigning ROOT region
2010-06-19 08:02:49,498 INFO org.apache.hadoop.hbase.master.RegionManager: 
-ROOT- region unset (but not set to be reassigned)
2010-06-19 08:02:49,498 INFO org.apache.hadoop.hbase.master.RegionManager: ROOT 
inserted into regionsInTransition
2010-06-19 08:02:49,498 INFO 
org.apache.hadoop.hbase.master.RegionServerOperation: ProcessServerShutdown 
setting to unassigned: {server: 10.222.238.255:60020, regionname: .META.,,1, 
startKey: <>}
2010-06-19 08:02:49,724 INFO org.apache.hadoop.hbase.master.RegionManager: 
Assigning region -ROOT-,,0 to 
domU-12-31-38-01-71-B1.compute-1.internal,60020,1276920193543
2010-06-19 08:02:49,796 INFO org.apache.hadoop.hbase.master.ServerManager: 
Processing MSG_REPORT_OPEN: -ROOT-,,0 from 
domU-12-31-38-01-71-B1.compute-1.internal,60020,1276920193543; 1 of 1

*** ROOT is reopened

2010-06-19 08:02:49,800 INFO org.apache.hadoop.hbase.master.BaseScanner: 
RegionManager.rootScanner scanning meta region {server: 10.253.118.63:60020, 
regionname: -ROOT-,,0, startKey: <>}
2010-06-19 08:02:49,854 INFO org.apache.hadoop.hbase.master.BaseScanner: 
RegionManager.rootScanner scan of 1 row(s) of meta region {server: 
10.253.118.63:60020, regionname: -ROOT-,,0, startKey: <>} complete
2010-06-19 08:02:49,854 INFO org.apache.hadoop.hbase.master.BaseScanner: 
RegionManager.rootScanner scanning meta region {server: 10.253.118.63:60020, 
regionname: -ROOT-,,0, startKey: <>}
2010-06-19 08:02:49,861 INFO org.apache.hadoop.hbase.master.BaseScanner: 
RegionManager.rootScanner scan of 1 row(s) of meta region {server: 
10.253.118.63:60020, regionname: -ROOT-,,0, startKey: <>} complete
2010-06-19 08:02:49,888 INFO org.apache.hadoop.hbase.master.RegionManager: 
Assigning region .META.,,1 to 
domU-12-31-38-04-ED-D1.compute-1.internal,60020,1276920183192
2010-06-19 08:03:04,831 INFO org.apache.hadoop.hbase.master.ServerManager: 4 
region servers, 1 dead, average load 
456.5[domU-12-31-38-06-EC-F1.compute-1.internal,60020,1276920180076]
2010-06-19 08:03:07,693 INFO org.apache.hadoop.hbase.master.BaseScanner: All 0 
.META. region(s) scanned
2010-06-19 08:03:11,618 INFO org.apache.hadoop.hbase.master.ServerManager: 
Processing MSG_REPORT_OPEN: .META.,,1 from 
domU-12-31-38-04-ED-D1.compute-1.internal,60020,1276920183192; 1 of 1
2010-06-19 08:03:11,619 INFO 
org.apache.hadoop.hbase.master.RegionServerOperation: .META.,,1 open on 
10.220.238.31:60020
2010-06-19 08:03:11,647 INFO 
org.apache.hadoop.hbase.master.RegionServerOperation: Updated row .META.,,1 in 
region -ROOT-,,0 with startcode=1276920183192, server=10.220.238.31:60020

*** META is reopened


***
CLIENT

After ROOT and META are reopened

*** Finds new ROOT

10/06/19 08:02:53 DEBUG client.HConnectionManager$TableServers: Wake. Retry 
finding root region.
10/06/19 08:02:53 DEBUG zookeeper.ZooKeeperWrapper: Read ZNode 
/hbase/root-region-server got 10.253.118.63:60020
10/06/19 08:02:53 DEBUG client.HConnectionManager$TableServers: Found ROOT at 
10.253.118.63:60020
10/06/19 08:02:53 DEBUG zookeeper.ZooKeeperWrapper: Read ZNode 
/hbase/root-region-server got 10.253.118.63:60020
10/06/19 08:02:53 DEBUG client.HConnectionManager$TableServers: Found ROOT at 
10.253.118.63:60020

*** Finds new META

10/06/19 08:03:15 DEBUG client.HConnectionManager$TableServers: 
locateRegionInMeta attempt 8 of 10 failed; retrying after sleep of 16000 
because: Connection refused
10/06/19 08:03:15 DEBUG client.HConnectionManager$TableServers: Removed 
.META.,,1 for tableName=.META. from cache because of 
TestTable,aad6db21-f4c3-e255-96d1-3bc9a9339734,99999999999999
10/06/19 08:03:15 DEBUG client.HConnectionManager$TableServers: Cached location 
for .META.,,1 is 10.220.238.31:60020

*** Tries to relocate 00771560-c9e1-7a50-9183-ac99e0be03b8,1276935951367, which 
was hosted on the dead RS

10/06/19 08:04:10 DEBUG client.HConnectionManager$TableServers: Removed 
TestTable,00771560-c9e1-7a50-9183-ac99e0be03b8,1276935951367 for 
tableName=TestTable from cache because of 0094d9bd-41ad-8c51-a266-1e54a277d247
10/06/19 08:04:10 DEBUG client.HConnectionManager$TableServers: Cached location 
for TestTable,00771560-c9e1-7a50-9183-ac99e0be03b8,1276935951367 is 
10.222.238.255:60020
10/06/19 08:04:10 DEBUG client.HConnectionManager$TableServers: Removed 
TestTable,00771560-c9e1-7a50-9183-ac99e0be03b8,1276935951367 for 
tableName=TestTable from cache because of 008169ff-7ad8-f05d-854f-954bc10de3f3

*** Still trying hours later (this region among others)

10/06/19 11:38:08 DEBUG client.HConnectionManager$TableServers: Cached location 
for TestTable,00771560-c9e1-7a50-9183-ac99e0be03b8,1276935951367 is 
10.222.238.255:60020
10/06/19 11:38:08 DEBUG client.HConnectionManager$TableServers: Removed 
TestTable,00771560-c9e1-7a50-9183-ac99e0be03b8,1276935951367 for 
tableName=TestTable from cache because of 008169ff-7ad8-f05d-854f-954bc10de3f3
10/06/19 11:38:08 DEBUG client.HConnectionManager$TableServers: Cached location 
for TestTable,00771560-c9e1-7a50-9183-ac99e0be03b8,1276935951367 is 
10.222.238.255:60020
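
A minimal client-side check of the same thing, assuming the stock 0.20 client 
API (the row is the start key of the region from the log above):

// Minimal sketch, stock 0.20 client API: ask the client where it thinks the
// region containing this row lives.
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.client.HTable;

public class WhereIsMyRegion {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable(new HBaseConfiguration(), "TestTable");
    // Start key of TestTable,00771560-c9e1-7a50-9183-ac99e0be03b8,1276935951367
    HRegionLocation loc =
        table.getRegionLocation("00771560-c9e1-7a50-9183-ac99e0be03b8");
    System.out.println(loc.getRegionInfo().getRegionNameAsString()
        + " -> " + loc.getServerAddress());
  }
}

If .META. still points at the dead RS, this prints 10.222.238.255:60020, which 
is consistent with the DEBUG lines above.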


***
MASTER

So what happened with 
TestTable,00771560-c9e1-7a50-9183-ac99e0be03b8,1276935951367?

These are the last entries in the master log regarding this region:

2010-06-19 04:25:55,646 INFO 
org.apache.hadoop.hbase.master.RegionServerOperation: 
TestTable,00771560-c9e1-7a50-9183-ac99e0be03b8,1276935951367 open on 
10.222.238.255:60020
2010-06-19 04:25:55,663 INFO 
org.apache.hadoop.hbase.master.RegionServerOperation: Updated row 
TestTable,00771560-c9e1-7a50-9183-ac99e0be03b8,1276935951367 in region 
.META.,,1 with startcode=1276920180076, server=10.222.238.255:60020

The RS failure happened at around 10/06/19 07:59.

The master never tries to reassign this or other regions hosted on the dead RS.

The master processes MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS messages that come in 
over time from this point, but never a MSG_REPORT_OPEN or MSG_REPORT_CLOSE.

Do I have this right?

I admit I've gone through these logs hastily as I'm pressed for time.

   - Andy