After a lengthy but successful data ingestion run, I was running some queries against my HBase table when one of my region servers ran out of memory and became unresponsive. I shut down the HBase servers via "stop-hbase.sh", but the unresponsive region server didn't terminate, so I killed it with "kill" and then restarted the servers. Ever since then, whenever I try to access my table, the request stalls and eventually fails, and exceptions like the following appear in the log of one of the region servers (oddly enough, not the same one every time):
2009-01-29 13:07:50,439 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read:
java.io.IOException: Could not obtain block: blk_2439003473799601954_58348 file=/hbase/-ROOT-/70236052/info/mapfiles/2587717070724571438/data
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1708)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1536)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1663)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1593)
    at java.io.DataInputStream.readInt(DataInputStream.java:370)
    at org.apache.hadoop.hbase.io.SequenceFile$Reader.readRecordLength(SequenceFile.java:1909)
    at org.apache.hadoop.hbase.io.SequenceFile$Reader.next(SequenceFile.java:1939)
    at org.apache.hadoop.hbase.io.SequenceFile$Reader.next(SequenceFile.java:1844)
    at org.apache.hadoop.hbase.io.SequenceFile$Reader.next(SequenceFile.java:1890)
    at org.apache.hadoop.hbase.io.MapFile$Reader.next(MapFile.java:525)
    at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromMapFile(HStore.java:1714)
    at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1686)
    at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1088)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getClosestRowBefore(HRegionServer.java:1548)
    at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:632)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:895)
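One more data point before my questions: fsck comes back clean. The invocation was roughly the following (I'm reconstructing the path and flags from memory, so treat them as approximate):

  # approximate -- the path and flags may not exactly match what I ran
  bin/hadoop fsck /hbase -files -blocks -locations
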
So fsck thinks HDFS is healthy, yet reads against the table still fail. I'm guessing that something needed to be flushed from the region server I killed, and that the table is now in a corrupt state. I have a couple of questions:
- Is there a way to recover from this problem, or do I need to rerun my ingestion job?
- When a region server runs out of memory, is there a better way to kill it than the plain "kill" command? I've been reading the postings related to out-of-memory errors and plan to try some of the suggestions. However, if it does happen again, should I use one of the other scripts in the "bin" directory to do a graceful shutdown? (My current guess is sketched below.)
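To be concrete about the second question, this is what I'm guessing the graceful route would be; the script usage is my assumption, so please correct me if it's wrong:

  # my assumption -- is this the right way to stop a single region server gracefully?
  bin/hbase-daemon.sh stop regionserver

  # falling back to what I did this time only if the stop hangs
  kill <regionserver pid>
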
Hadoop 0.19.0
HBase 0.19.0
Thanks
Larry Compton