Dru Jensen wrote:
St.Ack. Thanks for your response.
I will enable DEBUG and rerun the MR processes to try and reproduce this.
Hadoop is reporting everything is healthy using fsck.
This is a test platform so the data is not critical but my confidence
is shaken (I sound like a day trader).
Understood.
Questions:
1. Is there anything specific I should be looking for when I enable
DEBUG?
That's a bit of a tough question; it requires some study to interpret. In
short, look at the lines before ERRORs and WARNINGs for anything that
might explain why the ERROR or WARNING happened
(NotServingRegionExceptions are part of 'normal' operation -- it's the
other exception types you are interested in).
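For reference, turning on DEBUG usually means editing the log4j config shipped in conf/. A rough sketch (logger names are the usual Hadoop/HBase package prefixes; exact file contents vary by release):

```properties
# conf/log4j.properties -- bump the HBase loggers from INFO to DEBUG
log4j.logger.org.apache.hadoop.hbase=DEBUG
# Optionally watch the DFS client too, since these errors involve HDFS reads
log4j.logger.org.apache.hadoop.dfs=DEBUG
```

Restart the daemons (or use the master/regionserver log-level UI if your release has one) for the change to take effect.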
2. Does "bad" mean I cannot recover and I need to delete
/hbase.rootdir and start over?
Not if fsck says all is ok. I said 'bad' because I thought you would
have to do the above.
Any events on your cluster that might have affected HDFS? A tsunami
hit? Or in your case, it wouldn't take much since replication was set
to one -- did a host crash?
3. Does HBase depend on replication for normal operation? In other
words, will it work without replication enabled?
It'll work fine without replication until you lose data. Thereafter,
it'll be hobbled by files with holes in them -- where the holes are
blocks that sat on the downed server.
Would suggest running with replication of 3 unless you have very good
reason -- and insurance against failure -- for doing otherwise.
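In case it helps, restoring the default replication looks roughly like this (the property lives in hadoop-site.xml on older Hadoop releases, hdfs-site.xml on newer ones):

```xml
<!-- hadoop-site.xml (or hdfs-site.xml on newer releases) -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Default block replication; 3 is the HDFS default.</description>
</property>
```

Note this only applies to files written afterward; to re-replicate existing files you would run something like `./bin/hadoop fs -setrep -R 3 /`.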
Go easy Dru,
St.Ack
On Sep 29, 2008, at 1:27 PM, stack wrote:
Dru Jensen wrote:
HBase was not responding to Thrift requests so I tried to restart
but it still looks frozen. I am seeing several error messages in
the hmaster logs after I attempted to restart hbase:
2008-09-29 12:55:23,744 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer: error opening
region {table},{key},1222453917858
java.io.IOException: Premeture EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
at
org.apache.hadoop.dfs.DFSClient$BlockReader.readChunk(DFSClient.java:967)
at
org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:236)
Enable DEBUG and it might tell you what it was trying to open at time
of the exception.
and:
2008-09-29 12:58:50,067 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer: error opening
region {table},{key},1222453917858
java.io.IOException: Could not obtain block: blk_-2905695662732817278
This is bad. What happens if you run './bin/hadoop fsck
/hbase.rootdir'? Your replication is one. That means if there is any
HDFS hiccup, data is lost. You might try putting replication back to
the default?
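A sketch of that fsck check (the extra flags print per-file block detail and are standard Hadoop fsck options; the path is the example rootdir from this thread):

```shell
# Check the HBase root directory for missing or under-replicated blocks
./bin/hadoop fsck /hbase.rootdir -files -blocks -locations
```

A healthy filesystem ends the report with "The filesystem under path ... is HEALTHY"; missing blocks show up as CORRUPT entries.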
St.Ack