On Wed, Oct 21, 2009 at 8:16 AM, elsif <[email protected]> wrote:
>
> There are 239 "Block blk_-xxx is not valid" errors, 522 "BlockInfo not
> found in volumeMap" errors, and 208 "BlockAlreadyExistsException" found
> in the hadoop logs over 12 hours of running the test.
>
Are the above from the application-level (hbase) logs or from the
datanode logs? If you trace any of them -- follow the block name --
in the NN, are the blocks lost, or do you see replicas taking over or
recoveries being triggered? (There's a rough log-grepping sketch
after my sig.)

> I understand that I am loading the cluster - that is the point of the
> test, but I don't think that this should result in data loss. Failed
> inserts at the client level I can handle, but loss of data that was
> previously thought to be stored in hbase is a major issue. Are there
> plans to make hbase more resilient to load based failures?
>

It looks like there will be data loss, going by a few of the
exceptions you provided originally. Here are a few comments:

"No live nodes contain current block"

Usually we see this if the client-side hadoop has not been patched
with HDFS-127/HADOOP-4681. Your test program doesn't seem to have
come through. Mind attaching it to an issue so I can try it? Going by
the way you started your test program, you should have the
hbase-patched hadoop first in your CLASSPATH, so you should be OK,
but maybe something about your environment is frustrating hbase's use
of a patched hadoop? (The second sketch below shows a quick way to
check which hadoop jar is actually being loaded.)

"java.io.IOException: TIMED OUT"

Your regionserver or master timed out its zk session. Is it GC,
swapping, or a disk used by zk that is under heavy i/o load?

"ClosedChannelException"

Probably a symptom of an RS shutdown caused by events such as the
above.

"Abandoning block..."

Did this write to the HLog fail? It's just an INFO-level log out of
DFSClient.

"file system not available"

What happened before this? Was it just an emission on regionserver
shutdown?

St.Ack
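P.S. If you want to trace a block's lifecycle, something like the
below (untested, off the top of my head) will pull every log line
that mentions a given block id out of the NN and DN logs, so you can
see whether replication or recovery kicked in after the "is not
valid" errors. The log file names you pass it are whatever your
hadoop logs are actually called:

  // Rough sketch only: grep NN/DN logs for one block id so you can
  // follow it through allocation, replication, invalidation, and
  // recovery.
  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;

  public class BlockTrace {
      public static void main(String[] args) throws IOException {
          if (args.length < 2) {
              System.err.println("Usage: BlockTrace <blockId> <log> [<log>...]");
              return;
          }
          String blockId = args[0];  // e.g. blk_-1234567890 (made up id)
          for (int i = 1; i < args.length; i++) {
              BufferedReader in = new BufferedReader(new FileReader(args[i]));
              try {
                  String line;
                  while ((line = in.readLine()) != null) {
                      // Keep any line touching the block: replication
                      // and deletion requests, "is not valid", lease
                      // recovery, etc.
                      if (line.indexOf(blockId) != -1) {
                          System.out.println(args[i] + ": " + line);
                      }
                  }
              } finally {
                  in.close();
              }
          }
      }
  }

Run it against the namenode log first; if the last mention of the
block is an invalidation with no subsequent re-replication, that
block is gone.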

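P.P.S. On the CLASSPATH question: a quick way to verify which hadoop
the JVM actually resolves is to ask where DFSClient was loaded from.
Another untested sketch; compile and run it with the exact CLASSPATH
your regionserver uses:

  // Prints the jar DFSClient was loaded from. If it is not the
  // patched hadoop that ships with hbase, CLASSPATH ordering is
  // your problem.
  public class WhichHadoop {
      public static void main(String[] args) {
          System.out.println(org.apache.hadoop.hdfs.DFSClient.class
              .getProtectionDomain().getCodeSource().getLocation());
      }
  }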