Thanks for the nice update N.

Regards
Ram

> -----Original Message-----
> From: n keywal [mailto:[email protected]]
> Sent: Wednesday, September 12, 2012 12:27 AM
> To: [email protected]
> Subject: mttr update
>
> Hi All,
>
> There is some progress on MTTR. It's detailed in HBASE-5843, but here is a
> synthesis.
> That's the server side view, including the edits split & replaying, and
> considering a timeout of 30s in ZK.
>
> 1) Region Server crash. We can expect 50s on 0.94; 20s on 0.96:
> - the failure will be detected immediately on 0.96, after 30s on 0.94
> - the distributed split seems to work well (i.e. distribute well)
> - the assignment seems to be dominated by replaying the local edits, and
>   should scale well on a reasonable cluster.
>
> 2) Single Box failure (regionserver + datanode): 0.94: often around 10
> minutes. 0.96 (actually HDFS-3703): 50s.
> - It's random. The more data to split, the more chance you have to be
>   directed to the dead datanode. With little data in the memstore, it's
>   like 1).
> - The results come from HDFS-3703: we're not directed to the dead
>   datanodes anymore. It's not yet in the official hdfs release.
> - When directed to a dead datanode, HBase/HDFS retries on the same
>   datanode instead of moving to another one (HBASE-6751)
> - Distributed Split resubmits the tasks too fast (HBASE-6738)
>
> 3) Going further:
> - 3703 simplifies a lot of things, because we've got far fewer errors from
>   the underlying file system when a box dies. So in production it's gonna
>   be quite useful in many cases. It would be dangerous to rely too much on
>   it, i.e. being inconsistent or totally inefficient when we've got
>   datanode errors. HBASE-6738 is a good example: when there is no datanode
>   error it does not show up; it does not mean we don't have a problem.
> - There are still the nasty cases, i.e. losing meta/root, or mixing a
>   failure with a heavy workload (workload increases during failure) and
>   many other things like this.
> - For reliability and safety, not writing the log locally could be
>   important. That's HDFS-3706.
> - These tests are from the server point of view. There could be corner
>   cases if looked at from a client point of view.
> - And we could do things differently to serve writes and some reads
>   immediately (HBASE-6752)
> - Decreasing the detection time will become more and more important.
>   (HBASE-6290, ZOOKEEPER-702, ZOOKEEPER-922, ...)
>
> That's all folks! :-)
>
> Nicolas
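
For reference, the arithmetic behind the figures in 1) can be sketched as below. This is only a back-of-the-envelope model built from the numbers quoted in the mail; the function name and the split into "detection" and "recovery" are illustrative assumptions, and the ~20s recovery figure is inferred from the quoted totals rather than measured separately.

```python
# Back-of-the-envelope MTTR model using only the figures quoted in the mail.
# All names here are hypothetical; nothing below is an HBase API.

ZK_TIMEOUT_S = 30  # ZooKeeper session timeout considered in the tests


def expected_mttr(detection_s, recovery_s):
    """Total recovery time = failure detection + split/replay/assignment."""
    return detection_s + recovery_s


# The mail gives ~50s on 0.94 with detection after 30s, so the remaining
# split/replay/assignment work is roughly 20s (an inference from the totals).
RECOVERY_S = 50 - ZK_TIMEOUT_S

mttr_094 = expected_mttr(ZK_TIMEOUT_S, RECOVERY_S)  # detection waits for the ZK timeout
mttr_096 = expected_mttr(0, RECOVERY_S)             # detection is immediate

print(mttr_094, mttr_096)
```

With these assumptions the detection time accounts for more than half of the 0.94 total, which is in line with the remark that decreasing the detection time matters more and more.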
