Hi all,

Just an update on the work in progress on MTTR, looking at the impact of HDFS failures on HBase. There are some comments in HBASE-5843 for the pure HBase part.
1) Single node failure: a region server and its datanode.

I think the main global issue is the one mentioned in HDFS-3703: currently HBase starts its recovery before HDFS identifies a node as dead. When we have a dead region server, its regions are reassigned to other region servers. In other words:
- the HLog will be read to be split & replayed
- new files will be written (i.e. new blocks allocated) as the result of this split
- if the data locality of your cluster is good, you've just lost one third of the replicas of the hfiles for the regions you're migrating.

By default, HBase recovery starts after 3 minutes (it should be lower in production, likely 1 minute), while HDFS marks the datanode as dead only after 10 minutes 30 seconds. In the meantime, HDFS still returns the replicas on the dead datanode as valid, and proposes the dead datanode as a target for new writes. So:
- you get delays and errors when splitting the hlog
- the region servers receiving the regions get read errors on the hfiles as well (33% of the replicas are missing)
- you get write errors as well. For example, with one dead datanode out of 20 machines and a replication factor of 3, you will be directed to this dead datanode 15% of the time, per block. With 100 machines it's just 3%, still per block. But with 100 blocks to write, you will hit a write error 95% of the time. This error will be recovered, but it will slow the recovery by another minute if you were creating new hlog files.

So the recovery, for the client, if everything goes well (i.e. we're reasonably lucky) will be around:
- 1 minute (HBase failure detection time)
- 1 minute (timeout when reading the hlog to split it)
- 1 minute (timeout when writing the new hlogs)
- 1 minute (timeout when reading the old hfiles and getting read errors)

It can be less if the datanodes were not already holding TCP connections to the dead node, as the connect timeout is 20s. The last 3 steps disappear with HDFS-3703. The partial workarounds are HBASE-6435 and HDFS-3705.
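For those who want to check the write-error numbers above, here is a quick back-of-the-envelope calculation (assuming uniform random pipeline placement, replication factor 3, one dead datanode — a simplification of real HDFS block placement):

```python
def p_block_hits_dead_node(machines: int, replication: int = 3, dead: int = 1) -> float:
    """Chance that one block's write pipeline includes the dead datanode.

    Exact for a single dead node when pipelines are uniform random samples.
    """
    return dead * replication / machines

def p_any_write_error(machines: int, blocks: int, replication: int = 3) -> float:
    """Chance that at least one of `blocks` writes hits the dead datanode."""
    p = p_block_hits_dead_node(machines, replication)
    return 1 - (1 - p) ** blocks

print(round(p_block_hits_dead_node(20), 2))   # 0.15 -> 15% per block on 20 machines
print(round(p_block_hits_dead_node(100), 2))  # 0.03 -> 3% per block on 100 machines
print(round(p_any_write_error(100, 100), 2))  # 0.95 -> ~95% over 100 blocks
```

The per-block probabilities look small, but the failure probability compounds quickly over the many blocks written during a recovery.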
2) Today, even with a single node failure, you can hit HDFS-3701 / HBASE-6401 as well.

3) Large failure (multiple nodes)
3.1) If we've lost all the replicas for a block, well, we're in bad shape. Not really studied, but some logs are documented in HBASE-6626.
3.2) For the HLog, we monitor the number of replicas, and open a new file if there are missing replicas. After 5 attempts, we continue with a single replica, the one on the same machine as its region server, maximizing the risk. That's why I tend to think HDFS-3702 could be necessary; but I haven't found another use case than a write-ahead log from an HDFS point of view. HDFS-3703 lowers the probability of being directed multiple times to bad datanodes.
3.3) Impact on memstore flush: not studied yet.
3.4) Impact of an HA namenode switch: I guess it's part of namenode HA. Not studied.

4) Testing
HW failures are difficult to simulate. When you kill -9 a process, the sockets are closed by the operating system, and the client is notified. That's not the same thing as a really dead machine, which won't reply to nor send IP packets at all. The only "simple" way I found to test this is to unplug the network cable. I haven't tried with virtualization, and we don't have anything within the HBase minicluster to do a proper test (it would require specific hooks I think, if it is even possible). I don't know if they have something in HDFS.

That's all folks :-)

N.
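PS: for clarity, the 3.2) behaviour boils down to this decision, sketched here in Python (the names and shape are mine for illustration, not the actual HBase API):

```python
# Sketch of the HLog roll-on-missing-replicas logic described in 3.2):
# roll to a new file while replicas are missing, and after 5 attempts
# give up and continue with whatever replicas we have (possibly just 1).

MAX_ROLL_ATTEMPTS = 5  # matches the "after 5 attempts" above

def should_roll_hlog(current_replicas: int, expected_replicas: int, attempts: int) -> bool:
    """Return True if the writer should roll to a new HLog file."""
    if current_replicas >= expected_replicas:
        return False                     # pipeline is healthy, keep writing
    return attempts < MAX_ROLL_ATTEMPTS  # still missing replicas: retry, then give up

print(should_roll_hlog(3, 3, 0))  # False: all replicas present
print(should_roll_hlog(1, 3, 2))  # True: replicas missing, still retrying
print(should_roll_hlog(1, 3, 5))  # False: give up, continue on a single replica
```

The risky part is the last case: after giving up, the only remaining replica can be on the same machine as the region server, which is exactly the machine most likely to fail with it.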
