Hi all,

Just an update on the work in progress on MTTR, looking at the impact of HDFS failures on HBase. There are some comments in HBASE-5843 for the pure HBase part.
1) Single node failure: a region server and its datanode.

I think the main global issue is the one mentioned in HDFS-3703: currently HBase starts its recovery before HDFS identifies a node as dead. When we have a dead region server, its regions are reassigned to other region servers. In other words:
- the HLog will be read to be split & replayed
- new files will be written (i.e. new blocks allocated) as the result of this split
- if the data locality of your cluster is good, you've just lost one third of the replicas of the hfiles for the regions you're migrating.

By default, HBase recovery starts after 3 minutes (it should be lower in production, likely 1 minute), while HDFS marks the datanode as dead only after 10 minutes 30 seconds. In the meantime, HDFS still returns the replicas on the dead datanode as valid, and proposes the dead datanode as a target for new writes. So:
- you get delays and errors when splitting the hlog
- the region servers receiving the regions get read errors on the hfiles as well (33% of the replicas are missing)
- you get write errors as well. For example, with one dead datanode out of 20 machines and a replication factor of 3, you will be directed to this dead datanode 15% of the time, per block. With 100 machines it's just 3%, still per block. But with 100 blocks to write, you will hit a write error 95% of the time. This error will be recovered, but it will slow the recovery by another minute if you were creating new hlog files.

So the recovery, for the client, if everything goes well (i.e. we're reasonably lucky) will be around:
- 1 minute (HBase failure detection time)
- 1 minute (timeout when reading the hlog to split it)
- 1 minute (timeout when writing the new hlogs)
- 1 minute (timeout when reading the old hfiles and getting read errors)

It can be less if the datanodes were not already holding TCP connections to the dead node, as the connect timeout is 20s. The last 3 steps disappear with HDFS-3703. The partial workarounds are HBASE-6435 and HDFS-3705.
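For those who want to check the write-error numbers above, here is a quick back-of-the-envelope calculation (assuming uniform random pipeline placement, replication factor 3, one dead datanode — a simplification of real HDFS block placement):

```python
def p_block_hits_dead_node(machines: int, replication: int = 3, dead: int = 1) -> float:
    """Chance that one block's write pipeline includes the dead datanode.

    Exact for a single dead node when pipelines are uniform random samples.
    """
    return dead * replication / machines

def p_any_write_error(machines: int, blocks: int, replication: int = 3) -> float:
    """Chance that at least one of `blocks` writes hits the dead datanode."""
    p = p_block_hits_dead_node(machines, replication)
    return 1 - (1 - p) ** blocks

print(round(p_block_hits_dead_node(20), 2))   # 0.15 -> 15% per block on 20 machines
print(round(p_block_hits_dead_node(100), 2))  # 0.03 -> 3% per block on 100 machines
print(round(p_any_write_error(100, 100), 2))  # 0.95 -> ~95% over 100 blocks
```

The per-block probabilities look small, but the failure probability compounds quickly over the many blocks written during a recovery.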
2) Today, even with a single node failure, you can hit HDFS-3701 / HBASE-6401 as well.

3) Large failure (multiple nodes)
3.1) If we've lost all the replicas for a block, well, we're in bad shape. Not really studied, but some logs are documented in HBASE-6626.
3.2) For the HLog, we monitor the number of replicas, and open a new file if there are missing replicas. After 5 attempts, we continue with a single replica, the one on the same machine as its region server, maximizing the risk. That's why I tend to think HDFS-3702 could be necessary; but I haven't found another use case than a write-ahead log from an HDFS point of view. HDFS-3703 lowers the probability of being directed multiple times to bad datanodes.
3.3) Impact on memstore flush: not studied yet.
3.4) Impact of an HA namenode switch: I guess it's part of namenode HA. Not studied.

4) Testing
HW failures are difficult to simulate. When you kill -9 a process, the sockets are closed by the operating system, and the client is notified. That's not the same thing as a really dead machine, which won't reply to nor send IP packets at all. The only "simple" way I found to test this is to unplug the network cable. I haven't tried with virtualization, and we don't have anything within the HBase minicluster to do a proper test (it would require specific hooks I think, if it is even possible). I don't know if they have something in HDFS.

That's all folks :-)

N.
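PS: for clarity, the 3.2) behaviour boils down to this decision, sketched here in Python (the names and shape are mine for illustration, not the actual HBase API):

```python
# Sketch of the HLog roll-on-missing-replicas logic described in 3.2):
# roll to a new file while replicas are missing, and after 5 attempts
# give up and continue with whatever replicas we have (possibly just 1).

MAX_ROLL_ATTEMPTS = 5  # matches the "after 5 attempts" above

def should_roll_hlog(current_replicas: int, expected_replicas: int, attempts: int) -> bool:
    """Return True if the writer should roll to a new HLog file."""
    if current_replicas >= expected_replicas:
        return False                     # pipeline is healthy, keep writing
    return attempts < MAX_ROLL_ATTEMPTS  # still missing replicas: retry, then give up

print(should_roll_hlog(3, 3, 0))  # False: all replicas present
print(should_roll_hlog(1, 3, 2))  # True: replicas missing, still retrying
print(should_roll_hlog(1, 3, 5))  # False: give up, continue on a single replica
```

The risky part is the last case: after giving up, the only remaining replica can be on the same machine as the region server, which is exactly the machine most likely to fail with it.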
