[
https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450718#comment-13450718
]
nkeywal commented on HBASE-5843:
--------------------------------
Some tests and analysis around Distributed Split / datanode failures.
On a real cluster, 3 nodes.
- dfs.replication = 2
- Local HD (the test failed when run on a RAM drive).
- Start with 2 DNs and 2 RSs. Create a table with 100 regions on the second RS;
the first RS holds meta & root.
- Insert 1M or 10M rows, distributed over all regions. That creates 1 (for 1M) or
8 (for 10M) log files of ~60MB each, on a single server.
- Start another box with a DN and a RS. This box is empty (no regions, no
blocks).
- Unplug (physically) the box with the 100 regions and the 1 (for 1M puts) or 8
(for 10M puts) log files.
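For concreteness, here is a minimal sketch of the load phase described above, written
against the 0.94-era client API. The table name, column family and row-key layout are
assumptions, not the exact test code; dfs.replication=2 is an hdfs-site.xml setting on
the cluster side and is not shown here.
{code:java}
// Hedged sketch of the load phase (0.94-era client API). Table name, family and
// row-key layout are assumptions, not the exact test code.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class MttrLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // Table pre-split into 100 regions; in the test they are all on the second RS.
    HTableDescriptor desc = new HTableDescriptor("mttr_test");
    desc.addFamily(new HColumnDescriptor("f"));
    admin.createTable(desc, Bytes.toBytes("row0000000"), Bytes.toBytes("row0999999"), 100);

    // 1M (or 10M) puts spread over all regions; this is what fills the ~60MB WAL files.
    HTable table = new HTable(conf, "mttr_test");
    table.setAutoFlush(false);
    for (long i = 0; i < 1000000L; i++) {
      Put p = new Put(Bytes.toBytes(String.format("row%07d", i)));
      p.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes(i));
      table.put(p);
    }
    table.flushCommits();
    table.close();
    admin.close();
  }
}
{code}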
Durations are in seconds, with HDFS 1.0.3 unless stated otherwise.
1M puts on 0.94:
~180s detection time, sometimes around 150s
~130s split time (there is a single file to split. This is to be compared to
the 10s per split above)
~180s assignment, including replaying edits. There could be some locks, as we're
reassigning/replaying 50 regions per server.
1M puts on 0.96: 3 tests, one failure.
~180s detection time, sometimes around 150s
~180s split time. Once again there is a single file to split. It's unclear why it
takes longer than on 0.94.
~180s assignment, as 0.94.
Out of 3 tests, it failed once on 0.96. It didn't fail on 0.94.
10M puts on 0.96 + HDFS branch 2 as of today
~180s detection time, sometimes around 150s
~11 minutes split. Basically it keeps failing until the HDFS namenode marks the
datanode as dead. That takes 7:30 minutes, so the split finishes only after this.
~60s assignment? Tested only once.
0M (zero) puts on 0.96 + HDFS branch 2 as of today
~180s detection time, sometimes around 150s
~0s split.
~3s assignment (This seems to say that the assignment time is spent in the edit
replay.)
10M puts on 0.96 + HDFS branch 2 + HDFS-3703 full (read + write paths)
~180s detection time, sometimes around 150s
~150s split, for a bad reason: all tasks except one succeed. The last one seems to
connect to the dead server, and only finishes after ~100s. Tested twice.
~50s assignment. Measured once.
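A note on the recurring ~180s: it is consistent with the default ZooKeeper session
timeout of that era (zookeeper.session.timeout = 180000 ms), since the master only
notices the failure once the regionserver's ZK session expires. A minimal, hedged
sketch of where that knob lives; the 30s value is purely illustrative and has to stay
above the worst GC pause you expect.
{code:java}
// Hedged sketch: the knob behind the ~180s detection time. In practice this goes
// into hbase-site.xml on the region servers; 30s is illustrative only and must stay
// above the worst GC pause you expect, otherwise you get false failure detections.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;

public class DetectionTimeout {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Default in the 0.94/0.96 timeframe is 180000 ms, which lines up with the
    // ~180s detection times measured above.
    conf.setInt(HConstants.ZK_SESSION_TIMEOUT, 30 * 1000);
    // Assumption: with an HBase-managed ZooKeeper, the server-side cap of
    // 20 * tickTime must also allow the requested timeout.
    conf.setInt("hbase.zookeeper.property.tickTime", 2000);
    System.out.println("zookeeper.session.timeout = "
        + conf.getInt(HConstants.ZK_SESSION_TIMEOUT, -1) + " ms");
  }
}
{code}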
So:
- The assignment measurements are fishy, but they seem to say that we are now
spending our time replaying edits. We could have HDFS-related issues here as well:
in the last two scenarios we're not going to the dead node when we replay/flush
edits, so that could be the reason.
- The split runs in ~10s per 60Gb, on a single and slow HD. With a reasonable
cluster, this should scale pretty well. We could improve things by using locality.
- There will be datanode errors if you don't have HDFS-3703, and in this case it
becomes complicated. See HBASE-6738 and the config sketch after this list.
- With HDFS-3703, we're 500s faster. That's interesting.
- Even with HDFS-3703 there is still something to look at in how HDFS connects to
the dead node. It seems the block is empty, so it is retried multiple times. There
are multiple possible paths here.
- We can expect, in production, from a server-side point of view:
- 30s detection time for hw failure, 0s for simpler cases (kill -9, OOM,
machine nicely rebooted, ...)
- 10s split (i.e: distributed along multiple region servers)
- 10s assignment (i.e. distributed as well).
- Without HDFS effects here. See above.
- This scenario is extreme, as we're losing 50% of our data. Still, if you're
losing a regionserver with 300 regions, the split may not go well if you're not
lucky.
- It also means that the detection time dominates the other parameters when
everything goes well.
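For reference, the HDFS-3703 behaviour discussed above is controlled by the
namenode's stale-datanode settings. A hedged sketch of what they look like in later
Hadoop 2 releases (the exact key names in the branch-2 snapshot tested here may
differ); in practice these go into hdfs-site.xml on the namenode.
{code:java}
// Hedged sketch of the stale-datanode settings descending from HDFS-3703, as they
// appear in later Hadoop 2 releases; key names in the exact branch-2 snapshot tested
// above may differ. In practice these are set in hdfs-site.xml on the namenode.
import org.apache.hadoop.conf.Configuration;

public class StaleNodeConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Mark a datanode "stale" after 30s without a heartbeat, instead of waiting
    // the many minutes it takes to declare it dead.
    conf.setLong("dfs.namenode.stale.datanode.interval", 30 * 1000L);
    // Read path: de-prioritize stale datanodes when ordering block locations.
    conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
    // Write path: avoid stale datanodes when choosing targets for new blocks.
    conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);
  }
}
{code}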
Conclusion:
- The HDFS / HBase link plays the critical role in this scenario. HDFS-3703 is one
of the keys.
- The Distributed Split seems to work well in terms of performance.
- Assignment itself seems ok. Replaying should be looked at (more in terms of
locks than raw performance).
- Detection time will become more and more important.
- An improvement would be to reassign the regions in parallel with the split, with the ability to:
- continue to serve writes before the end of the split: the fact that we're
splitting the logs does not mean we cannot write. There are real applications that
could use this (maybe OpenTSDB for example; any application that logs data just
needs to know where to write).
- continue to serve reads if they are time-ranged with a max timestamp before the
failure (see the sketch after this list): there are many applications that don't
need fresh data (i.e. less than 1 minute old).
- With this, the downtime will be totally dominated by the detection time.
- There are JIRAs around the detection time already (basically: improve ZK and
open HBase to external monitoring systems).
- There will be some work around the client part.
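Here is a minimal sketch of the time-ranged read evoked in the conclusion, using the
plain client API; the table/family names and the way the failure time is estimated
are assumptions. Today such a scan still waits for the recovery to complete; the
point is that, in principle, it could be served before the edits are replayed.
{code:java}
// Minimal sketch of a time-ranged read that only asks for data strictly older than
// the (estimated) failure time. Plain 0.94-era client API; table/family names and
// the failure-time estimate are assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class StaleReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mttr_test");

    long failureTs = System.currentTimeMillis() - 60 * 1000L; // assumed failure time
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("f"));
    // Only cells with a timestamp strictly before the failure: such a read could in
    // principle be served without waiting for the log split / edit replay.
    scan.setTimeRange(0L, failureTs);

    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      // process r ...
    }
    scanner.close();
    table.close();
  }
}
{code}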
> Improve HBase MTTR - Mean Time To Recover
> -----------------------------------------
>
> Key: HBASE-5843
> URL: https://issues.apache.org/jira/browse/HBASE-5843
> Project: HBase
> Issue Type: Umbrella
> Affects Versions: 0.96.0
> Reporter: nkeywal
> Assignee: nkeywal
>
> A part of the approach is described here:
> https://docs.google.com/document/d/1z03xRoZrIJmg7jsWuyKYl6zNournF_7ZHzdi0qz_B4c/edit
> The ideal target is:
> - failures impact client applications only by an added delay to execute a
> query, whatever the failure.
> - this delay is always less than 1 second.
> We're not going to achieve that immediately...
> Priority will be given to the most frequent issues.
> Short term:
> - software crash
> - standard administrative tasks such as stop/start of a cluster.