[
https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450718#comment-13450718
]
nkeywal commented on HBASE-5843:
--------------------------------
Some tests and analysis around Distributed Split / datanode failures.
On a real cluster, 3 nodes.
- dfs.replication = 2
- Local HD (the test failed when run on a RAM drive).
- Start with 2 DNs and 2 RSs. Create a table with 100 regions on the second RS;
the first RS holds meta & root.
- Insert 1M or 10M rows, distributed over all regions. That creates 1 (for 1M) or
8 (for 10M) log files of ~60MB each, on a single server.
- Start another box with a DN and a RS. This box is empty (no regions, no
blocks).
- Unplug (physically) the box with the 100 regions and the 1 (for 1M puts) or 8
(for 10M puts) log files.
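For concreteness, here is a minimal sketch of the load phase described above, written
against the 0.94-era client API. The table name, column family and row-key layout are
assumptions, not the exact test code; dfs.replication=2 is an hdfs-site.xml setting on
the cluster side and is not shown here.
{code:java}
// Hedged sketch of the load phase (0.94-era client API). Table name, family and
// row-key layout are assumptions, not the exact test code.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class MttrLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // Table pre-split into 100 regions; in the test they are all on the second RS.
    HTableDescriptor desc = new HTableDescriptor("mttr_test");
    desc.addFamily(new HColumnDescriptor("f"));
    admin.createTable(desc, Bytes.toBytes("row0000000"), Bytes.toBytes("row0999999"), 100);

    // 1M (or 10M) puts spread over all regions; this is what fills the ~60MB WAL files.
    HTable table = new HTable(conf, "mttr_test");
    table.setAutoFlush(false);
    for (long i = 0; i < 1000000L; i++) {
      Put p = new Put(Bytes.toBytes(String.format("row%07d", i)));
      p.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes(i));
      table.put(p);
    }
    table.flushCommits();
    table.close();
    admin.close();
  }
}
{code}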
Durations are in seconds, with HDFS 1.0.3 unless stated otherwise.
1M puts on 0.94:
~180s detection time, sometimes around 150s
~130s split time (there is a single file to split. This is to be compared to
the 10s per split above)
~180s assignment, including replaying edits. There could be some locks, as we're
reassigning/replaying 50 regions per server.
1M puts on 0.96: 3 tests, one failure.
~180s detection time, sometimes around 150s
~180s split time. Once again there is a single file to split. It's unclear why it
takes longer than on 0.94.
~180s assignment, as 0.94.
Out of 3 tests, it failed once on 0.96. It didn't fail on 0.94.
10M puts on 0.96 + HDFS branch 2 as of today
~180s detection time, sometimes around 150s
~11 minutes split. Basically it keeps failing until the HDFS namenode marks the
datanode as dead. That takes 7:30 minutes, so the split finishes only after this.
~60s assignment? Tested only once.
0M (zero) puts on 0.96 + HDFS branch 2 as of today
~180s detection time, sometimes around 150s
~0s split.
~3s assignment (This seems to say that the assignment time is spent in the edit
replay.)
10M puts on 0.96 + HDFS branch 2 + HDFS-3703 full (read + write paths)
~180s detection time, sometimes around 150s
~150s split, for a bad reason: all tasks except one succeed. The last one seems to
connect to the dead server, and only finishes after ~100s. Tested twice.
~50s assignment. Measured once.
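A note on the recurring ~180s: it is consistent with the default ZooKeeper session
timeout of that era (zookeeper.session.timeout = 180000 ms), since the master only
notices the failure once the regionserver's ZK session expires. A minimal, hedged
sketch of where that knob lives; the 30s value is purely illustrative and has to stay
above the worst GC pause you expect.
{code:java}
// Hedged sketch: the knob behind the ~180s detection time. In practice this goes
// into hbase-site.xml on the region servers; 30s is illustrative only and must stay
// above the worst GC pause you expect, otherwise you get false failure detections.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;

public class DetectionTimeout {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Default in the 0.94/0.96 timeframe is 180000 ms, which lines up with the
    // ~180s detection times measured above.
    conf.setInt(HConstants.ZK_SESSION_TIMEOUT, 30 * 1000);
    // Assumption: with an HBase-managed ZooKeeper, the server-side cap of
    // 20 * tickTime must also allow the requested timeout.
    conf.setInt("hbase.zookeeper.property.tickTime", 2000);
    System.out.println("zookeeper.session.timeout = "
        + conf.getInt(HConstants.ZK_SESSION_TIMEOUT, -1) + " ms");
  }
}
{code}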
So:
- The assignment measurements are fishy, but they seem to say that we are now
spending our time replaying edits. We could have HDFS-related issues here as well:
in the last two scenarios we're not going to the dead node when we replay/flush
edits, so that could be the reason.
- The split runs in ~10s per 60Gb, on a single and slow HD. With a reasonable
cluster, this should scale pretty well. We could improve things by using locality.
- There will be datanode errors if you don't have HDFS-3703, and in this case it
becomes complicated. See HBASE-6738 and the config sketch after this list.
- With HDFS-3703, we're 500s faster. That's interesting.
- Even with HDFS-3703 there is still something to look at in how HDFS connects to
the dead node. It seems the block is empty, so it is retried multiple times. There
are multiple possible paths here.
- We can expect, in production, from a server-side point of view:
- 30s detection time for hw failure, 0s for simpler cases (kill -9, OOM,
machine nicely rebooted, ...)
- 10s split (i.e: distributed along multiple region servers)
- 10s assignment (i.e. distributed as well).
- Without HDFS effects here. See above.
- This scenario is extreme, as we're losing 50% of our data. Still, if you're
losing a regionserver with 300 regions, the split may not go well if you're not
lucky.
- It also means that the detection time dominates the other parameters when
everything goes well.
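For reference, the HDFS-3703 behaviour discussed above is controlled by the
namenode's stale-datanode settings. A hedged sketch of what they look like in later
Hadoop 2 releases (the exact key names in the branch-2 snapshot tested here may
differ); in practice these go into hdfs-site.xml on the namenode.
{code:java}
// Hedged sketch of the stale-datanode settings descending from HDFS-3703, as they
// appear in later Hadoop 2 releases; key names in the exact branch-2 snapshot tested
// above may differ. In practice these are set in hdfs-site.xml on the namenode.
import org.apache.hadoop.conf.Configuration;

public class StaleNodeConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Mark a datanode "stale" after 30s without a heartbeat, instead of waiting
    // the many minutes it takes to declare it dead.
    conf.setLong("dfs.namenode.stale.datanode.interval", 30 * 1000L);
    // Read path: de-prioritize stale datanodes when ordering block locations.
    conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
    // Write path: avoid stale datanodes when choosing targets for new blocks.
    conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);
  }
}
{code}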
Conclusion:
- The HDFS / HBase link plays the critical role in this scenario. HDFS-3703 is one
of the keys.
- The Distributed Split seems to work well in terms of performance.
- Assignment itself seems ok. Replaying should be looked at (more in terms of
locks than raw performance).
- Detection time will become more and more important.
- An improvement would be to reassign the regions in parallel with the split, with the ability to:
- continue to serve writes before the end of the split: the fact that we're
splitting the logs does not mean we cannot write. There are real applications that
could use this (maybe OpenTSDB for example; any application that logs data just
needs to know where to write).
- continue to serve reads if they are time-ranged with a max timestamp before the
failure (see the sketch after this list): there are many applications that don't
need fresh data (i.e. less than 1 minute old).
- With this, the downtime will be totally dominated by the detection time.
- There are JIRAs around the detection time already (basically: improve ZK and
open HBase to external monitoring systems).
- There will be some work around the client part.
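Here is a minimal sketch of the time-ranged read evoked in the conclusion, using the
plain client API; the table/family names and the way the failure time is estimated
are assumptions. Today such a scan still waits for the recovery to complete; the
point is that, in principle, it could be served before the edits are replayed.
{code:java}
// Minimal sketch of a time-ranged read that only asks for data strictly older than
// the (estimated) failure time. Plain 0.94-era client API; table/family names and
// the failure-time estimate are assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class StaleReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mttr_test");

    long failureTs = System.currentTimeMillis() - 60 * 1000L; // assumed failure time
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("f"));
    // Only cells with a timestamp strictly before the failure: such a read could in
    // principle be served without waiting for the log split / edit replay.
    scan.setTimeRange(0L, failureTs);

    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      // process r ...
    }
    scanner.close();
    table.close();
  }
}
{code}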
> Improve HBase MTTR - Mean Time To Recover
> -----------------------------------------
>
> Key: HBASE-5843
> URL: https://issues.apache.org/jira/browse/HBASE-5843
> Project: HBase
> Issue Type: Umbrella
> Affects Versions: 0.96.0
> Reporter: nkeywal
> Assignee: nkeywal
>
> A part of the approach is described here:
> https://docs.google.com/document/d/1z03xRoZrIJmg7jsWuyKYl6zNournF_7ZHzdi0qz_B4c/edit
> The ideal target is:
> - failures impact client applications only by an added delay to execute a
> query, whatever the failure.
> - this delay is always less than 1 second.
> We're not going to achieve that immediately...
> Priority will be given to the most frequent issues.
> Short term:
> - software crash
> - standard administrative tasks such as stop/start of a cluster.