[ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409663#comment-13409663 ]

nkeywal commented on HBASE-5843:
--------------------------------

Some tests results:

I tested the following scenarios on a local machine: a pseudo-distributed
cluster with ZooKeeper and HBase writing to a RAM drive, no datanode or
namenode, 2 region servers, and one empty table with 10000 regions, 5K on
each RS. Versions were taken on Monday the 2nd.
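
As a side note, a pre-split table like this can be created programmatically.
Below is a minimal sketch using the client API of that era; the table and
family names ("mttr_test", "f") are hypothetical, not necessarily the ones
used in the test:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CreatePreSplitTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        // Hypothetical table and family names, for illustration only.
        HTableDescriptor desc = new HTableDescriptor("mttr_test");
        desc.addFamily(new HColumnDescriptor("f"));
        // Pre-split into 10000 regions over a fixed-width numeric key range.
        admin.createTable(desc, Bytes.toBytes("0000000000"),
            Bytes.toBytes("9999999999"), 10000);
      }
    }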

1) Clean stop of one RS; wait for all regions to become online again:
0.92: ~800 seconds
0.96: ~13 seconds

=> Huge improvement, presumably from changes such as HBASE-5970 and HBASE-6109.

1.1) As 1), but with 2GB of memory per server
Results as in 1)

=> The results don't depend on GC behaviour (reported memory usage is around 200MB)
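
The "wait for all regions to become online" time in these scenarios can be
measured by polling the cluster status. A rough sketch of such a loop,
assuming the HBaseAdmin API of that era and the 10000-region count of this
setup:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.ClusterStatus;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class WaitForRegions {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        // The test table's regions; -ROOT-/.META. would add a couple more.
        final int expectedRegions = 10000;
        long start = System.currentTimeMillis();
        while (true) {
          ClusterStatus status = admin.getClusterStatus();
          if (status.getRegionsCount() >= expectedRegions) {
            break;
          }
          Thread.sleep(100);
        }
        System.out.println("All regions online after "
            + (System.currentTimeMillis() - start) + " ms");
      }
    }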


2) Kill -9 of a RS; wait for all regions to become online again:
0.92: 980s
0.96: ~13s

=> The 180s gap relative to 1) comes from HBASE-5844. For the master, 
HBASE-5926 was not tested but should bring similar results.



3) Start of the cluster after a clean stop; wait for all regions to
become online.
0.92: ~1020s
0.94: ~1023s (tested once only)
0.96: ~31s

=> The benefit is also visible at startup
=> The improvement does not come from anything implemented in 0.94



4) As 3), but with HBase on a local HD
0.92: ~1044s (tested once only)
0.96: ~28s (tested once only)

=> Similar results. It seems HBase I/O was not, and is not becoming, the 
bottleneck.


5) As 1), but with 4 RS instead of 2
0.92: 406s
0.96: 6s

=> Twice as fast in both cases. Both versions scale with the number of RS in 
this minimalistic test.



6) As 3), but with ZK on a local HD
Impossible to get consistent results here; they were machine and test dependent.
The most credible result was similar to 2).
From the ZK mailing list and ZOOKEEPER-866, it seems this is what we should expect.



7) With 2 RS, insert 20M simple puts, then kill -9 the second RS. See how long 
it takes to have all the regions available again.
0.92: 180s detection time, then hung in 2 tests out of 2.
0.96: 14s (hung in 1 test out of 3)

=> There's a bug ;-)
=> The test needs to be changed to show a real difference when the WAL must be 
replayed.
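
For context, the load phase of 7) is a plain put loop. A minimal sketch with
the old HTable client API, client-side batching enabled; the table and family
names are the same hypothetical ones as above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SimplePuts {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mttr_test");
        table.setAutoFlush(false); // batch puts on the client side
        byte[] family = Bytes.toBytes("f");
        byte[] qualifier = Bytes.toBytes("q");
        // 20M puts over the same fixed-width key range as the pre-split.
        for (long i = 0; i < 20000000L; i++) {
          Put put = new Put(Bytes.toBytes(String.format("%010d", i)));
          put.add(family, qualifier, Bytes.toBytes(i));
          table.put(put);
        }
        table.flushCommits();
        table.close();
      }
    }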

                
> Improve HBase MTTR - Mean Time To Recover
> -----------------------------------------
>
>                 Key: HBASE-5843
>                 URL: https://issues.apache.org/jira/browse/HBASE-5843
>             Project: HBase
>          Issue Type: Umbrella
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>
> A part of the approach is described here: 
> https://docs.google.com/document/d/1z03xRoZrIJmg7jsWuyKYl6zNournF_7ZHzdi0qz_B4c/edit
> The ideal target is:
> - failures impact client applications only as an added delay to execute a 
> query, whatever the failure.
> - this delay is always under 1 second.
> We're not going to achieve that immediately...
> Priority will be given to the most frequent issues.
> Short term:
> - software crash
> - standard administrative tasks such as stopping/starting a cluster.
