[ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418863#comment-13418863 ]
Gregory Chanan commented on HBASE-5843:
---------------------------------------
Looks great so far, nkeywal.
Some questions:
{quote}
2) Kill -9 of a RS; wait for all regions to become online again:
0.92: 980s
0.96: ~13s
=> The 180s gap comes from HBASE-5844. For master, HBASE-5926 is not tested but
should bring similar results.
{quote}
I'm confused as to what the 180s gap refers to. I see 980 (test 2) - 800
(test 1) = 180, but that is against 0.92, which doesn't have HBASE-5970, right?
Could you clarify?
{quote}
3) Start of the cluster after a clean stop; wait for all regions to
become online.
0.92: ~1020s
0.94: ~1023s (tested once only)
0.96: ~31s
=> The benefit is visible at startup
=> This does not come from something implemented for 0.94
{quote}
Awesome. Do we think this is also due to HBASE-5970 and HBASE-6109? (I
assume HBASE-5844 and HBASE-5926 do not apply in this case.)
{quote}
7) With 2 RS, Insert 20M simple puts; then kill -9 the second one. See how long
it takes to have all the regions available.
0.92) 180s detection time+ then hangs twice out of 2 tests.
0.96) 14s (hangs once out of 3)
=> There's a bug
{quote}
Has a JIRA been filed?
{quote}
Test to be changed to get a real difference when we need to replay the wal.
{quote}
Could you clarify what you mean here?
> Improve HBase MTTR - Mean Time To Recover
> -----------------------------------------
>
> Key: HBASE-5843
> URL: https://issues.apache.org/jira/browse/HBASE-5843
> Project: HBase
> Issue Type: Umbrella
> Affects Versions: 0.96.0
> Reporter: nkeywal
> Assignee: nkeywal
>
> A part of the approach is described here:
> https://docs.google.com/document/d/1z03xRoZrIJmg7jsWuyKYl6zNournF_7ZHzdi0qz_B4c/edit
> The ideal target is:
> - a failure impacts client applications only by an added delay to execute a
> query, whatever the failure.
> - this delay is always less than 1 second.
> We're not going to achieve that immediately...
> Priority will be given to the most frequent issues.
> Short term:
> - software crash
> - standard administrative tasks such as stop/start of a cluster.