[ 
https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476563#comment-13476563
 ] 

Gregory Chanan commented on HBASE-5843:
---------------------------------------

One other question: could you explain these numbers in more detail?

{quote}
The split in 10s per 60Gb, on a single and slow HD. With a reasonable cluster, 
this should scale pretty well. We could improve things by using locality.
{quote}
Should this be 10s per 60 Mb, which is the size of a log ("creates 8 log files 
of 60 Mb each")?
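
A quick sanity check on the units, as a rough sketch only. It assumes "Mb" and 
"Gb" mean megabytes and gigabytes, that 1 GB = 1000 MB, and that a split is 
bound by sequential disk throughput; the function name is purely illustrative.

```python
def throughput_mb_per_s(size_mb: float, seconds: float) -> float:
    """Average throughput implied by processing size_mb in the given time."""
    return size_mb / seconds


# Reading 1: 10s per 60 GB (assuming 1 GB = 1000 MB).
per_60_gb = throughput_mb_per_s(60 * 1000, 10)

# Reading 2: 10s per 60 MB, i.e. one log file.
per_60_mb = throughput_mb_per_s(60, 10)

print(f"10s per 60 GB implies {per_60_gb:.0f} MB/s")
print(f"10s per 60 MB implies {per_60_mb:.0f} MB/s")
```

The first reading implies 6000 MB/s, which seems out of reach for a single slow 
HD, while the second implies 6 MB/s.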

{quote}
Insert 1M or 10M rows, distributed on all regions. That creates 8 logs files of 
60Mb each, on a single server.
{quote}
Are the number and size of log files independent of the number of rows?  Is the 
size of a row in the 10M setup 1/10 the size of a row in the 1M setup?

{quote}
We can expect, in production, server side point of view
30s detection time for hw failure, 0s for simpler case (kill -9, OOM, machine 
nicely rebooted, ...)
10s split (i.e: distributed along multiple region servers)
10s assignment (i.e. distributed as well).
{quote}

For detection: In all the tests where we are not deleting the znode, either 
because we are on 0.94 or because of a hw failure, the detection time is 180s.  
Why is this listed as 30s?
For split: I think I understand this.  8 log files, 60 Mb each means 8 split 
tasks on 4 regionservers = 2 split tasks per RS.  This took ~25 seconds, so if 
we had enough RS for each RS to only do 1 split task, we'd be done in about 10s.
For assignment: Not sure where this number is coming from.  I see ~30s for 
assignment and we know this would go faster with more RS, but how fast?
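
The split estimate above can be written out as a small sketch. It assumes one 
split task per log file, that tasks on the same regionserver run one after 
another, that every task takes the same time, and that coordination overhead 
is negligible; none of this is confirmed by the test results, and the helper 
name is made up for illustration.

```python
import math


def estimate_split_seconds(log_files: int, region_servers: int,
                           seconds_per_task: float) -> float:
    """Estimate wall-clock split time under the assumptions stated above."""
    # The busiest regionserver handles ceil(files / servers) tasks in sequence.
    tasks_on_busiest_rs = math.ceil(log_files / region_servers)
    return tasks_on_busiest_rs * seconds_per_task


# Observed: 8 log files on 4 regionservers took ~25s, i.e. 2 tasks per RS.
observed_seconds = 25.0
seconds_per_task = observed_seconds / math.ceil(8 / 4)

# With 8 regionservers, each one handles a single split task.
one_task_per_rs = estimate_split_seconds(8, 8, seconds_per_task)
print(f"per task: {seconds_per_task}s, with 8 RS: {one_task_per_rs}s")
```

Under these assumptions the estimate is 12.5s, in line with "about 10s". Adding 
regionservers beyond the number of log files would not help in this model.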

                
> Improve HBase MTTR - Mean Time To Recover
> -----------------------------------------
>
>                 Key: HBASE-5843
>                 URL: https://issues.apache.org/jira/browse/HBASE-5843
>             Project: HBase
>          Issue Type: Umbrella
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>
> A part of the approach is described here: 
> https://docs.google.com/document/d/1z03xRoZrIJmg7jsWuyKYl6zNournF_7ZHzdi0qz_B4c/edit
> The ideal target is:
> - failure impact client applications only by an added delay to execute a 
> query, whatever the failure.
> - this delay is always inferior to 1 second.
> We're not going to achieve that immediately...
> Priority will be given to the most frequent issues.
> Short term:
> - software crash
> - standard administrative tasks as stop/start of a cluster.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
