[
https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476582#comment-13476582
]
nkeywal commented on HBASE-5843:
--------------------------------
bq. Should this be 10s per 60 Mb, which is the size of a log ("creates 8 log files
of 60 Mb each")?
Yes :-).
bq. The number and size of log files is independent of the number of rows? Is
the size of a row in the 10M setup 1/10 the size of the row in the 1M setup?
No / ~Yes. There is, however, a maximum number of log files before a flush.
bq. on 0.94 or hw failure, the detection time is 180s. Why is this listed as
30 sec?
It's discussed very briefly in HBASE-5844: it was bumped recently from 60s to
180s to help newcomers. The ZK default is 30s, which is imho the best choice
for someone requiring a reasonable MTTR without too much risk.
bq. For split: I think I understand this. 8 log files, 60 Mb each means 8 split
tasks on 4 regionservers = 2 split tasks per RS. This took ~25 seconds, so if
we had enough RS for each RS to only do 1 split task, we'd be done in about 10s.
Yes, exactly. It's a reasonable expectation imho, even if there will be other
cases.
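The arithmetic behind that expectation can be sketched as below. This is only a
back-of-the-envelope model, not how HBase actually schedules split tasks: it
assumes each regionserver works through its split tasks one after another and
that every 60 Mb log takes about the same time. The function name and the 10s
per log figure are illustrative assumptions taken from the discussion above.

```python
import math


def estimate_split_seconds(log_files: int, region_servers: int,
                           seconds_per_log: float) -> float:
    """Rough estimate of log-split time (hypothetical helper).

    Assumes tasks are spread evenly over the regionservers and that each
    server handles its tasks sequentially, so the total time is the number
    of "rounds" times the time to split one log.
    """
    if log_files < 0 or region_servers <= 0 or seconds_per_log < 0:
        raise ValueError("invalid cluster description")
    # The busiest server gets ceil(files / servers) tasks.
    rounds = math.ceil(log_files / region_servers)
    return rounds * seconds_per_log


if __name__ == "__main__":
    # Test setup from the discussion: 8 logs on 4 regionservers, 2 tasks each.
    print(estimate_split_seconds(8, 4, 10.0))
    # Enough regionservers for one task each.
    print(estimate_split_seconds(8, 8, 10.0))
```

With these assumptions the 4 regionserver setup comes out at two rounds, which
is in the same range as the ~25 seconds measured, and adding regionservers
beyond the number of log files brings no further gain.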
bq. For assignment: Not sure where this number is coming from. I see ~30
assignment and we know this would go faster with more RS, but how fast?
The calculations seem to say that the time is spent in the replay (those are
the initial results from 20/Jun/12 16:52 and the test with zero rows from
07/Sep/12 17:27). It would be useful to redo the tests, as many things changed
recently on assignment. But if the results are ok, and if the cluster is big
enough and the regions are distributed, we should expect only one replay.
Again, it's the average, not the worst case.
> Improve HBase MTTR - Mean Time To Recover
> -----------------------------------------
>
> Key: HBASE-5843
> URL: https://issues.apache.org/jira/browse/HBASE-5843
> Project: HBase
> Issue Type: Umbrella
> Affects Versions: 0.96.0
> Reporter: nkeywal
> Assignee: nkeywal
>
> A part of the approach is described here:
> https://docs.google.com/document/d/1z03xRoZrIJmg7jsWuyKYl6zNournF_7ZHzdi0qz_B4c/edit
> The ideal target is:
> - failures impact client applications only by an added delay to execute a
> query, whatever the failure.
> - this delay is always less than 1 second.
> We're not going to achieve that immediately...
> Priority will be given to the most frequent issues.
> Short term:
> - software crash
> - standard administrative tasks such as stop/start of a cluster.