[
https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459444#comment-13459444
]
nkeywal commented on HBASE-5843:
--------------------------------
bq. Interesting. The sad part is we often find ourselves having to increase the
ZK timeout in order to deal with Juliet GC pauses. Given that detection time
dominates, perhaps we should put some effort into correcting that (multiple RS
on a single box?)
Imho, multiple RS on the same box would put us in a dead end: it increases the
number of TCP connections, adds workload on ZooKeeper, makes the balancer more
complicated, ... We could also have operational issues (rolling upgrades, fixed
TCP ports, ...).
The possible options I know are:
- improving ZooKeeper to have an algorithm that takes variance into account:
it's a common way to get good failure detection while minimizing false
positives. 20 years ago, it saved TCP from dying of congestion. There is
ZOOKEEPER-702 about this. That's medium term (hopefully...), but it would also
be useful for HDFS HA.
- Using the new GC options available in JDK 1.7 (see
http://www.oracle.com/technetwork/java/javase/tech/g1-intro-jsp-135488.html).
That's short term and simple. The only issue is that it was tried a few months
ago (by Andrew Purtell IIRC) and crashed the JVM. Still, it's something to look
at, and maybe we should report the bugs to Oracle if we find some.
- The offheap mentioned by Stack.
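To make the first option concrete, here is a minimal sketch of a variance-aware timeout, borrowing the smoothed mean and deviation estimator that TCP uses for its retransmission timer (RFC 6298). It is only an illustration of the idea, not the design proposed in ZOOKEEPER-702: the class name AdaptiveTimeout, the choice of heartbeat gaps as the input signal, and the initial timeout are all assumptions, and the constants are simply the TCP defaults.

```java
/**
 * Sketch of a variance-aware timeout estimator (RFC 6298 style).
 * AdaptiveTimeout is a hypothetical name, not a ZooKeeper or HBase class.
 */
final class AdaptiveTimeout {
    private static final double ALPHA = 1.0 / 8.0; // gain for the smoothed mean
    private static final double BETA = 1.0 / 4.0;  // gain for the smoothed deviation
    private static final double K = 4.0;           // deviation multiplier

    private final double initialTimeoutMs; // used until the first sample arrives
    private double smoothedMs;
    private double deviationMs;
    private boolean hasSample = false;

    AdaptiveTimeout(double initialTimeoutMs) {
        this.initialTimeoutMs = initialTimeoutMs;
    }

    /** Record the observed gap, in milliseconds, between two consecutive heartbeats. */
    void observe(double intervalMs) {
        if (!hasSample) {
            // First sample: seed the mean and use half of it as the deviation.
            smoothedMs = intervalMs;
            deviationMs = intervalMs / 2.0;
            hasSample = true;
            return;
        }
        // Update the deviation first, against the previous mean, as RFC 6298 does.
        deviationMs = (1.0 - BETA) * deviationMs + BETA * Math.abs(smoothedMs - intervalMs);
        smoothedMs = (1.0 - ALPHA) * smoothedMs + ALPHA * intervalMs;
    }

    /** Current timeout: smoothed mean plus K times the smoothed deviation. */
    double timeoutMs() {
        return hasSample ? smoothedMs + K * deviationMs : initialTimeoutMs;
    }

    public static void main(String[] args) {
        AdaptiveTimeout detector = new AdaptiveTimeout(30_000.0);
        double[] gapsMs = {1000, 1000, 1000, 4000, 1000}; // one GC-like stall
        for (double gap : gapsMs) {
            detector.observe(gap);
            System.out.printf("gap=%.0f ms -> timeout=%.1f ms%n", gap, detector.timeoutMs());
        }
    }
}
```

With steady heartbeats the timeout shrinks toward the mean gap; a single long pause widens it again instead of immediately declaring the server dead, which is the trade-off between detection time and false positives discussed above.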
I don't think it's one or the other, we're likely to need all of them :-). Still,
knowing where we stand with regard to JDK 1.7 is important imho.
bq. Yes, CPU definitely need a diet. Probably start with eliminating a bunch of
threads.
It's not directly MTTR, but I agree, we have far too many threads, and far too
many thread pools. Not only is it bad for performance, it also makes performance
analysis complicated.
bq. Right, I think HBASE-6752 is a great idea, but it doesn't address serving
reads more quickly. I'm wondering if there is more we can do to address that.
There is HBASE-6774 for the special case of "empty hlog" regions. It would be
interesting to see how many regions are in this situation on different
production clusters. There are so many ways to be in this situation... I would
love to have a stat on "at a given point in time, what's the proportion of
regions with a non-empty memstore". And improving the memstore flush policy
would lead to improvements here as well, I think.
With HBASE-6752 we also serve time-ranged reads (if they're lucky with the
range).
But yep, we don't cover all cases. Ideas welcome :-)
bq. Why do you say this with respect to locking? Is the performance not as good
as you would expect? Or just haven't looked at it yet?
I was expecting much better performance, but I haven't looked at it enough.
bq. I've wondered why we don't do this. Do you see any implementation
challenges with doing this? Maybe I'll look into it.
Well, it's close to the assignment part, so... :-) But it would be great if
you could have a look at this, because with all the discussions around
assignment, it's important to take these new use cases into account as well.
> Improve HBase MTTR - Mean Time To Recover
> -----------------------------------------
>
> Key: HBASE-5843
> URL: https://issues.apache.org/jira/browse/HBASE-5843
> Project: HBase
> Issue Type: Umbrella
> Affects Versions: 0.96.0
> Reporter: nkeywal
> Assignee: nkeywal
>
> A part of the approach is described here:
> https://docs.google.com/document/d/1z03xRoZrIJmg7jsWuyKYl6zNournF_7ZHzdi0qz_B4c/edit
> The ideal target is:
> - failures impact client applications only by adding a delay to the execution
> of a query, whatever the failure.
> - this delay is always less than 1 second.
> We're not going to achieve that immediately...
> Priority will be given to the most frequent issues.
> Short term:
> - software crash
> - standard administrative tasks such as stop/start of a cluster.