[
https://issues.apache.org/jira/browse/HBASE-8338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13638381#comment-13638381
]
Nicolas Liochon commented on HBASE-8338:
----------------------------------------
Some general configuration:
I hope HBASE-6295 solves the 'temporarily slow or non-responding machine' issue.
Beyond that, as things stand today, I think we are more or less doomed to be
'as slow as the slowest'. The usual solutions for this are to use an
asynchronous client or to send large buffers, to get a chance of approaching
the average result.
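To make the 'as slow as the slowest' point concrete, here is a minimal sketch of the arithmetic. It is an illustration only, not HBase client code: the latency figures, the number of rounds and both function names are hypothetical. It assumes a synchronous client that waits for every server before starting the next round, and an asynchronous client that lets each server work through its own queue independently.

```python
def sync_total_ms(latencies_ms, rounds):
    """Synchronous batches: each round ends only when the slowest server answers."""
    return rounds * max(latencies_ms)


def async_per_server_ms(latencies_ms, rounds):
    """Asynchronous client: each server drains its own queue at its own pace."""
    return [rounds * latency for latency in latencies_ms]


# Hypothetical figures: four healthy servers and one slow one.
latencies_ms = [10, 12, 11, 9, 500]
rounds = 3

print(sync_total_ms(latencies_ms, rounds))        # 1500
print(async_per_server_ms(latencies_ms, rounds))  # [30, 36, 33, 27, 1500]
```

Under these assumptions the slow server still needs 1500 ms either way, but the work sent to the healthy servers completes in about 30 ms instead of being held back by it.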
But more fundamentally, we really need the cluster to be well balanced. If we
want this, we need more than a global balancer imho: we need the regionserver
to prioritize the queries.
In other words there are 3 cases of unbalanced clusters:
- long term (differences between machines, for example): this should be managed by the
balancer we have today
- short term (dead machine, GC issue, ...): HBASE-6295, i.e. asynchronous/large
buffer
- medium term (specific temporary load on a machine, compactions, ...): client
and regionserver priorities. HBASE-6295 has a bit of a priority feature (the
number of tasks).
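One way to picture a priority based on the number of tasks is a per-server cap on in-flight work, so that a loaded server cannot absorb all of the client's capacity. The sketch below is an assumption about the general idea, not the HBASE-6295 implementation: the class name, method names, server names and the limit of 2 are all hypothetical.

```python
from collections import defaultdict


class PerServerTaskLimiter:
    """Hypothetical limiter: caps the number of in-flight tasks per server."""

    def __init__(self, max_tasks_per_server):
        self.max_tasks_per_server = max_tasks_per_server
        self.in_flight = defaultdict(int)  # server name -> tasks currently running

    def try_submit(self, server):
        """Return True and count the task if the server is under its cap."""
        if self.in_flight[server] >= self.max_tasks_per_server:
            return False  # caller may buffer the work or send to another server
        self.in_flight[server] += 1
        return True

    def complete(self, server):
        """Record that one task on the server has finished."""
        if self.in_flight[server] > 0:
            self.in_flight[server] -= 1


limiter = PerServerTaskLimiter(max_tasks_per_server=2)
accepted = [limiter.try_submit("rs-slow") for _ in range(3)]
print(accepted)                       # [True, True, False]
print(limiter.try_submit("rs-fast"))  # True
limiter.complete("rs-slow")
print(limiter.try_submit("rs-slow"))  # True
```

In this sketch the third task for the loaded server is refused while other servers keep accepting work, which is the behaviour the medium-term case asks for.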
> Latency Resilience; umbrella list of issues that will help us ride over bad
> disk, bad region, ec2, etc.
> -------------------------------------------------------------------------------------------------------
>
> Key: HBASE-8338
> URL: https://issues.apache.org/jira/browse/HBASE-8338
> Project: HBase
> Issue Type: Umbrella
> Components: LatencyResilience
> Reporter: stack
> Priority: Critical
>
> Chatting w/ Elliott, we started listing out items to fix that would help keep
> hbase latency approximately constant as disks went bad, were saturated by a
> neighbour (ec2), etc.
> I just made a new LatencyResilience issue category to tag issues that
> contribute to this project.
> I have to go at the moment but when I get back I'll start to link in existing
> issues that help this project along and I'll file new ones.
> Here is what we chatted about:
> + Multiple WALs effort will help keep write latency roughly constant.
> + Figuring out how to get a new read started over dfsclient if current replica
> read is taking too long would help keep reads about constant (maybe could
> exploit the nkeywal hackery messing w/ replicas order).
> + There is an issue where client can currently pile up on a single region
> because of the way we do client queues by regionserver. This needs fixing.
> The above are a few ideas worth further exploration at least.
> Idea is to try and bring down our 95th percentiles and to make us more robust in
> the face of dying disks, etc. I see this issue rising to the fore now there
> has been good progress on the MTTR project.