[ https://issues.apache.org/jira/browse/HBASE-16388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15587002#comment-15587002 ]
Mikhail Antonov commented on HBASE-16388:
-----------------------------------------

[~yangzhe1991] Good one, I missed that. Is it fair to say that the primary motivation is that the global, per-region and per-server limits in AP are flawed, since they are only ever enforced on the write path (going through AP#submit() / buffered mutator)?

With this being a client-only change, I'd consider backporting it to 1.3. Anything that reduces the blast radius of a bad RS is an important reliability fix IMO.

> Prevent client threads being blocked by only one slow region server
> -------------------------------------------------------------------
>
>                 Key: HBASE-16388
>                 URL: https://issues.apache.org/jira/browse/HBASE-16388
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Phil Yang
>            Assignee: Phil Yang
>             Fix For: 2.0.0, 1.4.0
>
>         Attachments: HBASE-16388-branch-1-v1.patch,
> HBASE-16388-branch-1-v2.patch, HBASE-16388-v1.patch, HBASE-16388-v2.patch,
> HBASE-16388-v2.patch, HBASE-16388-v2.patch, HBASE-16388-v2.patch,
> HBASE-16388-v3.patch
>
>
> It is a common use case for HBase users to have several threads/handlers
> in their service, each with its own Table/HTable instance. Users generally
> assume that the handlers are independent and won't affect each other.
> However, in an extreme case, if a region server is very slow, every request
> to this RS will time out, and the handlers of the user's service may be
> occupied by these long-waiting requests, so that even requests to other RSs
> time out.
> For example:
> Suppose a client service has 100 handlers (timeout 1000 ms) and HBase has
> 10 region servers whose average response time is 50 ms. If no region server
> is slow, the service can handle 2000 requests per second.
> Now suppose this service's QPS is 1000. If one region server is very slow,
> all requests to it will time out. Users would hope that only 10% of requests
> fail and that the other 90% still respond in 50 ms, because only 10% of
> requests go to the slow RS. However, each second there are 100 long-waiting
> requests, which occupy exactly all 100 handlers. So all handlers are blocked,
> and the availability of the service is almost zero.
> To prevent this, we can limit the maximum number of concurrent requests to
> one RS at the process level. Requests exceeding the limit immediately throw
> ServerBusyException (extends DoNotRetryIOE) to the user. In the above case,
> if we set this limit to 20, only 20 handlers will be occupied and the other
> 80 handlers can still handle requests to other RSs. The availability of the
> service is 90%, as expected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
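
For illustration, the process-level limit proposed in the issue description could be sketched roughly as below, assuming a simple in-flight counter per region server. The class PerServerLimiter, its methods, and the nested ServerBusyException are hypothetical stand-ins, not the classes in the attached patches; the limit of 20 and the 100 handlers are taken from the example in the description.

```java
import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: caps in-flight requests per region server within one client process. */
class PerServerLimiter {

    /** Stand-in for the ServerBusyException named in the issue; not the HBase class. */
    static class ServerBusyException extends IOException {
        ServerBusyException(String server, int limit) {
            super("Concurrent request limit " + limit + " reached for " + server);
        }
    }

    private final int maxPerServer;
    private final ConcurrentHashMap<String, AtomicInteger> counters = new ConcurrentHashMap<>();

    PerServerLimiter(int maxPerServer) {
        this.maxPerServer = maxPerServer;
    }

    /** Reserves a slot for the given server, or fails immediately instead of waiting. */
    void acquire(String server) throws ServerBusyException {
        AtomicInteger counter = counters.computeIfAbsent(server, s -> new AtomicInteger());
        if (counter.incrementAndGet() > maxPerServer) {
            counter.decrementAndGet(); // undo the reservation before rejecting
            throw new ServerBusyException(server, maxPerServer);
        }
    }

    /** Frees a slot once the request has completed or timed out. */
    void release(String server) {
        AtomicInteger counter = counters.get(server);
        if (counter != null) {
            counter.decrementAndGet();
        }
    }

    /** Number of requests currently counted against the given server. */
    int current(String server) {
        AtomicInteger counter = counters.get(server);
        return counter == null ? 0 : counter.get();
    }

    public static void main(String[] args) {
        PerServerLimiter limiter = new PerServerLimiter(20);
        int occupied = 0;
        int rejected = 0;
        // 100 handlers all target the slow server; none of the requests complete.
        for (int handler = 0; handler < 100; handler++) {
            try {
                limiter.acquire("rs-slow");
                occupied++;
            } catch (ServerBusyException e) {
                rejected++;
            }
        }
        // prints "occupied=20 rejected=80"
        System.out.println("occupied=" + occupied + " rejected=" + rejected);
    }
}
```

In a real client, the caller would pair acquire() with release() in a finally block around the RPC, so that a timed-out request still frees its slot.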