[
https://issues.apache.org/jira/browse/HBASE-28453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ray Mattingly reassigned HBASE-28453:
-------------------------------------
Assignee: Ray Mattingly
> Support a middle ground between the Average and Fixed interval rate limiters
> ----------------------------------------------------------------------------
>
> Key: HBASE-28453
> URL: https://issues.apache.org/jira/browse/HBASE-28453
> Project: HBase
> Issue Type: Improvement
> Affects Versions: 2.6.0
> Reporter: Ray Mattingly
> Assignee: Ray Mattingly
> Priority: Major
> Attachments: Screenshot 2024-03-21 at 2.08.51 PM.png, Screenshot
> 2024-03-21 at 2.30.01 PM.png
>
>
> h3. Background
> HBase quotas support two rate limiters: a "fixed" and an "average" interval
> rate limiter.
> h4. FixedIntervalRateLimiter
> The fixed interval rate limiter is simpler: it has a TimeUnit, say 1 second,
> and it refills a resource allotment on the recurring interval. So you may get
> 10 resources every second, and if you exhaust all 10 resources in the first
> millisecond of an interval then you will need to wait 999ms to acquire even 1
> more resource.
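The fixed-interval behaviour described above can be sketched as follows. This is a minimal illustration, not HBase's FixedIntervalRateLimiter source: the class name `FixedIntervalSketch`, the method `tryConsume`, and the convention of passing the current time in as `nowMs` are all assumptions made for the example.

```java
// Hypothetical sketch of a fixed-interval limiter; not HBase's actual source.
final class FixedIntervalSketch {
    private final long limit;       // resources granted at each refill
    private final long intervalMs;  // length of the refill interval
    private long available;         // resources left in the current interval
    private long nextRefillMs;      // when the next full refill happens

    FixedIntervalSketch(long limit, long intervalMs, long nowMs) {
        this.limit = limit;
        this.intervalMs = intervalMs;
        this.available = limit;
        this.nextRefillMs = nowMs + intervalMs;
    }

    /** Returns 0 if the amount was granted, otherwise the ms to wait. */
    long tryConsume(long amount, long nowMs) {
        if (nowMs >= nextRefillMs) {
            // The whole allotment comes back at once, never in part.
            available = limit;
            nextRefillMs = nowMs + intervalMs;
        }
        if (amount <= available) {
            available -= amount;
            return 0;
        }
        // Nothing is refilled early, so the caller waits for the next full refill.
        return nextRefillMs - nowMs;
    }

    public static void main(String[] args) {
        FixedIntervalSketch limiter = new FixedIntervalSketch(10, 1000, 0);
        System.out.println(limiter.tryConsume(10, 1)); // 0: all 10 granted at t=1ms
        System.out.println(limiter.tryConsume(1, 1));  // 999: wait for the full refill
    }
}
```

With a limit of 10 per 1000ms, exhausting the allotment at t=1ms leaves the next request waiting 999ms, matching the example in the quoted text.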
> h4. AverageIntervalRateLimiter
> The average interval rate limiter, HBase's default, allows for more flexibly
> timed refilling of the resource allotment. Extending our previous example,
> say you have a 10 reads/sec quota and you have exhausted all 10 resources
> within 1ms of the last full refill. If you request 1 more read then, rather
> than returning a 999ms wait interval indicating the next full refill time,
> the rate limiter will recognize that you only need to wait 99ms before 1 read
> can be available. After 100ms has passed in aggregate since the last full
> refill, it will refill 1/10th of the limit, which is enough to serve the
> request for 1/10th of the resources.
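A proportional refill of this kind might look like the sketch below. It is an illustration under assumptions, not HBase's AverageIntervalRateLimiter source: `AverageIntervalSketch`, `tryConsume`, and the explicit `nowMs` argument are hypothetical, and the sketch drops any fractional resource at each refill to keep the arithmetic in whole numbers.

```java
// Hypothetical sketch of an average-interval limiter; not HBase's actual source.
final class AverageIntervalSketch {
    private final long limit;       // resources per full interval
    private final long intervalMs;  // length of the full interval
    private long available;
    private long lastRefillMs;      // last time any resources were credited

    AverageIntervalSketch(long limit, long intervalMs, long nowMs) {
        this.limit = limit;
        this.intervalMs = intervalMs;
        this.available = limit;
        this.lastRefillMs = nowMs;
    }

    /** Returns 0 if the amount was granted, otherwise the ms to wait. */
    long tryConsume(long amount, long nowMs) {
        refill(nowMs);
        if (amount <= available) {
            available -= amount;
            return 0;
        }
        long missing = amount - available;
        // Time needed to earn the missing resources, rounded up to whole ms.
        long msToEarnMissing = (missing * intervalMs + limit - 1) / limit;
        // Time already elapsed since the last credit counts toward the wait.
        return msToEarnMissing - (nowMs - lastRefillMs);
    }

    private void refill(long nowMs) {
        // Credit resources in proportion to elapsed time (floored).
        long earned = (nowMs - lastRefillMs) * limit / intervalMs;
        if (earned > 0) {
            available = Math.min(limit, available + earned);
            lastRefillMs = nowMs; // simplification: fractional remainder is dropped
        }
    }
}
```

For a 10/sec limit exhausted at t=1ms, `tryConsume(1, 1)` returns 99, and the same call at t=100ms is granted. For a byte-based limit such as 100MB/sec, a 64KB request against an empty allotment yields a wait of about 1ms in this sketch.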
> h3. The Problems with Current RateLimiters
> The problem with the fixed interval rate limiter is that it is too strict
> from a latency perspective. It results in quota limits to which we cannot
> fully subscribe with any consistency.
> The problem with the average interval rate limiter is that, in practice, it
> is far too optimistic. For example, a real rate limiter might limit to
> 100MB/sec of read IO per machine. Any multigets that come in will require
> only a tiny fraction of this limit; for example, a 64KB block is only 0.06%
> of the total. As a result, the vast majority of wait intervals end up being
> tiny, typically <5ms. This can actually have the opposite of the intended
> effect, where setting up a throttle causes a DDoS of your RPC layer via
> continuous throttling and near-immediate retrying. I've discussed this problem in
> https://issues.apache.org/jira/browse/HBASE-28429 and proposed a minimum wait
> interval as the solution there; after some more thinking, I believe this new
> rate limiter would be a less hacky solution to this deficit so I'd like to
> close that Jira in favor of this one.
> See the attached chart where I put in place a 10k req/sec/machine throttle
> for this user at 10:43 to try to curb this high traffic, and it resulted in a
> huge spike of req/sec due to the throttle/retry loop created by the
> AverageIntervalRateLimiter.
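The arithmetic behind the tiny wait intervals can be worked through directly. The sketch below assumes binary units (1MB = 1024 * 1024 bytes) and a purely proportional refill; `TinyWaitSketch` and `msToEarn` are hypothetical names, and the retries-per-second figure is only an upper bound under the assumption that a client retries the moment its wait interval elapses.

```java
// Hypothetical arithmetic sketch; the names and the binary-unit assumption are illustrative.
final class TinyWaitSketch {
    /** Milliseconds for a proportional refill to earn `bytes` at `limitBytesPerSec`. */
    static double msToEarn(long bytes, long limitBytesPerSec) {
        return 1000.0 * bytes / limitBytesPerSec;
    }

    public static void main(String[] args) {
        long limit = 100L * 1024 * 1024; // 100MB/sec read IO limit
        long block = 64L * 1024;         // one 64KB block

        double sharePercent = 100.0 * block / limit;
        double waitMs = msToEarn(block, limit);

        System.out.println("share of limit (%): " + sharePercent);
        System.out.println("wait to earn one block (ms): " + waitMs);
        // Upper bound on throttle/retry cycles per second for one blocked caller,
        // assuming it retries as soon as the wait elapses.
        System.out.println("max retry cycles per sec: " + (1000.0 / waitMs));
    }
}
```

Under these assumptions one block is roughly 0.06% of the limit and is earned back in well under a millisecond, which is why a throttled caller can cycle through throttle and retry so quickly.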
> h3. PartialIntervalRateLimiter as a Solution
> I've implemented a RateLimiter which allows for partial chunks of the overall
> interval to be refilled; by default, these chunks are 10% (or 100ms of a 1s
> interval). I've deployed this to a test cluster at my day job and have seen
> this really help our ability to fully subscribe to a quota limit without
> executing superfluous retries. See the other attached chart which shows a
> cluster undergoing a rolling restart from using FixedIntervalRateLimiter to
> my new PartialIntervalRateLimiter and how it is then able to fully subscribe
> to its allotted 25MB/sec/machine read IO quota.
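One way to picture chunked refilling is the sketch below. It is not the patch proposed in this issue: `PartialIntervalSketch`, `tryConsume`, the `chunksPerInterval` parameter, and the choice to return the time until enough whole chunks have elapsed are all assumptions made for illustration.

```java
// Hypothetical sketch of a partial-interval limiter; not the patch for this issue.
final class PartialIntervalSketch {
    private final long limit;     // resources per full interval
    private final long chunkMs;   // length of one refill chunk
    private final long perChunk;  // resources credited per elapsed chunk
    private long available;
    private long lastRefillMs;    // start of the current chunk

    PartialIntervalSketch(long limit, long intervalMs, int chunksPerInterval, long nowMs) {
        if (chunksPerInterval <= 0 || intervalMs < chunksPerInterval) {
            throw new IllegalArgumentException("each chunk must be at least 1ms long");
        }
        this.limit = limit;
        this.chunkMs = intervalMs / chunksPerInterval;
        this.perChunk = Math.max(1, limit / chunksPerInterval);
        this.available = limit;
        this.lastRefillMs = nowMs;
    }

    /** Returns 0 if the amount was granted, otherwise the ms to wait. */
    long tryConsume(long amount, long nowMs) {
        // Credit only whole chunks; time inside a chunk earns nothing yet.
        long chunks = (nowMs - lastRefillMs) / chunkMs;
        if (chunks > 0) {
            available = Math.min(limit, available + chunks * perChunk);
            lastRefillMs += chunks * chunkMs;
        }
        if (amount <= available) {
            available -= amount;
            return 0;
        }
        long missing = amount - available;
        long chunksNeeded = (missing + perChunk - 1) / perChunk; // ceiling
        // Wait until the chunk boundary at which enough has been refilled.
        return lastRefillMs + chunksNeeded * chunkMs - nowMs;
    }
}
```

With a 100MB/sec limit split into 10 chunks, a 64KB request against an empty allotment at t=1ms gets a 99ms wait in this sketch rather than a sub-millisecond one, so retries are spaced at chunk granularity while resources still return ten times per interval.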