[jira] [Updated] (HBASE-28453) Support a middle ground between the Average and Fixed interval rate limiters

Ray Mattingly (Jira) Thu, 21 Mar 2024 15:20:05 -0700


     [ 
https://issues.apache.org/jira/browse/HBASE-28453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ray Mattingly updated HBASE-28453:
----------------------------------
    Description: 
h3. Background

HBase quotas support two rate limiters: a "fixed" and an "average" interval 
rate limiter.
h4. FixedIntervalRateLimiter

The fixed interval rate limiter is simpler: it has a TimeUnit, say 1 second, 
and it refills a resource allotment on the recurring interval. So you may get 
10 resources every second, and if you exhaust all 10 resources in the first 
millisecond of an interval then you will need to wait 999ms to acquire even 1 
more resource.
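
As a rough illustration of the behavior described above, here is a minimal sketch. It is not HBase's actual FixedIntervalRateLimiter: the class name, constructor, and `tryAcquire` method are hypothetical, and time is passed in explicitly to keep the example deterministic.

```java
/** Hypothetical sketch of a fixed-interval limiter; not the HBase class. */
final class FixedIntervalSketch {
  private final long limit;          // resources granted per interval
  private final long intervalMillis; // length of the refill interval
  private long available;            // resources left in the current interval
  private long nextRefillMillis;     // when the next full refill happens

  FixedIntervalSketch(long limit, long intervalMillis, long nowMillis) {
    this.limit = limit;
    this.intervalMillis = intervalMillis;
    this.available = limit;
    this.nextRefillMillis = nowMillis + intervalMillis;
  }

  /**
   * Returns 0 if the amount was consumed, otherwise the number of
   * milliseconds until the next full refill.
   */
  long tryAcquire(long amount, long nowMillis) {
    if (nowMillis >= nextRefillMillis) {
      // The whole allotment comes back at once, and only at the boundary.
      available = limit;
      nextRefillMillis = nowMillis + intervalMillis;
    }
    if (amount <= available) {
      available -= amount;
      return 0;
    }
    return nextRefillMillis - nowMillis;
  }
}
```

With a limit of 10 per 1000ms created at time 0, consuming all 10 at 1ms and then asking for 1 more at 1ms yields a 999ms wait, matching the example above.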
h4. AverageIntervalRateLimiter

The average interval rate limiter, HBase's default, allows for more flexibly 
timed refilling of the resource allotment. Extending our previous example, say 
you have a 10 reads/sec quota and you have exhausted all 10 resources within 
1ms of the last full refill. If you request 1 more read then, rather than 
returning a 999ms wait interval indicating the next full refill time, the rate 
limiter will recognize that you only need to wait 99ms before 1 read can be 
available. After 100ms has passed in aggregate since the last full refill, it 
will support the refilling of 1/10th the limit to facilitate the request for 
1/10th the resources.
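
The arithmetic in this example can be sketched as follows. This is a simplified, hypothetical model written to reproduce the numbers above (99ms wait, 1/10th refilled after 100ms); it is not HBase's AverageIntervalRateLimiter, and the names are invented for illustration.

```java
/** Hypothetical sketch of an average-interval limiter; not the HBase class. */
final class AverageIntervalSketch {
  private final long limit;          // resources granted per interval
  private final long intervalMillis; // interval over which the limit applies
  private long available;            // resources currently available
  private long lastRefillMillis;     // last time any resources were credited

  AverageIntervalSketch(long limit, long intervalMillis, long nowMillis) {
    this.limit = limit;
    this.intervalMillis = intervalMillis;
    this.available = limit;
    this.lastRefillMillis = nowMillis;
  }

  /**
   * Returns 0 if the amount was consumed, otherwise the number of
   * milliseconds until enough resources will have been earned back.
   */
  long tryAcquire(long amount, long nowMillis) {
    // Credit resources in proportion to the time elapsed since the last refill.
    long earned = limit * (nowMillis - lastRefillMillis) / intervalMillis;
    if (earned > 0) {
      available = Math.min(limit, available + earned);
      lastRefillMillis = nowMillis;
    }
    if (amount <= available) {
      available -= amount;
      return 0;
    }
    // Time (since the last refill) needed to earn the missing amount, rounded up.
    long missing = amount - available;
    long neededMillis = (missing * intervalMillis + limit - 1) / limit;
    return neededMillis - (nowMillis - lastRefillMillis);
  }
}
```

With a limit of 10 per 1000ms created at time 0, consuming all 10 at 1ms and then requesting 1 more at 1ms returns a 99ms wait; the same request at 100ms is granted.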
h3. The Problems with Current RateLimiters

The problem with the fixed interval rate limiter is that it is too strict from 
a latency perspective. It results in quota limits to which we cannot fully 
subscribe with any consistency.

The problem with the average interval rate limiter is that, in practice, it is 
far too optimistic. For example, a real rate limiter might limit to 100MB/sec 
of read IO per machine. Any multigets that come in will require only a tiny 
fraction of this limit; for example, a 64KB block is only 0.06% of the total. 
As a result, the vast majority of wait intervals end up being tiny (<5ms). 
This can actually cause the inverse of your intention, where setting up a 
throttle causes a DDoS of your RPC layer via continuous throttling and 
near-immediate retrying. I've discussed this problem in 
https://issues.apache.org/jira/browse/HBASE-28429 and proposed a minimum wait 
interval as the solution there; after some more thinking, I believe this new 
rate limiter would be a less hacky solution to this deficit, so I'd like to 
close that Jira in favor of this one.
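
To make the size of these wait intervals concrete, the figures above can be worked through in a few lines. This is only an illustrative calculation: the helper names are hypothetical, and binary units (1KB = 1024 bytes) are an assumption.

```java
/** Hypothetical helper for the arithmetic above; not part of HBase. */
final class WaitIntervalMath {
  private WaitIntervalMath() {}

  /** Fraction of a one-second allotment that a single request consumes. */
  static double shareOfLimit(long requestBytes, long limitBytesPerSec) {
    return (double) requestBytes / limitBytesPerSec;
  }

  /** Milliseconds a proportional refill needs to earn back requestBytes. */
  static double millisToEarn(long requestBytes, long limitBytesPerSec) {
    return 1000.0 * requestBytes / limitBytesPerSec;
  }
}
```

For a 64KB block against a 100MB/sec limit, `shareOfLimit` gives 0.000625 (about 0.06%) and `millisToEarn` gives 0.625ms, so a throttled client that honors the returned wait can retry well under 5ms later.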

See the attached chart where I put in place a 10k req/sec/machine throttle for 
this user at 10:43 to try to curb this high traffic, and it resulted in a huge 
spike of req/sec due to the throttle/retry loop created by the 
AverageIntervalRateLimiter.
h3. Original Proposal: PartialIntervalRateLimiter as a Solution

I've implemented a RateLimiter which allows for partial chunks of the overall 
interval to be refilled; by default these chunks are 10% of the interval 
(100ms of a 1s interval). I've deployed this to a test cluster at my day job 
and have seen this really help our ability to fully subscribe to a quota limit 
without executing superfluous retries. See the other attached chart, which 
shows a cluster undergoing a rolling restart from using 
FixedIntervalRateLimiter to my new PartialIntervalRateLimiter and how it is 
then able to fully subscribe to its allotted 25MB/sec/machine read IO quota.
h3. Updated Proposal: Improving FixedIntervalRateLimiter

Rather than implement a new rate limiter, we can make a lower-touch change 
that just adds support for a refill interval shorter than the time unit 
on a FixedIntervalRateLimiter. This can be a no-op change for those who have 
not opted into the feature, because the refill interval defaults to the time 
unit. For clarity, see [my branch 
here|https://github.com/apache/hbase/compare/master...HubSpot:hbase:HBASE-28453],
which I will open as a PR soon.
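
One way to picture a refill interval shorter than the time unit is the sketch below. It is an assumption-laden illustration, not the code in the linked branch: the class name, constructor parameters, and the choice to credit a proportional share of the limit at each refill boundary are all hypothetical.

```java
/** Hypothetical sketch of a fixed limiter with a sub-interval refill. */
final class SubIntervalRefillSketch {
  private final long limit;                // resources granted per time unit
  private final long timeUnitMillis;       // e.g. 1000 for a per-second quota
  private final long refillIntervalMillis; // how often a partial refill occurs
  private long available;
  private long nextRefillMillis;

  SubIntervalRefillSketch(long limit, long timeUnitMillis,
      long refillIntervalMillis, long nowMillis) {
    if (refillIntervalMillis <= 0 || refillIntervalMillis > timeUnitMillis) {
      throw new IllegalArgumentException("refill interval must be in (0, timeUnit]");
    }
    this.limit = limit;
    this.timeUnitMillis = timeUnitMillis;
    this.refillIntervalMillis = refillIntervalMillis;
    this.available = limit;
    this.nextRefillMillis = nowMillis + refillIntervalMillis;
  }

  /** Returns 0 if consumed, otherwise milliseconds until the next refill. */
  long tryAcquire(long amount, long nowMillis) {
    if (nowMillis >= nextRefillMillis) {
      // Number of refill boundaries crossed since the last check.
      long chunks = 1 + (nowMillis - nextRefillMillis) / refillIntervalMillis;
      // Crediting more than one time unit's worth is pointless; cap it.
      long credited = Math.min(chunks, timeUnitMillis / refillIntervalMillis + 1);
      long refill = credited * limit * refillIntervalMillis / timeUnitMillis;
      available = Math.min(limit, available + refill);
      nextRefillMillis += chunks * refillIntervalMillis;
    }
    if (amount <= available) {
      available -= amount;
      return 0;
    }
    // Waits are never shorter than the remainder of the refill interval.
    return nextRefillMillis - nowMillis;
  }
}
```

With a limit of 10 per 1000ms and a 100ms refill interval, exhausting the allotment at 1ms produces a 99ms wait rather than 999ms, and 1 resource is available at 100ms. Setting the refill interval equal to the time unit reproduces the plain fixed-interval behavior (a 999ms wait in the same scenario). A request larger than one chunk may need more than one wait in this sketch.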

  was:
h3. Background

HBase quotas support two rate limiters: a "fixed" and an "average" interval 
rate limiter.
h4. FixedIntervalRateLimiter

The fixed interval rate limiter is simpler: it has a TimeUnit, say 1 second, 
and it refills a resource allotment on the recurring interval. So you may get 
10 resources every second, and if you exhaust all 10 resources in the first 
millisecond of an interval then you will need to wait 999ms to acquire even 1 
more resource.
h4. AverageIntervalRateLimiter

The average interval rate limiter, HBase's default, allows for more flexibly 
timed refilling of the resource allotment. Extending our previous example, say 
you have a 10 reads/sec quota and you have exhausted all 10 resources within 
1ms of the last full refill. If you request 1 more read then, rather than 
returning a 999ms wait interval indicating the next full refill time, the rate 
limiter will recognize that you only need to wait 99ms before 1 read can be 
available. After 100ms has passed in aggregate since the last full refill, it 
will support the refilling of 1/10th the limit to facilitate the request for 
1/10th the resources.
h3. The Problems with Current RateLimiters

The problem with the fixed interval rate limiter is that it is too strict from 
a latency perspective. It results in quota limits to which we cannot fully 
subscribe with any consistency.

The problem with the average interval rate limiter is that, in practice, it is 
far too optimistic. For example, a real rate limiter might limit to 100MB/sec 
of read IO per machine. Any multigets that come in will require only a tiny 
fraction of this limit; for example, a 64KB block is only 0.06% of the total. 
As a result, the vast majority of wait intervals end up being tiny (<5ms). 
This can actually cause the inverse of your intention, where setting up a 
throttle causes a DDoS of your RPC layer via continuous throttling and 
near-immediate retrying. I've discussed this problem in 
https://issues.apache.org/jira/browse/HBASE-28429 and proposed a minimum wait 
interval as the solution there; after some more thinking, I believe this new 
rate limiter would be a less hacky solution to this deficit, so I'd like to 
close that Jira in favor of this one.

See the attached chart where I put in place a 10k req/sec/machine throttle for 
this user at 10:43 to try to curb this high traffic, and it resulted in a huge 
spike of req/sec due to the throttle/retry loop created by the 
AverageIntervalRateLimiter.
h3. PartialIntervalRateLimiter as a Solution

I've implemented a RateLimiter which allows for partial chunks of the overall 
interval to be refilled; by default these chunks are 10% of the interval 
(100ms of a 1s interval). I've deployed this to a test cluster at my day job 
and have seen this really help our ability to fully subscribe to a quota limit 
without executing superfluous retries. See the other attached chart, which 
shows a cluster undergoing a rolling restart from using 
FixedIntervalRateLimiter to my new PartialIntervalRateLimiter and how it is 
then able to fully subscribe to its allotted 25MB/sec/machine read IO quota.


> Support a middle ground between the Average and Fixed interval rate limiters
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-28453
>                 URL: https://issues.apache.org/jira/browse/HBASE-28453
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 2.6.0
>            Reporter: Ray Mattingly
>            Assignee: Ray Mattingly
>            Priority: Major
>         Attachments: Screenshot 2024-03-21 at 2.08.51 PM.png, Screenshot 
> 2024-03-21 at 2.30.01 PM.png



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
