Yes, this is the drawback if we do static partitioning.

Thanks,
Penghui


Kaushik Ghosh <[email protected]> wrote on Wednesday, March 3, 2021 at 9:21 AM:

> Wouldn't such a static partitioning approach have the drawback that in a
> pathological case, all the namespaces associated with a certain
> namespace-bundle may be inactive (and not use their quota) while other
> namespaces are over-active and being restricted?
>
> Thanks,
> Kaushik
>
> On Tue, Mar 2, 2021 at 5:03 PM PengHui Li <[email protected]> wrote:
>
>>
>> The approach is to share the quotas between brokers through an internal
>> topic. For example, if the rate limit is 100 msgs/s and the current rate is
>> 50 msgs/s, the remaining 50 msgs/s still has to be distributed somehow.
>> So even if we share quotas between brokers, we still need a policy for
>> assigning the remaining quota to multiple brokers.
>>
>> How about assigning the quotas per namespace bundle? If the publish rate
>> limit is set to 100 msgs/s and the namespace has 10 bundles, we can assign
>> 10 msgs/s per namespace bundle.
>> Since a bundle is always assigned to exactly one broker, we don't need to
>> share the quotas.
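>>
>> A minimal sketch of that arithmetic (names are illustrative only, not an
>> existing Pulsar API):
>>
>> ```
>> // Hypothetical sketch of static per-bundle quota assignment.
>> public final class BundleQuotaSketch {
>>     /** Evenly split a namespace-level msg/s quota across its bundles. */
>>     static double perBundleRate(double namespaceRate, int bundleCount) {
>>         return namespaceRate / bundleCount;
>>     }
>>
>>     public static void main(String[] args) {
>>         // 100 msgs/s over 10 bundles => 10 msgs/s per bundle, enforced
>>         // locally by whichever broker currently owns the bundle.
>>         System.out.println(perBundleRate(100, 10)); // 10.0
>>     }
>> }
>> ```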
>>
>> This is just a rough idea, as an alternative to sharing the quotas between
>> brokers. I know that each approach has advantages and disadvantages; we
>> took a similar path for the broker publish buffer limitation, splitting
>> the whole buffer into multiple parts, one per IO thread.
>>
>> Thanks,
>> Penghui
>>
>> Matteo Merli <[email protected]> wrote on Monday, March 1, 2021 at 1:17 PM:
>>
>>>
>>> https://github.com/apache/pulsar/wiki/PIP-82%3A-Tenant-and-namespace-level-rate-limiting
>>>
>>> =============
>>>
>>>
>>> * **Status**: Proposal
>>> * **Authors**: Bharani Chadalavada, Kaushik Ghosh, Ravi Vaidyanathan,
>>> Matteo Merli
>>> * **Pull Request**:
>>> * **Mailing List discussion**:
>>> * **Release**:
>>>
>>> ## Motivation
>>>
>>> Currently in Pulsar, it is possible to configure rate limiting, in
>>> terms of messages/sec or bytes/sec, on both the producers and the
>>> consumers of a topic. The rates are configured in the namespace
>>> policies and the enforcement is done at the topic level, or at the
>>> partition level in the case of a partitioned topic.
>>>
>>> The fact that the rate is enforced at the topic level makes it
>>> impossible to control the max rate across a given namespace (a
>>> namespace can span multiple brokers). For example, if the limit is
>>> 100 msg/s per topic, a user can simply create more topics to keep
>>> increasing the load on the system.
>>>
>>> Instead, we should have a way to define producer and consumer limits
>>> for a namespace or a Pulsar tenant, and have the Pulsar brokers
>>> collectively enforce them.
>>>
>>> ## Goal
>>>
>>> The goal of this feature is to allow users to configure a namespace-
>>> or tenant-wide limit for producers and consumers, and have that limit
>>> enforced irrespective of the number of topics in the namespace, with
>>> fair sharing of the quotas.
>>>
>>> Another important aspect is that the quota enforcement needs to be
>>> able to dynamically adjust when the quota is raised or reduced.
>>>
>>> ### Non-goals
>>>
>>> It is not a goal to provide a super-strict limiter; rather, the
>>> implementation is allowed to either undercount or overcount for short
>>> amounts of time, as long as the limiting converges to within, say, 10%
>>> of the configured quota.
>>>
>>> It is not a goal to allow users to configure limits at multiple levels
>>> (tenant/namespace/topic) and implement a hierarchical enforcement
>>> mechanism.
>>>
>>> If the limits are configured at tenant level, it is not a goal to
>>> evenly distribute the quotas across all namespaces. Similarly if the
>>> limits are configured at namespace level, it is not a goal to evenly
>>> distribute the quota across all topics in the namespace.
>>>
>>>
>>> ## Implementation
>>>
>>> ### Configuration of quotas
>>>
>>> In order to implement rate limiting per namespace or tenant, we’re
>>> going to introduce the concept of a “ResourceGroup”. A ResourceGroup
>>> is defined as a grouping of different rate limit quotas, and it can
>>> be associated with different resources, for example a Pulsar tenant or
>>> a Pulsar namespace to start with.
>>>
>>> In addition to rate limiting (in bytes/s and msg/s) for producers and
>>> consumers, the configuration of the ResourceGroup might also contain
>>> additional quotas in the future, such as a storage quota, although
>>> that is outside the scope of this proposal.
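>>>
>>> As a sketch, a ResourceGroup could be modeled roughly as follows (all
>>> field names here are illustrative assumptions, not a final API):
>>>
>>> ```
>>> // Illustrative model only: the concrete shape of a ResourceGroup is
>>> // not fixed by this proposal.
>>> public class ResourceGroup {
>>>     public String name;               // e.g. "rg-gold"
>>>     public long publishRateInMsgs;    // producer-side msgs/s for the group
>>>     public long publishRateInBytes;   // producer-side bytes/s for the group
>>>     public long dispatchRateInMsgs;   // consumer-side msgs/s for the group
>>>     public long dispatchRateInBytes;  // consumer-side bytes/s for the group
>>>     // Future quotas (e.g. storage) could be added here later.
>>> }
>>> ```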
>>>
>>> ### Enforcement
>>>
>>> In order to enforce the limit over the several topics that belong to
>>> a particular namespace, we need multiple brokers to cooperate with
>>> each other through a feedback mechanism. With this, each broker will
>>> be able to know, within the scope of a particular ResourceGroup, how
>>> much of the quota is currently being used by the other brokers.
>>>
>>> Each broker will then make sure that the available quota is split
>>> optimally between the brokers that are requesting it.
>>>
>>> Note: Pulsar currently supports topic/partition-level rate limiting.
>>> If that is configured along with the new namespace-wide rate limiting
>>> using resource groups, then both configurations will be enforced at
>>> the broker level, so the more stringent of the two will take effect.
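>>>
>>> A sketch (not actual enforcement code; names are illustrative):
>>>
>>> ```
>>> // With both limiters active, the effective rate at a broker is the
>>> // lower (more stringent) of the two.
>>> public final class EffectiveRateSketch {
>>>     static double effectiveRate(double topicLevelLimit, double groupShareForTopic) {
>>>         return Math.min(topicLevelLimit, groupShareForTopic);
>>>     }
>>> }
>>> ```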
>>>
>>> At some point in the future it would be good to make the
>>> topic/partition quota configuration fit within the namespace-level
>>> rate limiter and become more self-explanatory. At that point the old
>>> configuration could be deprecated over time; that is not in the scope
>>> of this feature, though.
>>>
>>>
>>> #### Communications between brokers
>>>
>>> Brokers will talk to each other using a regular Pulsar topic. For the
>>> purposes of this feature, a non-persistent topic is the ideal choice:
>>> it has minimal resource requirements and always gives the latest data
>>> value. We can mostly ignore data losses, treating them as an
>>> “undercounting” event that will lead to exceeding the quota for a
>>> brief amount of time.
>>>
>>> Each broker will publish its current actual usage, as an absolute
>>> number, for each of the ResourceGroups that currently have traffic
>>> and for which the traffic has changed significantly since the last
>>> time it was reported (e.g., ±10%). Each broker will also use these
>>> updates to keep track of which brokers are communicating on various
>>> ResourceGroups; hence, each broker that is active on a ResourceGroup
>>> must report its usage at least once every N cycles (the value of N
>>> may be configurable), even if its traffic has not changed
>>> significantly.
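>>>
>>> As a sketch (the ±10% threshold and the forced-report period N come
>>> from the text above; everything else is illustrative):
>>>
>>> ```
>>> // Per-ResourceGroup reporting decision, as described in the text above.
>>> public final class UsageReportPolicy {
>>>     static final double CHANGE_THRESHOLD = 0.10; // report on a +/-10% change
>>>     static final int MAX_QUIET_CYCLES = 5;       // "N", assumed configurable
>>>
>>>     static boolean shouldReport(double lastReported, double current, int quietCycles) {
>>>         if (quietCycles >= MAX_QUIET_CYCLES) {
>>>             return true; // mandatory periodic report, even if unchanged
>>>         }
>>>         if (lastReported == 0) {
>>>             return current > 0; // first traffic on this group
>>>         }
>>>         double relativeChange = Math.abs(current - lastReported) / lastReported;
>>>         return relativeChange >= CHANGE_THRESHOLD;
>>>     }
>>> }
>>> ```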
>>>
>>> The update will be in the form of a ProtocolBuffer message published
>>> on the internal topic. The format of the update will look like this:
>>>
>>> ```
>>> {
>>>    broker : “broker-1.example.com”,
>>>    usage : {
>>>       “tenant-1/ns1” : {
>>>          topics: 1,
>>>          publishedMsg : 100,
>>>          publishedBytes : 100000,
>>>       },
>>>       “tenant-1/ns2” : {
>>>          topics: 1,
>>>          publishedMsg : 1000,
>>>          publishedBytes : 500000,
>>>       },
>>>       “tenant-2” : {
>>>          topics: 1,
>>>          publishedMsg : 80000,
>>>          publishedBytes : 9999999,
>>>       },
>>>    }
>>> }
>>> ```
>>>
>>> Each broker will use a Pulsar reader on the topic and will receive
>>> every update from other brokers. These updates will get inserted into
>>> a hash map:
>>>
>>> ```
>>> Map<ResourceGroup, Map<BrokerName, Usage>>
>>> ```
>>>
>>> With this, each broker will be aware of the actual usage of each
>>> ResourceGroup, broker by broker. It will then adjust the rate on a
>>> local in-memory rate limiter, in the same way we’re currently doing
>>> per-topic rate limiting.
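>>>
>>> A minimal sketch of that reader loop, assuming a hypothetical
>>> `UsageReport` decoder for the ProtocolBuffer payload (the topic name
>>> is also illustrative):
>>>
>>> ```
>>> import java.util.Map;
>>> import java.util.concurrent.ConcurrentHashMap;
>>>
>>> import org.apache.pulsar.client.api.Message;
>>> import org.apache.pulsar.client.api.MessageId;
>>> import org.apache.pulsar.client.api.PulsarClient;
>>> import org.apache.pulsar.client.api.Reader;
>>>
>>> public class UsageAggregator {
>>>     // Map<ResourceGroup, Map<BrokerName, Usage>>, keyed by name here
>>>     final Map<String, Map<String, Long>> usage = new ConcurrentHashMap<>();
>>>
>>>     // Hypothetical decoded form of the ProtocolBuffer update shown above.
>>>     record UsageReport(String broker, Map<String, Long> perGroupUsage) {
>>>         static UsageReport parse(byte[] data) {
>>>             throw new UnsupportedOperationException("protobuf decoding omitted");
>>>         }
>>>     }
>>>
>>>     void run(PulsarClient client) throws Exception {
>>>         Reader<byte[]> reader = client.newReader()
>>>                 .topic("non-persistent://pulsar/system/resource-usage") // illustrative
>>>                 .startMessageId(MessageId.latest)
>>>                 .create();
>>>         while (true) {
>>>             Message<byte[]> msg = reader.readNext();
>>>             UsageReport report = UsageReport.parse(msg.getData());
>>>             report.perGroupUsage().forEach((group, used) ->
>>>                 usage.computeIfAbsent(group, g -> new ConcurrentHashMap<>())
>>>                      .put(report.broker(), used));
>>>         }
>>>     }
>>> }
>>> ```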
>>>
>>> Example of usage distribution for a given ResourceGroup with a quota
>>> of 100. Let’s assume that the quota-assignment of 100 to this
>>> ResourceGroup is known to all the brokers (through configuration not
>>> shown here).
>>>
>>>  * broker-1: 10
>>>  * broker-2: 50
>>>  * broker-3: 30
>>>
>>> In this case, each broker will adjust its own local limit to utilize
>>> the remaining 10 units. The simplest option is for each broker to add
>>> the entire remaining 10 units to its own limit:
>>>
>>>  * broker-1 : 20
>>>  * broker-2: 60
>>>  * broker-3: 40
>>>
>>> In the short term, this will lead to exceeding the set quota, but it
>>> will quickly converge, in just a few cycles, to the fair values.
>>>
>>> Alternatively, each broker may split up the 10 units proportionally,
>>> based on historic usage (so they would use 1/9th, 5/9ths, and 1/3rd
>>> of the residual 10 units, respectively):
>>>
>>>  * broker-1 : 11.11
>>>  * broker-2: 55.56
>>>  * broker-3: 33.33
>>>
>>> The opposite would happen (each broker would reduce its limit by the
>>> corresponding fractional amount) if the recent usage was over the
>>> quota assigned to the resource group.
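>>>
>>> A sketch of that proportional adjustment (plain arithmetic; names are
>>> illustrative):
>>>
>>> ```
>>> import java.util.Map;
>>> import java.util.stream.Collectors;
>>>
>>> public final class ProportionalAdjust {
>>>     // Distribute the residual quota (positive or negative) across brokers
>>>     // in proportion to each broker's recently reported usage.
>>>     static Map<String, Double> adjust(Map<String, Double> usage, double quota) {
>>>         double total = usage.values().stream().mapToDouble(Double::doubleValue).sum();
>>>         if (total == 0) {
>>>             return usage; // no recent traffic: nothing to redistribute
>>>         }
>>>         double residual = quota - total; // negative when over quota
>>>         return usage.entrySet().stream().collect(Collectors.toMap(
>>>                 Map.Entry::getKey,
>>>                 e -> e.getValue() + residual * e.getValue() / total));
>>>     }
>>>
>>>     public static void main(String[] args) {
>>>         // Usage 10/50/30 against a quota of 100 => limits ~11.11/55.56/33.33
>>>         System.out.println(adjust(
>>>                 Map.of("broker-1", 10.0, "broker-2", 50.0, "broker-3", 30.0), 100));
>>>     }
>>> }
>>> ```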
>>>
>>> In a similar way, brokers will try to “steal” part of the quota when
>>> another broker is using a bigger portion. For example, consider the
>>> following usage report map:
>>>
>>>  * broker-1: 80
>>>  * broker-2: 20
>>>
>>> Broker-2 has its rate limiter set to 20, and that also reflects its
>>> actual usage, which could simply mean that broker-2 is being unfairly
>>> throttled. Since broker-1 is dominant in the usage map, broker-2 will
>>> set its local limiter to a value that is higher than 20, for example
>>> half-way to the next broker, in this case to `20 + (80 - 20)/2 = 50`.
>>>
>>> If broker-2 indeed has more demand for traffic, its usage will
>>> increase to 30 in the next update, which will consequently trigger
>>> broker-1 to reduce its limit to 70. This step-by-step process will
>>> continue until it converges to the equilibrium point.
>>>
>>> Generalizing to the N-broker case, the broker with the lowest quota
>>> will steal part of the quota of the most dominant broker, the broker
>>> with the second-lowest quota will steal part of the quota of the
>>> second most dominant broker, and so on, until all brokers converge to
>>> the equilibrium point.
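>>>
>>> One possible form of that stealing step, as a sketch:
>>>
>>> ```
>>> public final class StealSketch {
>>>     // A broker whose local limit equals its actual usage raises its limit
>>>     // half-way toward the usage of the most dominant broker.
>>>     static double stealStep(double myLimit, double dominantUsage) {
>>>         return myLimit + (dominantUsage - myLimit) / 2; // 20 + (80-20)/2 = 50
>>>     }
>>> }
>>> ```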
>>>
>>> #### Goal
>>>
>>> Whenever an event that influences the quota allocation occurs (a
>>> broker, producer, or consumer joins or leaves), the quota adjustment
>>> step function needs to converge the quotas to stable allocations in a
>>> minimal number of iterations, while also ensuring that:
>>>
>>>  * The adjustment curve should be smooth instead of being jagged.
>>>  * The quota is not under-utilized.
>>>    - For example, if the quota is 100 and there are two brokers, with
>>>      broker-1 allocated 70 and broker-2 allocated 30: if broker-1's
>>>      demand is 80 while broker-2's usage is only 20, the design must
>>>      allow broker-1 to grow into the unused quota rather than leave it
>>>      under-utilized.
>>>  * Fairness of quota allocation across brokers.
>>>    - For example, if the quota is 100 and both brokers are seeing a
>>>      uniform load of, say, 70, one broker should not end up allocated
>>>      70 while the other gets only 30.
>>>
>>>
>>> #### Clearing up stale broker entries
>>>
>>> Brokers only send updates on the common topic if there are
>>> significant changes, or if they have not reported for a (configurable)
>>> number of rounds due to unchanged usage. This minimizes the amount of
>>> traffic on the common topic and the work needed to process it.
>>>
>>> When a broker publishes an update with a usage of 0, everyone will
>>> remove that broker from the usage map. In the same way, when brokers
>>> detect, through the ZooKeeper registration, that a broker went down,
>>> they will clear that broker from all the usage maps.
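>>>
>>> Both cleanup paths, as a sketch over the usage map defined earlier
>>> (method names are illustrative):
>>>
>>> ```
>>> import java.util.Map;
>>>
>>> public final class StaleEntryCleanup {
>>>     // usage is the Map<ResourceGroup, Map<BrokerName, Usage>> from above.
>>>     static void onUsageReport(Map<String, Map<String, Long>> usage,
>>>                               String group, String broker, long reported) {
>>>         Map<String, Long> perBroker = usage.get(group);
>>>         if (perBroker == null) {
>>>             return;
>>>         }
>>>         if (reported == 0) {
>>>             perBroker.remove(broker); // explicit zero: drop the entry
>>>         } else {
>>>             perBroker.put(broker, reported);
>>>         }
>>>     }
>>>
>>>     // Invoked when the ZooKeeper registration shows a broker went down.
>>>     static void onBrokerDown(Map<String, Map<String, Long>> usage, String broker) {
>>>         usage.values().forEach(perBroker -> perBroker.remove(broker));
>>>     }
>>> }
>>> ```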
>>>
>>> #### Rate limiting across topics/partitions
>>>
>>> For each tenant/namespace that a broker is managing, the usage it
>>> reports is the aggregate of the usages of all of its
>>> topics/partitions. Therefore, the quota adjustment function will
>>> divide the quota proportionally (taking the usages reported by other
>>> brokers into consideration). Within the quota allocated to it, the
>>> broker can choose to either sub-divide evenly across its own
>>> topics/partitions or sub-divide proportionally to the usage of each
>>> topic/partition.
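>>>
>>> For instance, the proportional sub-division could look like this
>>> sketch (names are illustrative):
>>>
>>> ```
>>> import java.util.Map;
>>> import java.util.stream.Collectors;
>>>
>>> public final class TopicSubdivision {
>>>     // Split the broker's allocated share of a ResourceGroup quota across
>>>     // its local topics/partitions, proportional to their recent usage.
>>>     static Map<String, Double> subdivide(double brokerShare,
>>>                                          Map<String, Double> topicUsage) {
>>>         double total = topicUsage.values().stream().mapToDouble(Double::doubleValue).sum();
>>>         if (total == 0) { // no traffic yet: fall back to an even split
>>>             double even = brokerShare / topicUsage.size();
>>>             return topicUsage.keySet().stream()
>>>                     .collect(Collectors.toMap(t -> t, t -> even));
>>>         }
>>>         return topicUsage.entrySet().stream().collect(Collectors.toMap(
>>>                 Map.Entry::getKey,
>>>                 e -> brokerShare * e.getValue() / total));
>>>     }
>>> }
>>> ```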
>>>
>>>
>>> ### Resource consumption considerations
>>>
>>> With the proposed implementation, each broker will keep a full map of
>>> all the resource groups and their usage, broker by broker. The amount
>>> of memory/bandwidth consumption will depend on several factors, such
>>> as the number of namespaces, brokers, etc. Below is an approximate
>>> estimate for one type of scaled scenario [where the quotas are
>>> enforced at the namespace level].
>>>
>>> In this scenario, let’s consider:
>>>  * 100000 namespaces
>>>  * 100 brokers.
>>>
>>> Each namespace is spread across 5 brokers, so each broker is managing
>>> 100000 * 5 / 100 = 5000 namespaces.
>>>
>>> #### Memory
>>>
>>> For a given namespace, each broker stores usage from the 5 brokers
>>> serving it (including itself).
>>>
>>>  * Size of usage = 16 bytes (two 8-byte counters: bytes + messages)
>>>  * Size of usage for publish + consume = 32 bytes
>>>  * For one namespace, usage from 5 brokers = 32*5 = 160 bytes
>>>  * For 5000 namespaces: 32*5*5000 = 800K
>>>
>>> Metadata overhead [assuming that a namespace name is about 80 bytes
>>> and a broker name is 40 bytes]: `5000*80 + 5*40 ~= 400K bytes`.
>>>
>>> Total memory requirement ~= 1.2MB (800K + 400K).
>>>
>>> #### Bandwidth
>>>
>>> Each broker sends the usage for the namespaces that it manages.
>>>
>>>  * Size of usage = 16 bytes
>>>  * Size of usage for publish + consume = 32 bytes
>>>  * For 5000 namespaces, each broker publishes periodically (say, every
>>>    minute): 32*5000 = 160K bytes.
>>>  * Metadata overhead [assuming a broker name is 40 bytes and a
>>>    namespace name is 80 bytes]: 5000*80 = 400K.
>>>  * For 100 brokers: (160K + 400K) * 100 = 56MB.
>>>
>>> So, the publish-side network bandwidth is 56MB per reporting interval.
>>> Including the consumption side (each of the 100 brokers reads every
>>> update), it is 56MB * 100 = 5.6GB.
>>>
>>> A few optimizations that can reduce the overall bandwidth:
>>>
>>>  * Brokers publish usage only if the usage is significantly different
>>>    from the previous update.
>>>  * Use compression (trade-off is higher CPU).
>>>  * Publish to more than one system topic (so the network load gets
>>>    distributed across brokers).
>>>  * Since metadata changes [ namespace/tenant addition/removal ] are
>>>    rare, publish the metadata update [namespace name to ID mapping]
>>>    only when there is a change. The Usage report will carry the
>>>    namespace/tenant ID instead of the name.
>>>
>>>
>>> #### Persistent storage
>>>
>>> The usage data doesn’t require persistent storage. So, there is no
>>> persistent storage overhead.
>>>
>>>
>>> ## Alternative approaches
>>>
>>> ### External database
>>>
>>> One approach is to use a distributed DB (such as Cassandra, Redis,
>>> Memcached, etc.) to store the quota counter and usage counters. Quota
>>> reservation/refresh can then be simple increment/decrement operations
>>> on the quota counter. The approach may seem reasonable, but it has a
>>> few issues:
>>>
>>>  * Atomic increment/decrement operations on the distributed counter
>>>    can incur significant latency overhead.
>>>  * Dependency on external systems has a very high operational cost: it
>>>    is yet another cluster that needs to be deployed, maintained,
>>>    updated, etc.
>>>
>>> ### Centralized implementation
>>>
>>> One broker is designated as the leader for a given resource group,
>>> and the quota allocation is computed by that broker. The allocation is
>>> then disseminated to the other brokers. The “Exclusive Producer”
>>> feature can be used for this purpose. This approach has a few issues:
>>>
>>>  * More complex implementation because of leader election.
>>>  * If the leader dies, another needs to be elected, which can add
>>>    unnecessary latency.
>>>
>>> ### Using ZooKeeper
>>>
>>> Another possible approach would have been for the brokers to exchange
>>> usage information through ZooKeeper. This was not pursued because of
>>> perceived scaling issues: with a large number of brokers and/or
>>> namespaces/tenants to rate-limit, the size of messages and the high
>>> volume of writes into ZooKeeper could become a problem. Since
>>> ZooKeeper is already used in certain parts of Pulsar, it was decided
>>> that we should not burden that subsystem with more work.
>>>
>>>
>>>
>>>
>>> --
>>> Matteo Merli
>>> <[email protected]>
>>>
>>
