Hi Gokul,

Thank you for the answers and the data provided to illustrate the use case.
A few additional questions.

904. If multi-tenancy is addressed in a future KIP, how smooth would
the upgrade path be? For example, would the configuration parameters
introduced here still apply, and would topic creation still follow a
first-come, first-served pattern against the limits?

905. The current built-in assignment algorithm prioritises balance
across racks over balance across brokers. In the version you propose,
the limit on partition count would take precedence over the attempt
to balance across racks. Could this result in all partitions of a
topic being assigned within a single data center, if the brokers in
the other racks are "full"? Since that could weaken the availability
guarantees for the topic (and possibly durability and/or consumer
performance, due to additional cross-rack traffic), how would we want
to handle this case? It may be worth warning users in such cases that
the resulting guarantees differ from those offered by an "unlimited"
assignment plan. Also, let's keep in mind that some plans generated
by existing rebalancing tools could become invalid with respect to
the configured limits.
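
To make the scenario concrete, below is a minimal, hypothetical
placement sketch in Java. It is not the actual
AdminUtils.assignReplicasToBrokers logic, and the broker IDs, racks
and counts are invented; it only illustrates how a hard per-broker
cap, when checked before rack balance, can push every replica of a
new partition into whichever rack still has spare capacity:

import java.util.*;

public class CappedAssignmentSketch {

    record Broker(int id, String rack, int partitionCount) {}

    // Toy placement: prefer one replica per rack, but never use a broker
    // that has already reached maxBrokerPartitions.
    static List<Integer> placeReplicas(List<Broker> brokers, int replicationFactor,
                                       int maxBrokerPartitions) {
        Map<Integer, Integer> load = new HashMap<>();
        brokers.forEach(b -> load.put(b.id(), b.partitionCount()));
        List<Integer> chosen = new ArrayList<>();
        Set<String> usedRacks = new HashSet<>();
        // Pass 1: one replica per rack, capacity permitting.
        for (Broker b : brokers) {
            if (chosen.size() == replicationFactor) break;
            if (!usedRacks.contains(b.rack()) && load.get(b.id()) < maxBrokerPartitions) {
                chosen.add(b.id());
                usedRacks.add(b.rack());
                load.merge(b.id(), 1, Integer::sum);
            }
        }
        // Pass 2: fill remaining replicas wherever capacity is left, even if
        // that reuses a rack.
        for (Broker b : brokers) {
            if (chosen.size() == replicationFactor) break;
            if (!chosen.contains(b.id()) && load.get(b.id()) < maxBrokerPartitions) {
                chosen.add(b.id());
                load.merge(b.id(), 1, Integer::sum);
            }
        }
        return chosen; // may end up single-rack, or short of replicationFactor
    }

    public static void main(String[] args) {
        List<Broker> brokers = List.of(
                new Broker(1, "rack-a", 1000), // at max.broker.partitions
                new Broker(2, "rack-b", 1000), // at max.broker.partitions
                new Broker(3, "rack-c", 100),
                new Broker(4, "rack-c", 100));
        System.out.println(placeReplicas(brokers, 2, 1000)); // prints [3, 4]
    }
}

If the intended behaviour is instead to fail the creation request
rather than collapse onto one rack, stating that explicitly in the
KIP would answer this question.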

906. The limits do not apply to internal topics. What about
framework-level topics created by other tools and extensions
(Connect, Streams, Confluent metrics, tiered storage, etc.)? Would it
be possible to blacklist (i.e. exempt) such topics from the limits?
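
For illustration only, the exemption could be as simple as a prefix
match driven by a configuration such as limit.exempt.topic.prefixes;
that configuration name and the prefixes below are hypothetical and
not part of KIP-578:

import java.util.List;

public class ExemptionSketch {

    // Hypothetical operator-supplied prefixes; none of these names come
    // from KIP-578 itself, they are only examples of what could be listed.
    static final List<String> EXEMPT_PREFIXES = List.of(
            "__",            // Kafka internal topics, e.g. __consumer_offsets
            "connect-",      // Connect offsets/configs/status topics
            "_confluent-");  // Confluent metrics / tiered-storage topics

    // Returns true if the topic should not count towards the partition limits.
    static boolean isExemptFromLimits(String topic) {
        return EXEMPT_PREFIXES.stream().anyMatch(topic::startsWith);
    }

    public static void main(String[] args) {
        System.out.println(isExemptFromLimits("__consumer_offsets")); // true
        System.out.println(isExemptFromLimits("orders"));             // false
    }
}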

907. What happens if one of the dynamic limits is already violated at
update time, for example if max.broker.partitions is lowered below
the number of partitions a broker currently hosts? (Sorry if this is
already explained in the KIP; I may have missed it.)
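
To make sure we are talking about the same scenario, here is a
hypothetical validation sketch for a dynamic update of
max.broker.partitions; nothing in it is prescribed by the KIP, it
just spells out the two behaviours I could imagine (reject the
update, or accept it and only enforce it going forward):

public class LimitUpdateSketch {

    // Hypothetical check a broker could run when max.broker.partitions is
    // dynamically lowered. Only illustrates question 907.
    static void validateNewBrokerLimit(int newLimit, int currentPartitionsOnBroker) {
        if (currentPartitionsOnBroker > newLimit) {
            // Option A: reject the update outright (shown here).
            // Option B: accept it, and only block future creations/reassignments.
            throw new IllegalArgumentException(
                "Broker already hosts " + currentPartitionsOnBroker
                    + " partitions, above the requested limit of " + newLimit);
        }
    }

    public static void main(String[] args) {
        validateNewBrokerLimit(1000, 800); // fine
        validateNewBrokerLimit(500, 800);  // throws: limit already violated
    }
}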

Thanks,
Alexandre

On Sun, 3 May 2020 at 20:20, Gokul Ramanan Subramanian
<gokul24...@gmail.com> wrote:
>
> Thanks Stanislav. Apologies about the long absence from this thread.
>
> I would prefer having per-user max partition limits in a separate KIP. I
> don't see this as an MVP for this KIP. I will add this as an alternative
> approach into the KIP.
>
> I was in two minds about whether or not to impose the partition limit
> for internal topics as well. I can be convinced both ways. On the one hand,
> internal topics should be purely internal i.e. users of a cluster should
> not have to care about them. In this sense, the partition limit should not
> apply to internal topics. On the other hand, Kafka allows configuring
> internal topics by specifying their replication factor etc. Therefore, they
> don't feel all that internal to me. In any case, I'll modify the KIP to
> exclude internal topics.
>
> I'll also add to the KIP the alternative approach Tom suggested around
> using topic policies to limit partitions, and explain why it does not help
> to solve the problem that the KIP is trying to address (as I have done in a
> previous correspondence on this thread).
>
> Cheers.
>
> On Fri, Apr 24, 2020 at 4:24 PM Stanislav Kozlovski <stanis...@confluent.io>
> wrote:
>
> > Thanks for the KIP, Gokul!
> >
> > I like the overall premise - I think it's more user-friendly to have
> > configs for this than to have users implement their own config policy -> so
> > unless it's very complex to implement, it seems worth it.
> > I agree that having the topic policy on the CreatePartitions path makes
> > sense as well.
> >
> > Multi-tenancy was a good point. It would be interesting to see how easy it
> > is to extend the max partition limit to a per-user basis. Perhaps this can
> > be done in a follow-up KIP, as a natural extension of the feature.
> >
> > I'm wondering whether there's a need to enforce this on internal topics,
> > though. Given they're internal and critical to the function of Kafka, I
> > believe we'd rather always ensure they're created, regardless of whether
> > that exceeds some user-set limit. It brings up the question of forward
> > compatibility - what happens if a user's cluster is at the maximum
> > partition capacity, yet a new release of Kafka introduces a new topic
> > (e.g. KIP-500)?
> >
> > Best,
> > Stanislav
> >
> > On Fri, Apr 24, 2020 at 2:39 PM Gokul Ramanan Subramanian <
> > gokul24...@gmail.com> wrote:
> >
> > > Hi Tom.
> > >
> > > With KIP-578, we are not trying to model the load on each partition, and
> > > come up with an exact limit on what the cluster or broker can handle in
> > > terms of number of partitions. We understand that not all partitions are
> > > equal, and the actual load per partition varies based on the message
> > > size, throughput, whether the broker is a leader for that partition or
> > > not, etc.
> > >
> > > What we are trying to achieve with KIP-578 is to disallow a pathological
> > > number of partitions that will surely put the cluster in bad shape. For
> > > example, in KIP-578's appendix, we have described a case where we could
> > > not delete a topic with 30k partitions, because the brokers could not
> > > handle all the work that needed to be done. We have also described how
> > > a producer performance test with 10k partitions observed basically 0
> > > throughput. In these cases, having a limit on number of partitions
> > > would allow the cluster to produce a graceful error message at topic
> > > creation time, and prevent the cluster from entering a pathological
> > > state.
> > > These are not just hypotheticals. We definitely see many of these
> > > pathological cases happen in production, and we would like to avoid them.
> > >
> > > The actual limit on number of partitions is something we do not want to
> > > suggest in the KIP. The limit will depend on various tests that owners of
> > > their clusters will have to perform, including perf tests, identifying
> > > topic creation / deletion times, etc. For example, the tests we did for
> > > the KIP-578 appendix were enough to convince us that we should not have
> > > anywhere close to 10k partitions on the setup we describe there.
> > >
> > > What we want to do with KIP-578 is provide the flexibility to set a limit
> > > on number of partitions based on tests cluster owners choose to perform.
> > > Cluster owners can do the tests however often they wish and dynamically
> > > adjust the limit on number of partitions. For example, we found in our
> > > production environment that we don't want to have more than 1k partitions
> > > on an m5.large EC2 broker instance, or more than 300 partitions on a
> > > t3.medium EC2 broker, for typical produce / consume use cases.
> > >
> > > Cluster owners are free to not configure the limit on number of
> > > partitions if they don't want to spend the time coming up with a limit.
> > > The limit defaults to INT32_MAX, which is basically infinity in this
> > > context, and should be practically backwards compatible with current
> > > behavior.
> > >
> > > Further, the limit on number of partitions should not come in the way of
> > > rebalancing tools under normal operation. For example, if the partition
> > > limit per broker is set to 1k, unless the number of partitions comes
> > > close to 1k, there should be no impact on rebalancing tools. Only when
> > > the number of partitions comes close to 1k will rebalancing tools be
> > > impacted, but at that point, the cluster is already at its limit of
> > > functioning (per some definition that was used to set the limit in the
> > > first place).
> > >
> > > Finally, I want to end this long email by suggesting that the partition
> > > assignment algorithm itself does not consider the load on various
> > > partitions before assigning partitions to brokers. In other words, it
> > > treats all partitions as equal. The idea of having a limit on number of
> > > partitions is not mis-aligned with this tenet.
> > >
> > > Thanks.
> > >
> > > On Tue, Apr 21, 2020 at 9:39 AM Tom Bentley <tbent...@redhat.com> wrote:
> > >
> > > > Hi Gokul,
> > > >
> > > > > the partition assignment algorithm needs to be aware of the
> > > > > partition limits.
> > > > >
> > > >
> > > > I agree, if you have limits then anything doing reassignment would need
> > > > some way of knowing what they were. But the thing is that I'm not really
> > > > sure how you would decide what the limits ought to be.
> > > >
> > > >
> > > > > To illustrate this, imagine that you have 3 brokers (1, 2 and 3),
> > > > > with 10, 20 and 30 partitions each respectively, and a limit of 40
> > > > > partitions on each broker enforced via a configurable policy class
> > > > > (the one you recommended). While the policy class may accept a topic
> > > > > creation request for 11 partitions with a replication factor of 2
> > > > > each (because it is satisfiable), the non-pluggable partition
> > > > > assignment algorithm (in AdminUtils.assignReplicasToBrokers and a
> > > > > few other places) has to know not to assign the 11th partition to
> > > > > broker 3 because it would run out of partition capacity otherwise.
> > > > >
> > > >
> > > > I know this is only a toy example, but I think it also serves to
> > > > illustrate my point above. How has a limit of 40 partitions been
> > > > arrived at? In real life different partitions will impart a different
> > > > load on a broker, depending on all sorts of factors (which topics
> > > > they're for, the throughput and message size for those topics, etc).
> > > > By saying that a broker should not have more than 40 partitions
> > > > assigned I think you're making a big assumption that all partitions
> > > > have the same weight. You're also limiting the search space for
> > > > finding an acceptable assignment. Cluster balancers usually use some
> > > > kind of heuristic optimisation algorithm for figuring out assignments
> > > > of partitions to brokers, and it could be that the best (or at least
> > > > a good enough) solution requires assigning the least loaded 41
> > > > partitions to one broker.
> > > >
> > > > The point I'm trying to make here is whatever limit is chosen it's
> > > > probably been chosen fairly arbitrarily. Even if it's been chosen
> > > > based on some empirical evidence of how a particular cluster behaves
> > > > it's likely that that evidence will become obsolete as the cluster
> > > > evolves to serve the needs of the business running it (e.g. some hot
> > > > topic gets repartitioned, messages get compressed with some new
> > > > algorithm, some new topics need to be created). For this reason I
> > > > think the problem you're trying to solve via policy (whether that was
> > > > implemented in a pluggable way or not) is really better solved by
> > > > automating the cluster balancing and having that cluster balancer be
> > > > able to reason about when the cluster has too few brokers for the
> > > > number of partitions, rather than placing some limit on the sizing
> > > > and shape of the cluster up front and then hobbling the cluster
> > > > balancer to work within that.
> > > >
> > > > I think it might be useful to describe in the KIP how users would be
> > > > expected to arrive at values for these configs (both on day 1 and in
> > > > an evolving production cluster), when this solution might be better
> > > > than using a cluster balancer and/or why cluster balancers can't be
> > > > trusted to avoid overloading brokers.
> > > >
> > > > Kind regards,
> > > >
> > > > Tom
> > > >
> > > >
> > > > On Mon, Apr 20, 2020 at 7:22 PM Gokul Ramanan Subramanian <
> > > > gokul24...@gmail.com> wrote:
> > > >
> > > > > This is a good reference, Tom. I did not consider this approach at
> > > > > all. I am happy to learn about it now.
> > > > >
> > > > > However, I think that partition limits are not "yet another" policy
> > > > > configuration. Instead, they are fundamental to partition
> > > > > assignment, i.e. the partition assignment algorithm needs to be
> > > > > aware of the partition limits. To illustrate this, imagine that you
> > > > > have 3 brokers (1, 2 and 3), with 10, 20 and 30 partitions each
> > > > > respectively, and a limit of 40 partitions on each broker enforced
> > > > > via a configurable policy class (the one you recommended). While the
> > > > > policy class may accept a topic creation request for 11 partitions
> > > > > with a replication factor of 2 each (because it is satisfiable), the
> > > > > non-pluggable partition assignment algorithm (in
> > > > > AdminUtils.assignReplicasToBrokers and a few other places) has to
> > > > > know not to assign the 11th partition to broker 3 because it would
> > > > > run out of partition capacity otherwise.
> > > > >
> > > > > To achieve the ideal end that you are imagining (and I can totally
> > > > > understand where you are coming from vis-a-vis the extensibility of
> > > > > your solution wrt the one in the KIP), that would require extracting
> > > > > the partition assignment logic itself into a pluggable class, for
> > > > > which we could provide a custom implementation. I am afraid that
> > > > > would add complexity that I am not sure we want to undertake.
> > > > >
> > > > > Do you see sense in what I am saying?
> > > > >
> > > > > Thanks.
> > > > >
> > > > > On Mon, Apr 20, 2020 at 3:59 PM Tom Bentley <tbent...@redhat.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Gokul,
> > > > > >
> > > > > > Leaving aside the question of how Kafka scales, I think the
> > > > > > proposed solution, limiting the number of partitions in a cluster
> > > > > > or per-broker, is a policy which ought to be addressable via the
> > > > > > pluggable policies (e.g. create.topic.policy.class.name).
> > > > > > Unfortunately, although there's a policy for topic creation, it's
> > > > > > currently not possible to enforce a policy on partition increase.
> > > > > > It would be more flexible to be able to enforce this kind of thing
> > > > > > via a pluggable policy, and it would also avoid the situation
> > > > > > where different people each want to have a config which addresses
> > > > > > some specific use case or problem that they're experiencing.
> > > > > >
> > > > > > Quite a while ago I proposed KIP-201 to solve this issue with
> > > > > > policies being easily circumvented, but it didn't really make any
> > > > > > progress. I've looked at it again in some detail more recently and
> > > > > > I think something might be possible following the work to make all
> > > > > > ZK writes happen on the controller.
> > > > > >
> > > > > > Of course, this is just my take on it.
> > > > > >
> > > > > > Kind regards,
> > > > > >
> > > > > > Tom
> > > > > >
> > > > > > On Thu, Apr 16, 2020 at 11:47 AM Gokul Ramanan Subramanian <
> > > > > > gokul24...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi.
> > > > > > >
> > > > > > > For the sake of expediting the discussion, I have created a
> > > > > > > prototype PR: https://github.com/apache/kafka/pull/8499.
> > > > > > > Eventually, (if and) when the KIP is accepted, I'll modify this
> > > > > > > to add the full implementation and tests etc. in there.
> > > > > > >
> > > > > > > Would appreciate it if a Kafka committer could share their
> > > > > > > thoughts, so that I can more confidently start the voting
> > > > > > > thread.
> > > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > > > On Thu, Apr 16, 2020 at 11:30 AM Gokul Ramanan Subramanian <
> > > > > > > gokul24...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Thanks for your comments Alex.
> > > > > > > >
> > > > > > > > The KIP proposes using two configurations max.partitions and
> > > > > > > > max.broker.partitions. It does not enforce their use. The
> > > > > > > > default values are pretty large (INT MAX), therefore should be
> > > > > > > > non-intrusive.
> > > > > > > >
> > > > > > > > In multi-tenant environments and in partition assignment and
> > > > > > > > rebalancing, the admin could (a) use the default values, which
> > > > > > > > would yield similar behavior to now, (b) set very high values
> > > > > > > > that they know are sufficient, or (c) dynamically re-adjust the
> > > > > > > > values should the business requirements change. Note that the
> > > > > > > > two configurations are cluster-wide, so they can be updated
> > > > > > > > without restarting the brokers.
> > > > > > > >
> > > > > > > > The quota system in Kafka seems to be geared towards limiting
> > > > > > > > traffic for specific clients or users, or in the case of
> > > > > > > > replication, to leaders and followers. The quota configuration
> > > > > > > > itself is very similar to the one introduced in this KIP, i.e.
> > > > > > > > just a few configuration options to specify the quota. The
> > > > > > > > main difference is that the quota system is far more
> > > > > > > > heavy-weight because it needs to be applied to traffic that is
> > > > > > > > flowing in/out constantly, whereas in this KIP we want to
> > > > > > > > limit the number of partition replicas, which by comparison
> > > > > > > > gets modified rarely in a typical cluster.
> > > > > > > >
> > > > > > > > Hope this addresses your comments.
> > > > > > > >
> > > > > > > > On Thu, Apr 9, 2020 at 12:53 PM Alexandre Dupriez <
> > > > > > > > alexandre.dupr...@gmail.com> wrote:
> > > > > > > >
> > > > > > > >> Hi Gokul,
> > > > > > > >>
> > > > > > > >> Thanks for the KIP.
> > > > > > > >>
> > > > > > > >> From what I understand, the objective of the new configuration
> > > > > > > >> is to protect a cluster from an overload driven by an excessive
> > > > > > > >> number of partitions, independently from the load handled on
> > > > > > > >> the partitions themselves. As such, the approach uncouples the
> > > > > > > >> data-path load from the number of units of distribution of
> > > > > > > >> throughput, and intends to avoid the degradation of performance
> > > > > > > >> exhibited in the test results provided with the KIP by setting
> > > > > > > >> an upper-bound on that number.
> > > > > > > >>
> > > > > > > >> Couple of comments:
> > > > > > > >>
> > > > > > > >> 900. Multi-tenancy - one concern I would have with a cluster-
> > > > > > > >> and broker-level configuration is that it is possible for a
> > > > > > > >> user to consume a large proportion of the allocatable
> > > > > > > >> partitions within the configured limit, leaving other users
> > > > > > > >> with not enough partitions to satisfy their requirements.
> > > > > > > >>
> > > > > > > >> 901. Quotas - an approach in Apache Kafka to set up an
> > > > > > > >> upper-bound on resource consumption is via client/user quotas.
> > > > > > > >> Could this framework be leveraged to add this limit?
> > > > > > > >>
> > > > > > > >> 902. Partition assignment - one potential problem with the new
> > > > > > > >> repartitioning scheme is that if a subset of brokers have
> > > > > > > >> reached their number of assignable partitions, yet their data
> > > > > > > >> path is under-loaded, new topics and/or partitions will be
> > > > > > > >> assigned exclusively to other brokers, which could increase
> > > > > > > >> the likelihood of data-path load imbalance. Fundamentally, the
> > > > > > > >> isolation of the constraint on the number of partitions from
> > > > > > > >> the data-path throughput can have conflicting requirements.
> > > > > > > >>
> > > > > > > >> 903. Rebalancing - as a corollary to 902, external tools used
> > > > > > > >> to balance ingress throughput may adopt an incremental
> > > > > > > >> approach in partition re-assignment to redistribute load, and
> > > > > > > >> could hit the limit on the number of partitions on a broker
> > > > > > > >> when a (too) conservative limit is used, thereby
> > > > > > > >> over-constraining the objective function and reducing the
> > > > > > > >> migration path.
> > > > > > > >>
> > > > > > > >> Thanks,
> > > > > > > >> Alexandre
> > > > > > > >>
> > > > > > > >> On Thu, 9 Apr 2020 at 00:19, Gokul Ramanan Subramanian
> > > > > > > >> <gokul24...@gmail.com> wrote:
> > > > > > > >> >
> > > > > > > >> > Hi. Requesting you to take a look at this KIP and provide
> > > > > feedback.
> > > > > > > >> >
> > > > > > > >> > Thanks. Regards.
> > > > > > > >> >
> > > > > > > >> > On Wed, Apr 1, 2020 at 4:28 PM Gokul Ramanan Subramanian <
> > > > > > > >> > gokul24...@gmail.com> wrote:
> > > > > > > >> >
> > > > > > > >> > > Hi.
> > > > > > > >> > >
> > > > > > > >> > > I have opened KIP-578, intended to provide a mechanism to
> > > > > > > >> > > limit the number of partitions in a Kafka cluster. Kindly
> > > > > > > >> > > provide feedback on the KIP which you can find at
> > > > > > > >> > >
> > > > > > > >> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-578%3A+Add+configuration+to+limit+number+of+partitions
> > > > > > > >> > >
> > > > > > > >> > > I want to specially thank Stanislav Kozlovski who helped
> > > > > > > >> > > in formulating some aspects of the KIP.
> > > > > > > >> > >
> > > > > > > >> > > Many thanks,
> > > > > > > >> > >
> > > > > > > >> > > Gokul.
> > > > > > > >> > >
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> > Best,
> > Stanislav
> >
