I realized the KIP freeze is on May 20. I will start the voting thread now.
On Mon, May 18, 2020 at 3:19 PM Anna Povzner <a...@confluent.io> wrote:

Thanks everyone for the feedback. I will start a voting thread tomorrow morning if there are no more comments.

Regards,
Anna

On Mon, May 18, 2020 at 2:06 PM Anna Povzner <a...@confluent.io> wrote:

Hi Boyang,

This KIP does not change the protocol with clients. The behavior is the same as with KIP-402, where the broker delays accepting new connections once the limit on the number of connections is reached. This KIP adds another reason for the delay (the rate limit being reached). The same goes for dropping a connection when the per-IP limit is reached, except that this KIP delays the response or may still accept the connection. A client may time out while waiting for the connection and retry.

Thanks,
Anna

On Mon, May 18, 2020 at 12:54 PM Boyang Chen <reluctanthero...@gmail.com> wrote:

Hey Anna,

Thanks for the KIP. Will this change be applied as one type of quota violation, which should be retriable on the client side? In the EOS model before 2.6, the Streams client creates one producer for each input partition, so it is actually possible to create thousands of producers when the service comes up. Just want to clarify what the expected behavior is on the client side.

On Mon, May 18, 2020 at 12:04 PM Anna Povzner <a...@confluent.io> wrote:

Hi Alexandre,

Thanks for your comments. My answers are below:

900. The KIP does not propose any new metrics because we already have metrics that let us monitor connection attempts and the amount of time the broker delays accepting new connections:
1. We have a per-listener (and per-processor) metric for connection creation rate:
kafka.server:type=socket-server-metrics,listener={listener_name},networkProcessor={#},name=connection-creation-rate
2.
We have per-listener metrics that track the amount of time the Acceptor is blocked from accepting connections:
kafka.network:type=Acceptor,name=AcceptorBlockedPercent,listener={listener_name}
Note that adding per-IP JMX metrics may add a lot of overhead, especially for clusters with a large number of clients and many different IP addresses. If we ever want to add the metric, perhaps we could propose a separate KIP, but that would require some more evaluation of the potential overhead.

901. Yes, I updated the wiki with the approach for enforcing per-IP limits (not dropping right away), as I described in my response to Rajini.

902. Any additional stress testing is always super useful. I am going to have a PR with the first half of the KIP (broker-wide and per-listener limits) ready soon. Perhaps it would be worthwhile to see if it makes sense to add stress testing to muckrake tests. Also, check out the connection stress workloads in Trogdor and whether they are sufficient or could be extended:
https://github.com/apache/kafka/tree/trunk/tools/src/main/java/org/apache/kafka/trogdor/workload

Regards,
Anna

On Mon, May 18, 2020 at 8:57 AM Rajini Sivaram <rajinisiva...@gmail.com> wrote:

Hi Anna,

Thanks for the response, sounds good.

Regards,
Rajini

On Sun, May 17, 2020 at 1:38 AM Anna Povzner <a...@confluent.io> wrote:

Hi Rajini,

Thanks for reviewing the KIP!

I agree with your suggestion to make the per-IP connection rate quota a dynamic quota for entity type IP. This will also allow configuring a connection rate for a particular IP. I updated the wiki accordingly.
Your second concern makes sense -- rejecting the connection right away will likely cause a new connection attempt from the same client. I am concerned about delaying new connections for later processing, because if connections keep arriving at a high rate, a large backlog may build up and connections may start timing out before the broker gets to processing them. For example, if clients come through a proxy, there may be a large number of incoming connections from the same IP.

What do you think about the following option:
* Once the per-IP connection rate reaches the limit, accept or drop (clean up) the connection after a delay, depending on whether the quota is still violated. We could re-use the mechanism implemented with KIP-306, where the broker delays the response for failed client authentication. The delay will be set to min(delay calculated based on the rate quota, 1 second), which matches the maximum delay for the request quota.

I think this option is essentially your suggestion of delaying acceptance of per-IP connections that have reached the rate limit, but with protection in place to make sure the number of delayed connections does not blow up. What do you think?

Thanks,
Anna

On Sat, May 16, 2020 at 1:09 AM Alexandre Dupriez <alexandre.dupr...@gmail.com> wrote:

Hi Anna,

Thank you for your answers and explanations.

A couple of additional comments:

900. KIP-612 does not intend to dedicate a metric to the throttling of incoming connections.
I wonder whether such a metric would be handy for monitoring and would help set up metric-based alarming if one wishes to capture this type of incident.

901. Following up on Rajini's point 2 above - from my understanding, this new quota should prevent excess CPU consumption in the Processor#run() method once a new connection has been accepted. With the throttling in place, connections will be delayed as indicated by the KIP's specification:

"If connection creation rate on the broker exceeds the broker-wide limit, the broker will delay accepting a new connection by an amount of time that brings the rate within the limit."

You may be referring to the following sentence:

"A new broker configuration option will be added to limit the rate at which connections will be accepted for each IP address. New connections for the IP will be dropped once the limit is reached."?

902. It may be interesting to capture the data with and without connection throttling under stress scenarios. You may have this data already. If you need a pair of hands to do some stress tests once you have a POC or a PR, I am happy to contribute :)

Thanks,
Alexandre

On Fri, May 15, 2020 at 12:22 PM Rajini Sivaram <rajinisiva...@gmail.com> wrote:

Hi Anna,

Thanks for the KIP, looks good overall. A couple of comments about per-IP connection quotas:

1) Should we consider making the per-IP quota similar to other quotas?
Configured as a dynamic quota for entity type IP, with per-IP limits as well as defaults? Perhaps that would fit better than configs.

2) The current proposal drops connections after accepting them once the per-IP limit is reached. We do this in other cases too, but in this case, should we throttle instead? My point is: what is the quota protecting? If we want to limit the rate of accepted connections, then accepting a connection and then dropping it doesn't really help, since that IP is going to reconnect. If we want to rate-limit what happens next, i.e. authentication, then throttling the accepted connection so that its processing is delayed would perhaps be better.

Regards,

Rajini

On Thu, May 14, 2020 at 4:12 PM David Jacot <dja...@confluent.io> wrote:

Hi Anna,

Thanks for your answers and the updated KIP. Looks good to me!

Best,
David

On Thu, May 14, 2020 at 12:54 AM Anna Povzner <a...@confluent.io> wrote:

I updated the KIP to add a new broker configuration to limit the connection creation rate per IP: max.connection.creation.rate.per.ip. Once the limit is reached for a particular IP address, the broker will reject connections from that IP (close the connection it accepted) and continue rejecting them until the rate is back within the limit.
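[Editorial illustration] The accept-or-reject behavior described above can be sketched with a simple per-IP sliding-window meter. This is a hypothetical sketch -- the class and method names (ConnectionRateMeter, tryAccept) and the window parameters are invented for illustration, and the broker's actual enforcement is built on its quota/sensor machinery rather than a deque of timestamps:

```java
import java.util.ArrayDeque;

// Illustrative sketch of per-IP connection-rate enforcement as discussed in
// the thread: accept a new connection only while the IP's recent connection
// count stays within the configured limit; otherwise close it immediately
// and keep rejecting until the observed rate falls back under the limit.
public class ConnectionRateMeter {
    private final int maxPerWindow;   // e.g. the per-IP rate limit
    private final long windowMs;      // length of the measurement window
    private final ArrayDeque<Long> timestamps = new ArrayDeque<>();

    public ConnectionRateMeter(int maxPerWindow, long windowMs) {
        this.maxPerWindow = maxPerWindow;
        this.windowMs = windowMs;
    }

    // Returns true if the connection should be accepted, false if the broker
    // should close it because the per-IP rate limit is currently exceeded.
    public synchronized boolean tryAccept(long nowMs) {
        // Drop samples that have aged out of the window.
        while (!timestamps.isEmpty() && nowMs - timestamps.peekFirst() >= windowMs) {
            timestamps.pollFirst();
        }
        if (timestamps.size() >= maxPerWindow) {
            return false;             // rate still above the limit: reject
        }
        timestamps.addLast(nowMs);
        return true;
    }
}
```

With a limit of 2 connections per 1-second window, a third connection arriving inside the window is rejected, and a later one is accepted once the earlier samples age out.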
On Wed, May 13, 2020 at 11:46 AM Anna Povzner <a...@confluent.io> wrote:

Hi David and Alexandre,

Thanks so much for your feedback! Here are my answers:

1. Yes, we have seen several cases of clients that create a new connection per produce/consume request. One hypothesis is that someone who is used to connection pooling may accidentally write a Kafka client that creates a new connection every time.

2 & 4. That's a good point I hadn't considered. I think it makes sense to provide the ability to limit connection creation per IP as well. This is not hard to implement -- the broker already keeps track of the number of connections per IP, and immediately closes a new connection if it comes from an IP that has reached the connection limit. So we could additionally track the rate, and close connections from an IP that exceeds the rate. One slight concern is whether keeping track of per-IP rates and quotas adds overhead (CPU and memory). But perhaps it is not a problem if we use expiring sensors.
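[Editorial illustration] The "expiring sensors" idea above bounds the memory overhead of per-IP tracking by purging entries for IPs that have gone idle. A hedged sketch of that principle -- ExpiringIpCounters and its methods are invented names, and Kafka's real metrics Sensor expiry differs in detail:

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Sketch of bounding per-IP tracking overhead: each IP gets a counter
// stamped with its last activity time, and entries idle longer than the
// expiry interval are purged, so memory stays proportional to the number
// of recently active IPs rather than all IPs ever seen.
public class ExpiringIpCounters {
    static final class Entry {
        long count;
        long lastSeenMs;
    }

    private final long expiryMs;
    private final Map<String, Entry> byIp = new HashMap<>();

    public ExpiringIpCounters(long expiryMs) {
        this.expiryMs = expiryMs;
    }

    // Record one connection attempt from this IP; returns its running count.
    public synchronized long record(String ip, long nowMs) {
        Entry e = byIp.computeIfAbsent(ip, k -> new Entry());
        e.count++;
        e.lastSeenMs = nowMs;
        return e.count;
    }

    // Remove trackers for IPs that have been idle past the expiry interval.
    public synchronized void purgeIdle(long nowMs) {
        for (Iterator<Entry> it = byIp.values().iterator(); it.hasNext(); ) {
            if (nowMs - it.next().lastSeenMs >= expiryMs) it.remove();
        }
    }

    public synchronized int trackedIps() {
        return byIp.size();
    }
}
```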
It would still make sense to limit the overall connection creation rate for Kafka clusters that are shared among many different applications/clients, since they may spike at the same time and bring the total rate too high.

3. Controlling connection queue sizes only controls the share of time network threads spend creating new connections (and accepting on the Acceptor thread) vs. doing other work on each Processor iteration. It does not directly control how processing connection creations relates to other processing done by brokers, such as on the request handler threads. So, while controlling queue size may mitigate the issue for some workloads, it does not guarantee that. Plus, if we want to limit how many connections are created per IP, the queue-size approach would not work, unless we go with a "share" of the queue, which I think further obscures what that setting means (and what we would achieve as an end result). Does this answer the question?

If there are no objections, I will update the KIP to add per-IP connection rate limits (config and enforcement).
Thanks,

Anna

On Tue, May 12, 2020 at 11:25 AM Alexandre Dupriez <alexandre.dupr...@gmail.com> wrote:

Hello,

Thank you for the KIP.

I have experienced genuine broker brownouts in the past due to connection storms consuming most of the CPU available on the server, and I think it is useful to protect against this.

I tend to share the questions asked in David's points 2 and 4. Is there still a risk of denial of service if the limit applies at the listener level without differentiating between the "offending" client(s) and the others?

To come back to point 3 - conceptually, one difference between capping the queue size and throttling as presented in the KIP comes from the time it takes to accept a connection and how that time evolves with the connection rate. Assuming that time increases monotonically with resource utilization, the admissible rate of connections would decrease as the server becomes more loaded if the limit were set on queue size.

Thanks,
Alexandre

On Tue,
May 12, 2020 at 8:49 AM David Jacot <dja...@confluent.io> wrote:

Hi Anna,

Thanks for the KIP! I have a few questions:

1. You mention that some clients may create a new connection for each request: "Another example is clients that create a new connection for each produce/consume request". I am curious: do we know of any clients behaving like this?

2. I am a bit concerned by the impact of misbehaving clients on the other ones. Let's say that we define a quota of 10 connections/sec for a broker, and that we have a misbehaving application constantly trying to create 20 connections on that broker. That application will constantly hit the quota and always have many pending connections in the queue waiting to be accepted. Regular clients trying to connect would need to wait until all the pending connections ahead of them in the queue are drained in the best-case scenario, or won't be able to connect at all in the worst-case scenario if the queue is full.
Does that sound like a valid concern? How do you see this?

3. As you mention in the KIP, we use bounded queues, which already limit the maximum number of connections that can be accepted. I wonder if we could reach the same goal by making the size of the queue configurable.

4. Did you consider doing something similar to the connections quota, which limits the number of connections per IP? Instead of rate-limiting all connection creation, we could perhaps rate-limit the number of creations per IP as well. That could reduce the effect on the other clients. It may be harder to implement, though.

Best,
David

On Mon, May 11, 2020 at 7:58 PM Anna Povzner <a...@confluent.io> wrote:

Hi,

I just created KIP-612 to allow limiting the connection creation rate on brokers, and would like to start a discussion.

https://cwiki.apache.org/confluence/display/KAFKA/KIP-612%3A+Ability+to+Limit+Connection+Creation+Rate+on+Brokers

Feedback and suggestions are welcome!

Thanks,
Anna