Hi Anna,

Thank you for your answers.

900. Sorry, I forgot the connection creation rate metric already
allows users to detect excessive rates. I agree per-IP metrics would
expose to a high-cardinality risk.
902. Thank you for the pointers. I will explore the existing stress
test in Trogdor.

Thanks,
Alexandre

Le mar. 19 mai 2020 à 00:51, Anna Povzner <a...@confluent.io> a écrit :
>
> I realized the KIP freeze is on May 20. I will start the voting thread now.
>
> On Mon, May 18, 2020 at 3:19 PM Anna Povzner <a...@confluent.io> wrote:
>
> > Thanks everyone for the feedback. I will start a voting thread tomorrow
> > morning if there are no more comments.
> >
> > Regards,
> > Anna
> >
> > On Mon, May 18, 2020 at 2:06 PM Anna Povzner <a...@confluent.io> wrote:
> >
> >> Hi Boyang,
> >>
> >> This KIP does not change the protocol with clients. The behavior is the
> >> same as with KIP-402 where the broker delays accepting new connections when
> >> the limit for the number of connections is reached. This KIP adds another
> >> reason for the delay (when the rate is reached). Similarly, when dropping a
> >> connection when per-IP limit is reached, except this KIP delays the
> >> response or may still accept the connection. Client may timeout on waiting
> >> for connection, and retry.
> >>
> >> Thanks,
> >> Anna
> >>
> >> On Mon, May 18, 2020 at 12:54 PM Boyang Chen <reluctanthero...@gmail.com>
> >> wrote:
> >>
> >>> Hey Anna,
> >>>
> >>> thanks for the KIP. Will this change be applied as one type of quota
> >>> violation, which for client side should be retriable? For EOS model
> >>> before
> >>> 2.6, the Streams client creates one producer for each input partition, so
> >>> it is actually possible to create thousands of producers when the service
> >>> is up. Just want to clarify what's the expected behavior to be seen on
> >>> the
> >>> client side?
> >>>
> >>> On Mon, May 18, 2020 at 12:04 PM Anna Povzner <a...@confluent.io> wrote:
> >>>
> >>> > Hi Alexandre,
> >>> >
> >>> > Thanks for your comments. My answers are below:
> >>> >
> >>> > 900. The KIP does not propose any new metrics because we already have
> >>> > metrics that will let us monitor connection attempts and the amount of
> >>> time
> >>> > the broker delays accepting new connections:
> >>> > 1. We have a per-listener (and per-processor) metric for connection
> >>> > creation rate:
> >>> >
> >>> >
> >>> kafka.server:type=socket-server-metrics,listener={listener_name},networkProcessor={#},name=connection-creation-rate
> >>> > 2. We have per-listener metrics that track the amount of time Acceptor
> >>> is
> >>> > blocked from accepting connections:
> >>> >
> >>> >
> >>> kafka.network:type=Acceptor,name=AcceptorBlockedPercent,listener={listener_name}
> >>> > Note that adding per IP JMX metrics may end up adding a lot of
> >>> overhead,
> >>> > especially for clusters with a large number of clients and many
> >>> different
> >>> > IP addresses. If we ever want to add the metric, perhaps we could
> >>> propose a
> >>> > separate KIP, but that would require some more evaluation of potential
> >>> > overhead.
> >>> >
> >>> > 901. Yes, I updated the wiki with the approach for enforcing per IP
> >>> limits
> >>> > (not dropping right away), as I described in my response to Rajini.
> >>> >
> >>> > 902. Any additional stress testing is always super useful. I am going
> >>> to
> >>> > have PR with the first half of the KIP ready soon (broker-wider and
> >>> > per-listener limits). Perhaps it could be worthwhile to see if it makes
> >>> > sense to add stress testing to muckrake tests. Also, check out
> >>> connection
> >>> > stress workloads in Trogdor and whether they are sufficient or could be
> >>> > extended:
> >>> >
> >>> >
> >>> https://github.com/apache/kafka/tree/trunk/tools/src/main/java/org/apache/kafka/trogdor/workload
> >>> >
> >>> > Regards,
> >>> > Anna
> >>> >
> >>> > On Mon, May 18, 2020 at 8:57 AM Rajini Sivaram <
> >>> rajinisiva...@gmail.com>
> >>> > wrote:
> >>> >
> >>> > > Hi Anna,
> >>> > >
> >>> > > Thanks for the response, sounds good.
> >>> > >
> >>> > > Regards,
> >>> > >
> >>> > > Rajini
> >>> > >
> >>> > >
> >>> > > On Sun, May 17, 2020 at 1:38 AM Anna Povzner <a...@confluent.io>
> >>> wrote:
> >>> > >
> >>> > > > Hi Rajini,
> >>> > > >
> >>> > > > Thanks for reviewing the KIP!
> >>> > > >
> >>> > > > I agree with your suggestion to make per-IP connection rate quota a
> >>> > > dynamic
> >>> > > > quota for entity name IP. This will allow configuring connection
> >>> rate
> >>> > > for a
> >>> > > > particular IP as well. I updated the wiki accordingly.
> >>> > > >
> >>> > > > Your second concern makes sense -- rejecting the connection right
> >>> away
> >>> > > will
> >>> > > > likely cause a new connection from the same client. I am concerned
> >>> > about
> >>> > > > delaying new connections for processing later, because if the
> >>> > connections
> >>> > > > keep coming with the high rate, there may be potentially a large
> >>> > backlog
> >>> > > > and connections may start timing out before the broker gets to
> >>> > processing
> >>> > > > them. For example, if clients come through proxy, there may be
> >>> > > > potentially a large number of incoming connections with the same
> >>> IP.
> >>> > > >
> >>> > > > What do you think about the following option:
> >>> > > > * Once per-IP connection rate reaches the limit, accept or drop
> >>> (clean
> >>> > > up)
> >>> > > > the connection after a delay depending on whether the quota is
> >>> still
> >>> > > > violated. We could re-use the mechanism implemented with KIP-306
> >>> where
> >>> > > the
> >>> > > > broker delays the response for failed client authentication. The
> >>> delay
> >>> > > will
> >>> > > > be set to min(delay calculated based on the rate quota, 1 second),
> >>> > which
> >>> > > > matches the max delay for request quota.
> >>> > > >
> >>> > > > I think this option is somewhat your suggestion with delaying
> >>> accepting
> >>> > > per
> >>> > > > IP connections that reached the rate limit, but with protection in
> >>> > place
> >>> > > to
> >>> > > > make sure the number of delayed connections does not blow up. What
> >>> do
> >>> > you
> >>> > > > think?
> >>> > > >
> >>> > > > Thanks,
> >>> > > > Anna
> >>> > > >
> >>> > > > On Sat, May 16, 2020 at 1:09 AM Alexandre Dupriez <
> >>> > > > alexandre.dupr...@gmail.com> wrote:
> >>> > > >
> >>> > > > > Hi Anna,
> >>> > > > >
> >>> > > > > Thank you for your answers and explanations.
> >>> > > > >
> >>> > > > > A couple of additional comments:
> >>> > > > >
> >>> > > > > 900. KIP-612 does not intend to dedicate a metric to the
> >>> throttling
> >>> > of
> >>> > > > > incoming connections. I wonder if such a metric would be handy
> >>> for
> >>> > > > > monitoring and help set-up metric-based alarming if one wishes to
> >>> > > > > capture this type of incident?
> >>> > > > >
> >>> > > > > 901. Following-up on Rajini's point 2 above - from my
> >>> understanding,
> >>> > > > > this new quota should prevent excess CPU consumption in
> >>> > > > > Processor#run() method when a new connection has been accepted.
> >>> > > > > Through the throttling in place, connections will be delayed as
> >>> > > > > indicated by the KIP's specifications:
> >>> > > > >
> >>> > > > > " If connection creation rate on the broker exceeds the
> >>> broker-wide
> >>> > > > > limit, the broker will delay accepting a new connection by an
> >>> amount
> >>> > > > > of time that brings the rate within the limit."
> >>> > > > >
> >>> > > > > You may be referring to the following sentence:
> >>> > > > >
> >>> > > > > "A new broker configuration option will be added to limit the
> >>> rate at
> >>> > > > > which connections will be accepted for each IP address. New
> >>> > > > > connections for the IP will be dropped once the limit is
> >>> reached."?
> >>> > > > >
> >>> > > > > 902. It may be interesting to capture the data with and without
> >>> > > > > connection throttling under stress scenarios. You may have these
> >>> data
> >>> > > > > already. If you need a pair of hands to do some stress tests
> >>> once you
> >>> > > > > have a POC or a PR, I am happy to contribute :)
> >>> > > > >
> >>> > > > > Thanks,
> >>> > > > > Alexandre
> >>> > > > >
> >>> > > > > Le ven. 15 mai 2020 à 12:22, Rajini Sivaram <
> >>> rajinisiva...@gmail.com
> >>> > >
> >>> > > a
> >>> > > > > écrit :
> >>> > > > > >
> >>> > > > > > Hi Anna,
> >>> > > > > >
> >>> > > > > > Thanks for the KIP, looks good overall. A couple of comments
> >>> about
> >>> > > > per-IP
> >>> > > > > > connection quotas:
> >>> > > > > >
> >>> > > > > > 1) Should we consider making per-IP quota similar to other
> >>> quotas?
> >>> > > > > > Configured as a dynamic quota for entity type IP, with per-IP
> >>> limit
> >>> > > as
> >>> > > > > well
> >>> > > > > > as defaults? Perhaps that would fit better rather than configs?
> >>> > > > > >
> >>> > > > > > 2) The current proposal drops connections after accepting
> >>> > connections
> >>> > > > for
> >>> > > > > > per-IP limit. We do this in other cases too, but in this case,
> >>> > should
> >>> > > > we
> >>> > > > > > throttle instead? My point is what is the quota protecting? If
> >>> we
> >>> > > want
> >>> > > > to
> >>> > > > > > limit rate of accepted connections, then accepting a
> >>> connection and
> >>> > > > then
> >>> > > > > > dropping doesn't really help since that IP is going to
> >>> reconnect.
> >>> > If
> >>> > > we
> >>> > > > > > want to rate limit what happens next, i.e. authentication, then
> >>> > > > > > throttling the accepted connection so its processing is delayed
> >>> > would
> >>> > > > > > perhaps be better?
> >>> > > > > >
> >>> > > > > > Regards,
> >>> > > > > >
> >>> > > > > > Rajini
> >>> > > > > >
> >>> > > > > > On Thu, May 14, 2020 at 4:12 PM David Jacot <
> >>> dja...@confluent.io>
> >>> > > > wrote:
> >>> > > > > >
> >>> > > > > > > Hi Anna,
> >>> > > > > > >
> >>> > > > > > > Thanks for your answers and the updated KIP. Looks good to
> >>> me!
> >>> > > > > > >
> >>> > > > > > > Best,
> >>> > > > > > > David
> >>> > > > > > >
> >>> > > > > > > On Thu, May 14, 2020 at 12:54 AM Anna Povzner <
> >>> a...@confluent.io
> >>> > >
> >>> > > > > wrote:
> >>> > > > > > >
> >>> > > > > > > > I updated the KIP to add a new broker configuration to
> >>> limit
> >>> > > > > connection
> >>> > > > > > > > creation rate per IP: max.connection.creation.rate.per.ip.
> >>> Once
> >>> > > the
> >>> > > > > limit
> >>> > > > > > > > is reached for a particular IP address, the broker will
> >>> reject
> >>> > > the
> >>> > > > > > > > connection from that IP (close the connection it accepted)
> >>> and
> >>> > > > > continue
> >>> > > > > > > > rejecting them until the rate is back within the rate
> >>> limit.
> >>> > > > > > > >
> >>> > > > > > > > On Wed, May 13, 2020 at 11:46 AM Anna Povzner <
> >>> > a...@confluent.io
> >>> > > >
> >>> > > > > wrote:
> >>> > > > > > > >
> >>> > > > > > > > > Hi David and Alexandre,
> >>> > > > > > > > >
> >>> > > > > > > > > Thanks so much for your feedback! Here are my answers:
> >>> > > > > > > > >
> >>> > > > > > > > > 1. Yes, we have seen several cases of clients that
> >>> create a
> >>> > new
> >>> > > > > > > > connection
> >>> > > > > > > > > per produce/consume request. One hypothesis is someone
> >>> who is
> >>> > > > used
> >>> > > > > to
> >>> > > > > > > > > connection pooling may accidentally write a Kafka client
> >>> that
> >>> > > > > creates a
> >>> > > > > > > > new
> >>> > > > > > > > > connection every time.
> >>> > > > > > > > >
> >>> > > > > > > > > 2 & 4. That's a good point I haven't considered. I think
> >>> it
> >>> > > makes
> >>> > > > > sense
> >>> > > > > > > > to
> >>> > > > > > > > > provide an ability to limit connection creations per IP
> >>> as
> >>> > > well.
> >>> > > > > This
> >>> > > > > > > is
> >>> > > > > > > > > not hard to implement -- the broker already keeps track
> >>> of
> >>> > the
> >>> > > > > number
> >>> > > > > > > of
> >>> > > > > > > > > connections per IP, and immediately closes a new
> >>> connection
> >>> > if
> >>> > > it
> >>> > > > > comes
> >>> > > > > > > > > from an IP that reached the connection limit. So, we
> >>> could
> >>> > > > > additionally
> >>> > > > > > > > > track the rate, and close the connection from IP that
> >>> exceeds
> >>> > > the
> >>> > > > > rate.
> >>> > > > > > > > One
> >>> > > > > > > > > slight concern is whether keeping track of per IP rates
> >>> and
> >>> > > > quotas
> >>> > > > > adds
> >>> > > > > > > > > overhead (CPU and memory). But perhaps it is not a
> >>> problem if
> >>> > > we
> >>> > > > > use
> >>> > > > > > > > > expiring sensors.
> >>> > > > > > > > >
> >>> > > > > > > > > It would still make sense to limit the overall connection
> >>> > > > creation
> >>> > > > > rate
> >>> > > > > > > > > for the Kafka clusters which are shared among many
> >>> different
> >>> > > > > > > > > applications/clients, since they may spike at the same
> >>> time
> >>> > > > > bringing
> >>> > > > > > > the
> >>> > > > > > > > > total rate too high.
> >>> > > > > > > > >
> >>> > > > > > > > > 3. Controlling connection queue sizes only controls the
> >>> share
> >>> > > of
> >>> > > > > time
> >>> > > > > > > > > network threads use for creating new connections (and
> >>> > accepting
> >>> > > > on
> >>> > > > > > > > Acceptor
> >>> > > > > > > > > thread) vs. doing other work on each Processor
> >>> iteration. It
> >>> > > does
> >>> > > > > not
> >>> > > > > > > > > directly control how processing connection creations
> >>> would be
> >>> > > > > related
> >>> > > > > > > to
> >>> > > > > > > > > other processing done by brokers like on request handler
> >>> > > threads.
> >>> > > > > So,
> >>> > > > > > > > while
> >>> > > > > > > > > controlling queue size may mitigate the issue for some
> >>> of the
> >>> > > > > > > workloads,
> >>> > > > > > > > it
> >>> > > > > > > > > does not guarantee that. Plus, if we want to limit how
> >>> many
> >>> > > > > connections
> >>> > > > > > > > are
> >>> > > > > > > > > created per IP, the queue size approach would not work,
> >>> > unless
> >>> > > we
> >>> > > > > go
> >>> > > > > > > > with a
> >>> > > > > > > > > "share" of the queue, which I think even further obscures
> >>> > what
> >>> > > > that
> >>> > > > > > > > setting
> >>> > > > > > > > > means (and what we would achieve as an end result). Does
> >>> this
> >>> > > > > answer
> >>> > > > > > > the
> >>> > > > > > > > > question?
> >>> > > > > > > > >
> >>> > > > > > > > > If there are no objections, I will update the KIP to add
> >>> per
> >>> > IP
> >>> > > > > > > > connection
> >>> > > > > > > > > rate limits (config and enforcement).
> >>> > > > > > > > >
> >>> > > > > > > > > Thanks,
> >>> > > > > > > > >
> >>> > > > > > > > > Anna
> >>> > > > > > > > >
> >>> > > > > > > > >
> >>> > > > > > > > > On Tue, May 12, 2020 at 11:25 AM Alexandre Dupriez <
> >>> > > > > > > > > alexandre.dupr...@gmail.com> wrote:
> >>> > > > > > > > >
> >>> > > > > > > > >> Hello,
> >>> > > > > > > > >>
> >>> > > > > > > > >> Thank you for the KIP.
> >>> > > > > > > > >>
> >>> > > > > > > > >> I experienced in the past genuine broker brownouts due
> >>> to
> >>> > > > > connection
> >>> > > > > > > > >> storms consuming most of the CPU available on the server
> >>> > and I
> >>> > > > > think
> >>> > > > > > > > >> it is useful to protect against it.
> >>> > > > > > > > >>
> >>> > > > > > > > >> I tend to share the questions asked in points 2 and 4
> >>> from
> >>> > > > David.
> >>> > > > > Is
> >>> > > > > > > > >> there still a risk of denial of service if the limit
> >>> applies
> >>> > > at
> >>> > > > > the
> >>> > > > > > > > >> listener-level without differentiating between (an)
> >>> > > “offending”
> >>> > > > > > > > >> client(s) and the others?
> >>> > > > > > > > >>
> >>> > > > > > > > >> To rebound on point 3 - conceptually one difference
> >>> between
> >>> > > > > capping
> >>> > > > > > > > >> the queue size or throttling as presented in the KIP
> >>> would
> >>> > > come
> >>> > > > > from
> >>> > > > > > > > >> the time it takes to accept a connection and how that
> >>> time
> >>> > > > evolves
> >>> > > > > > > > >> with the connection rate.
> >>> > > > > > > > >> Assuming that that time increases monotonically with
> >>> > resource
> >>> > > > > > > > >> utilization, the admissible rate of connections would
> >>> > decrease
> >>> > > > as
> >>> > > > > the
> >>> > > > > > > > >> server becomes more loaded, if the limit was set on
> >>> queue
> >>> > > size.
> >>> > > > > > > > >>
> >>> > > > > > > > >> Thanks,
> >>> > > > > > > > >> Alexandre
> >>> > > > > > > > >>
> >>> > > > > > > > >> Le mar. 12 mai 2020 à 08:49, David Jacot <
> >>> > dja...@confluent.io
> >>> > > >
> >>> > > > a
> >>> > > > > > > écrit
> >>> > > > > > > > :
> >>> > > > > > > > >> >
> >>> > > > > > > > >> > Hi Anna,
> >>> > > > > > > > >> >
> >>> > > > > > > > >> > Thanks for the KIP! I have few questions:
> >>> > > > > > > > >> >
> >>> > > > > > > > >> > 1. You mention that some clients may create a new
> >>> > > connections
> >>> > > > > for
> >>> > > > > > > each
> >>> > > > > > > > >> > requests: "Another example is clients that create a
> >>> new
> >>> > > > > connection
> >>> > > > > > > for
> >>> > > > > > > > >> each
> >>> > > > > > > > >> > produce/consume request". I am curious here but do we
> >>> know
> >>> > > any
> >>> > > > > > > clients
> >>> > > > > > > > >> > behaving like this?
> >>> > > > > > > > >> >
> >>> > > > > > > > >> > 2. I am a bit concerned by the impact of misbehaving
> >>> > clients
> >>> > > > on
> >>> > > > > the
> >>> > > > > > > > >> other
> >>> > > > > > > > >> > ones. Let's say that we define a quota of 10
> >>> connections /
> >>> > > sec
> >>> > > > > for a
> >>> > > > > > > > >> broker
> >>> > > > > > > > >> > and that we have a misbehaving application constantly
> >>> > trying
> >>> > > > to
> >>> > > > > > > create
> >>> > > > > > > > >> 20
> >>> > > > > > > > >> > connections on that broker. That application will
> >>> > constantly
> >>> > > > > hit the
> >>> > > > > > > > >> quota
> >>> > > > > > > > >> > and
> >>> > > > > > > > >> > always have many pending connections in the queue
> >>> waiting
> >>> > to
> >>> > > > be
> >>> > > > > > > > >> accepted.
> >>> > > > > > > > >> > Regular clients trying to connect would need to wait
> >>> until
> >>> > > all
> >>> > > > > the
> >>> > > > > > > > >> pending
> >>> > > > > > > > >> > connections upfront in the queue are drained in the
> >>> best
> >>> > > case
> >>> > > > > > > scenario
> >>> > > > > > > > >> or
> >>> > > > > > > > >> > won't be able to connect at all in the worst case
> >>> scenario
> >>> > > if
> >>> > > > > the
> >>> > > > > > > > queue
> >>> > > > > > > > >> is
> >>> > > > > > > > >> > full.
> >>> > > > > > > > >> > Does it sound like a valid concern? How do you see
> >>> this?
> >>> > > > > > > > >> >
> >>> > > > > > > > >> > 3. As you mention it in the KIP, we use bounded queues
> >>> > which
> >>> > > > > already
> >>> > > > > > > > >> limit
> >>> > > > > > > > >> > the maximum number of connections that can be
> >>> accepted. I
> >>> > > > > wonder if
> >>> > > > > > > we
> >>> > > > > > > > >> > could reach the same goal by making the size of the
> >>> queue
> >>> > > > > > > > configurable.
> >>> > > > > > > > >> >
> >>> > > > > > > > >> > 4. Did you consider doing something similar to the
> >>> > > connections
> >>> > > > > quota
> >>> > > > > > > > >> which
> >>> > > > > > > > >> > limits the number of connections per IP? Instead of
> >>> rate
> >>> > > > > limiting
> >>> > > > > > > all
> >>> > > > > > > > >> the
> >>> > > > > > > > >> > creation,
> >>> > > > > > > > >> > we could perhaps rate limit the number of creation
> >>> per IP
> >>> > as
> >>> > > > > well.
> >>> > > > > > > > That
> >>> > > > > > > > >> > could
> >>> > > > > > > > >> > perhaps reduce the effect on the other clients. That
> >>> may
> >>> > be
> >>> > > > > harder
> >>> > > > > > > to
> >>> > > > > > > > >> > implement
> >>> > > > > > > > >> > though.
> >>> > > > > > > > >> >
> >>> > > > > > > > >> > Best,
> >>> > > > > > > > >> > David
> >>> > > > > > > > >> >
> >>> > > > > > > > >> > On Mon, May 11, 2020 at 7:58 PM Anna Povzner <
> >>> > > > a...@confluent.io
> >>> > > > > >
> >>> > > > > > > > wrote:
> >>> > > > > > > > >> >
> >>> > > > > > > > >> > > Hi,
> >>> > > > > > > > >> > >
> >>> > > > > > > > >> > > I just created KIP-612 to allow limiting connection
> >>> > > creation
> >>> > > > > rate
> >>> > > > > > > on
> >>> > > > > > > > >> > > brokers, and would like to start a discussion.
> >>> > > > > > > > >> > >
> >>> > > > > > > > >> > >
> >>> > > > > > > > >> > >
> >>> > > > > > > > >>
> >>> > > > > > > >
> >>> > > > > > >
> >>> > > > >
> >>> > > >
> >>> > >
> >>> >
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-612%3A+Ability+to+Limit+Connection+Creation+Rate+on+Brokers
> >>> > > > > > > > >> > >
> >>> > > > > > > > >> > > Feedback and suggestions are welcome!
> >>> > > > > > > > >> > >
> >>> > > > > > > > >> > > Thanks,
> >>> > > > > > > > >> > > Anna
> >>> > > > > > > > >> > >
> >>> > > > > > > > >>
> >>> > > > > > > > >
> >>> > > > > > > >
> >>> > > > > > >
> >>> > > > >
> >>> > > >
> >>> > >
> >>> >
> >>>
> >>

Reply via email to