I realized the KIP freeze is on May 20. I will start the voting thread now.
On Mon, May 18, 2020 at 3:19 PM Anna Povzner <a...@confluent.io> wrote:

Thanks everyone for the feedback. I will start a voting thread tomorrow morning if there are no more comments.

Regards,
Anna

On Mon, May 18, 2020 at 2:06 PM Anna Povzner <a...@confluent.io> wrote:

Hi Boyang,

This KIP does not change the protocol with clients. The behavior is the same as with KIP-402, where the broker delays accepting new connections once the limit on the number of connections is reached. This KIP adds another reason for the delay (the rate limit being reached). The same goes for dropping a connection when the per-IP limit is reached, except that this KIP delays the response or may still accept the connection. A client may time out while waiting for the connection and retry.

Thanks,
Anna

On Mon, May 18, 2020 at 12:54 PM Boyang Chen <reluctanthero...@gmail.com> wrote:

Hey Anna,

Thanks for the KIP. Will this change be applied as one type of quota violation, which should be retriable on the client side? In the EOS model before 2.6, the Streams client creates one producer for each input partition, so it is actually possible to create thousands of producers when the service comes up. Just want to clarify what the expected behavior is on the client side.

On Mon, May 18, 2020 at 12:04 PM Anna Povzner <a...@confluent.io> wrote:

Hi Alexandre,

Thanks for your comments. My answers are below:

900. The KIP does not propose any new metrics because we already have metrics that let us monitor connection attempts and the amount of time the broker delays accepting new connections:
1. We have a per-listener (and per-processor) metric for connection creation rate:
kafka.server:type=socket-server-metrics,listener={listener_name},networkProcessor={#},name=connection-creation-rate
2.
We have per-listener metrics that track the amount of time the Acceptor is blocked from accepting connections:
kafka.network:type=Acceptor,name=AcceptorBlockedPercent,listener={listener_name}
Note that adding per-IP JMX metrics may add a lot of overhead, especially for clusters with a large number of clients and many different IP addresses. If we ever want to add the metric, perhaps we could propose a separate KIP, but that would require some more evaluation of the potential overhead.

901. Yes, I updated the wiki with the approach for enforcing per-IP limits (not dropping right away), as I described in my response to Rajini.

902. Any additional stress testing is always super useful. I am going to have a PR with the first half of the KIP (broker-wide and per-listener limits) ready soon. Perhaps it would be worthwhile to see if it makes sense to add stress testing to muckrake tests. Also, check out the connection stress workloads in Trogdor and whether they are sufficient or could be extended:
https://github.com/apache/kafka/tree/trunk/tools/src/main/java/org/apache/kafka/trogdor/workload

Regards,
Anna

On Mon, May 18, 2020 at 8:57 AM Rajini Sivaram <rajinisiva...@gmail.com> wrote:

Hi Anna,

Thanks for the response, sounds good.

Regards,
Rajini

On Sun, May 17, 2020 at 1:38 AM Anna Povzner <a...@confluent.io> wrote:

Hi Rajini,

Thanks for reviewing the KIP!

I agree with your suggestion to make the per-IP connection rate quota a dynamic quota for entity type IP. This will also allow configuring a connection rate for a particular IP. I updated the wiki accordingly.
Your second concern makes sense -- rejecting the connection right away will likely cause a new connection attempt from the same client. I am concerned about delaying new connections for later processing, because if connections keep arriving at a high rate, a large backlog may build up and connections may start timing out before the broker gets to processing them. For example, if clients come through a proxy, there may be a large number of incoming connections from the same IP.

What do you think about the following option:
* Once the per-IP connection rate reaches the limit, accept or drop (clean up) the connection after a delay, depending on whether the quota is still violated. We could re-use the mechanism implemented with KIP-306, where the broker delays the response for failed client authentication. The delay will be set to min(delay calculated based on the rate quota, 1 second), which matches the maximum delay for the request quota.

I think this option is essentially your suggestion of delaying acceptance of per-IP connections that have reached the rate limit, but with protection in place to make sure the number of delayed connections does not blow up. What do you think?

Thanks,
Anna

On Sat, May 16, 2020 at 1:09 AM Alexandre Dupriez <alexandre.dupr...@gmail.com> wrote:

Hi Anna,

Thank you for your answers and explanations.

A couple of additional comments:

900. KIP-612 does not intend to dedicate a metric to the throttling of incoming connections.
I wonder whether such a metric would be handy for monitoring and would help set up metric-based alarming if one wishes to capture this type of incident.

901. Following up on Rajini's point 2 above - from my understanding, this new quota should prevent excess CPU consumption in the Processor#run() method once a new connection has been accepted. With the throttling in place, connections will be delayed as indicated by the KIP's specification:

"If connection creation rate on the broker exceeds the broker-wide limit, the broker will delay accepting a new connection by an amount of time that brings the rate within the limit."

You may be referring to the following sentence:

"A new broker configuration option will be added to limit the rate at which connections will be accepted for each IP address. New connections for the IP will be dropped once the limit is reached."?

902. It may be interesting to capture the data with and without connection throttling under stress scenarios. You may have this data already. If you need a pair of hands to do some stress tests once you have a POC or a PR, I am happy to contribute :)

Thanks,
Alexandre

On Fri, May 15, 2020 at 12:22 PM Rajini Sivaram <rajinisiva...@gmail.com> wrote:

Hi Anna,

Thanks for the KIP, looks good overall. A couple of comments about per-IP connection quotas:

1) Should we consider making the per-IP quota similar to other quotas?
Configured as a dynamic quota for entity type IP, with per-IP limits as well as defaults? Perhaps that would fit better than configs.

2) The current proposal drops connections after accepting them once the per-IP limit is reached. We do this in other cases too, but in this case, should we throttle instead? My point is: what is the quota protecting? If we want to limit the rate of accepted connections, then accepting a connection and then dropping it doesn't really help, since that IP is going to reconnect. If we want to rate-limit what happens next, i.e. authentication, then throttling the accepted connection so that its processing is delayed would perhaps be better.

Regards,

Rajini

On Thu, May 14, 2020 at 4:12 PM David Jacot <dja...@confluent.io> wrote:

Hi Anna,

Thanks for your answers and the updated KIP. Looks good to me!

Best,
David

On Thu, May 14, 2020 at 12:54 AM Anna Povzner <a...@confluent.io> wrote:

I updated the KIP to add a new broker configuration to limit the connection creation rate per IP: max.connection.creation.rate.per.ip. Once the limit is reached for a particular IP address, the broker will reject connections from that IP (close the connection it accepted) and continue rejecting them until the rate is back within the limit.
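[Editorial illustration] The accept-or-reject behavior described above can be sketched with a simple per-IP sliding-window meter. This is a hypothetical sketch -- the class and method names (ConnectionRateMeter, tryAccept) and the window parameters are invented for illustration, and the broker's actual enforcement is built on its quota/sensor machinery rather than a deque of timestamps:

```java
import java.util.ArrayDeque;

// Illustrative sketch of per-IP connection-rate enforcement as discussed in
// the thread: accept a new connection only while the IP's recent connection
// count stays within the configured limit; otherwise close it immediately
// and keep rejecting until the observed rate falls back under the limit.
public class ConnectionRateMeter {
    private final int maxPerWindow;   // e.g. the per-IP rate limit
    private final long windowMs;      // length of the measurement window
    private final ArrayDeque<Long> timestamps = new ArrayDeque<>();

    public ConnectionRateMeter(int maxPerWindow, long windowMs) {
        this.maxPerWindow = maxPerWindow;
        this.windowMs = windowMs;
    }

    // Returns true if the connection should be accepted, false if the broker
    // should close it because the per-IP rate limit is currently exceeded.
    public synchronized boolean tryAccept(long nowMs) {
        // Drop samples that have aged out of the window.
        while (!timestamps.isEmpty() && nowMs - timestamps.peekFirst() >= windowMs) {
            timestamps.pollFirst();
        }
        if (timestamps.size() >= maxPerWindow) {
            return false;             // rate still above the limit: reject
        }
        timestamps.addLast(nowMs);
        return true;
    }
}
```

With a limit of 2 connections per 1-second window, a third connection arriving inside the window is rejected, and a later one is accepted once the earlier samples age out.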
On Wed, May 13, 2020 at 11:46 AM Anna Povzner <a...@confluent.io> wrote:

Hi David and Alexandre,

Thanks so much for your feedback! Here are my answers:

1. Yes, we have seen several cases of clients that create a new connection per produce/consume request. One hypothesis is that someone who is used to connection pooling may accidentally write a Kafka client that creates a new connection every time.

2 & 4. That's a good point I hadn't considered. I think it makes sense to provide the ability to limit connection creation per IP as well. This is not hard to implement -- the broker already keeps track of the number of connections per IP, and immediately closes a new connection if it comes from an IP that has reached the connection limit. So we could additionally track the rate, and close connections from an IP that exceeds the rate. One slight concern is whether keeping track of per-IP rates and quotas adds overhead (CPU and memory). But perhaps it is not a problem if we use expiring sensors.
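[Editorial illustration] The "expiring sensors" idea above bounds the memory overhead of per-IP tracking by purging entries for IPs that have gone idle. A hedged sketch of that principle -- ExpiringIpCounters and its methods are invented names, and Kafka's real metrics Sensor expiry differs in detail:

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Sketch of bounding per-IP tracking overhead: each IP gets a counter
// stamped with its last activity time, and entries idle longer than the
// expiry interval are purged, so memory stays proportional to the number
// of recently active IPs rather than all IPs ever seen.
public class ExpiringIpCounters {
    static final class Entry {
        long count;
        long lastSeenMs;
    }

    private final long expiryMs;
    private final Map<String, Entry> byIp = new HashMap<>();

    public ExpiringIpCounters(long expiryMs) {
        this.expiryMs = expiryMs;
    }

    // Record one connection attempt from this IP; returns its running count.
    public synchronized long record(String ip, long nowMs) {
        Entry e = byIp.computeIfAbsent(ip, k -> new Entry());
        e.count++;
        e.lastSeenMs = nowMs;
        return e.count;
    }

    // Remove trackers for IPs that have been idle past the expiry interval.
    public synchronized void purgeIdle(long nowMs) {
        for (Iterator<Entry> it = byIp.values().iterator(); it.hasNext(); ) {
            if (nowMs - it.next().lastSeenMs >= expiryMs) it.remove();
        }
    }

    public synchronized int trackedIps() {
        return byIp.size();
    }
}
```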
It would still make sense to limit the overall connection creation rate for Kafka clusters that are shared among many different applications/clients, since they may spike at the same time and bring the total rate too high.

3. Controlling connection queue sizes only controls the share of time network threads spend creating new connections (and accepting on the Acceptor thread) vs. doing other work on each Processor iteration. It does not directly control how processing connection creations relates to other processing done by brokers, such as on the request handler threads. So, while controlling queue size may mitigate the issue for some workloads, it does not guarantee that. Plus, if we want to limit how many connections are created per IP, the queue-size approach would not work, unless we go with a "share" of the queue, which I think further obscures what that setting means (and what we would achieve as an end result). Does this answer the question?

If there are no objections, I will update the KIP to add per-IP connection rate limits (config and enforcement).
Thanks,

Anna

On Tue, May 12, 2020 at 11:25 AM Alexandre Dupriez <alexandre.dupr...@gmail.com> wrote:

Hello,

Thank you for the KIP.

I have experienced genuine broker brownouts in the past due to connection storms consuming most of the CPU available on the server, and I think it is useful to protect against this.

I tend to share the questions asked in David's points 2 and 4. Is there still a risk of denial of service if the limit applies at the listener level without differentiating between the "offending" client(s) and the others?

To come back to point 3 - conceptually, one difference between capping the queue size and throttling as presented in the KIP comes from the time it takes to accept a connection and how that time evolves with the connection rate. Assuming that time increases monotonically with resource utilization, the admissible rate of connections would decrease as the server becomes more loaded if the limit were set on queue size.

Thanks,
Alexandre

On Tue,
May 12, 2020 at 8:49 AM David Jacot <dja...@confluent.io> wrote:

Hi Anna,

Thanks for the KIP! I have a few questions:

1. You mention that some clients may create a new connection for each request: "Another example is clients that create a new connection for each produce/consume request". I am curious: do we know of any clients behaving like this?

2. I am a bit concerned by the impact of misbehaving clients on the other ones. Let's say that we define a quota of 10 connections/sec for a broker, and that we have a misbehaving application constantly trying to create 20 connections on that broker. That application will constantly hit the quota and always have many pending connections in the queue waiting to be accepted. Regular clients trying to connect would need to wait until all the pending connections ahead of them in the queue are drained in the best-case scenario, or won't be able to connect at all in the worst-case scenario if the queue is full.
Does that sound like a valid concern? How do you see this?

3. As you mention in the KIP, we use bounded queues, which already limit the maximum number of connections that can be accepted. I wonder if we could reach the same goal by making the size of the queue configurable.

4. Did you consider doing something similar to the connections quota, which limits the number of connections per IP? Instead of rate-limiting all connection creation, we could perhaps rate-limit the number of creations per IP as well. That could reduce the effect on the other clients. It may be harder to implement, though.

Best,
David

On Mon, May 11, 2020 at 7:58 PM Anna Povzner <a...@confluent.io> wrote:

Hi,

I just created KIP-612 to allow limiting the connection creation rate on brokers, and would like to start a discussion.

https://cwiki.apache.org/confluence/display/KAFKA/KIP-612%3A+Ability+to+Limit+Connection+Creation+Rate+on+Brokers

Feedback and suggestions are welcome!

Thanks,
Anna