Hi Anna,

Thank you for your answers.

900. Sorry, I forgot the connection creation rate metric already allows users to detect excessive rates. I agree per-IP metrics would expose us to a high-cardinality risk.

902. Thank you for the pointers. I will explore the existing stress tests in Trogdor.

Thanks,
Alexandre

On Tue, May 19, 2020 at 00:51, Anna Povzner <a...@confluent.io> wrote:

> I realized the KIP freeze is on May 20. I will start the voting thread now.
>
> On Mon, May 18, 2020 at 3:19 PM Anna Povzner <a...@confluent.io> wrote:
>
> > Thanks everyone for the feedback. I will start a voting thread tomorrow morning if there are no more comments.
> >
> > Regards,
> > Anna
> >
> > On Mon, May 18, 2020 at 2:06 PM Anna Povzner <a...@confluent.io> wrote:
> >
> > > Hi Boyang,
> > >
> > > This KIP does not change the protocol with clients. The behavior is the same as with KIP-402, where the broker delays accepting new connections when the limit on the number of connections is reached. This KIP adds another reason for the delay (when the rate limit is reached). It is similar to dropping a connection when the per-IP limit is reached, except that this KIP delays the response or may still accept the connection. Clients may time out waiting for the connection and retry.
> > >
> > > Thanks,
> > > Anna
> > >
> > > On Mon, May 18, 2020 at 12:54 PM Boyang Chen <reluctanthero...@gmail.com> wrote:
> > >
> > > > Hey Anna,
> > > >
> > > > Thanks for the KIP. Will this change be applied as one type of quota violation, which on the client side should be retriable? For the EOS model before 2.6, the Streams client creates one producer for each input partition, so it is actually possible to create thousands of producers when the service is up. Just want to clarify: what is the expected behavior on the client side?
> > > >
> > > > On Mon, May 18, 2020 at 12:04 PM Anna Povzner <a...@confluent.io> wrote:
> > > >
> > > > > Hi Alexandre,
> > > > >
> > > > > Thanks for your comments. My answers are below:
> > > > >
> > > > > 900. The KIP does not propose any new metrics because we already have metrics that will let us monitor connection attempts and the amount of time the broker delays accepting new connections:
> > > > > 1. We have a per-listener (and per-processor) metric for connection creation rate:
> > > > > kafka.server:type=socket-server-metrics,listener={listener_name},networkProcessor={#},name=connection-creation-rate
> > > > > 2. We have per-listener metrics that track the amount of time the Acceptor is blocked from accepting connections:
> > > > > kafka.network:type=Acceptor,name=AcceptorBlockedPercent,listener={listener_name}
> > > > > Note that adding per-IP JMX metrics may end up adding a lot of overhead, especially for clusters with a large number of clients and many different IP addresses. If we ever want to add the metric, perhaps we could propose a separate KIP, but that would require some more evaluation of the potential overhead.
> > > > >
> > > > > 901. Yes, I updated the wiki with the approach for enforcing per-IP limits (not dropping right away), as I described in my response to Rajini.
> > > > >
> > > > > 902. Any additional stress testing is always super useful. I am going to have a PR with the first half of the KIP ready soon (broker-wide and per-listener limits). Perhaps it could be worthwhile to see if it makes sense to add stress testing to muckrake tests. Also, check out the connection stress workloads in Trogdor and whether they are sufficient or could be extended:
> > > > > https://github.com/apache/kafka/tree/trunk/tools/src/main/java/org/apache/kafka/trogdor/workload
> > > > >
> > > > > Regards,
> > > > > Anna
> > > > >
> > > > > On Mon, May 18, 2020 at 8:57 AM Rajini Sivaram <rajinisiva...@gmail.com> wrote:
> > > > >
> > > > > > Hi Anna,
> > > > > >
> > > > > > Thanks for the response, sounds good.
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Rajini
> > > > > >
> > > > > > On Sun, May 17, 2020 at 1:38 AM Anna Povzner <a...@confluent.io> wrote:
> > > > > >
> > > > > > > Hi Rajini,
> > > > > > >
> > > > > > > Thanks for reviewing the KIP!
> > > > > > >
> > > > > > > I agree with your suggestion to make the per-IP connection rate quota a dynamic quota for entity type IP. This will allow configuring the connection rate for a particular IP as well. I updated the wiki accordingly.
> > > > > > >
> > > > > > > Your second concern makes sense -- rejecting the connection right away will likely cause a new connection from the same client. I am concerned about delaying new connections for processing later, because if the connections keep coming at a high rate, there may potentially be a large backlog, and connections may start timing out before the broker gets to processing them. For example, if clients come through a proxy, there may potentially be a large number of incoming connections with the same IP.
> > > > > > >
> > > > > > > What do you think about the following option:
> > > > > > > * Once the per-IP connection rate reaches the limit, accept or drop (clean up) the connection after a delay, depending on whether the quota is still violated. We could re-use the mechanism implemented with KIP-306, where the broker delays the response for failed client authentication. The delay will be set to min(delay calculated based on the rate quota, 1 second), which matches the max delay for the request quota.
> > > > > > >
> > > > > > > I think this option is close to your suggestion of delaying accepting per-IP connections that reached the rate limit, but with protection in place to make sure the number of delayed connections does not blow up. What do you think?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Anna
> > > > > > >
> > > > > > > On Sat, May 16, 2020 at 1:09 AM Alexandre Dupriez <alexandre.dupr...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi Anna,
> > > > > > > >
> > > > > > > > Thank you for your answers and explanations.
> > > > > > > >
> > > > > > > > A couple of additional comments:
> > > > > > > >
> > > > > > > > 900. KIP-612 does not intend to dedicate a metric to the throttling of incoming connections. I wonder if such a metric would be handy for monitoring and would help set up metric-based alarming if one wishes to capture this type of incident?
> > > > > > > >
> > > > > > > > 901. Following up on Rajini's point 2 above - from my understanding, this new quota should prevent excess CPU consumption in the Processor#run() method when a new connection has been accepted. Through the throttling in place, connections will be delayed as indicated by the KIP's specification:
> > > > > > > >
> > > > > > > > "If connection creation rate on the broker exceeds the broker-wide limit, the broker will delay accepting a new connection by an amount of time that brings the rate within the limit."
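The rule quoted above, combined with the min(delay calculated from the rate quota, 1 second) cap discussed earlier in the thread, can be sketched numerically as follows. This is an illustrative sketch only, not broker code: the function name and the fixed one-second quota window are assumptions.

```python
def throttle_delay_ms(observed_rate, quota, window_ms=1000, max_delay_ms=1000):
    """Return the delay (ms) that would bring an observed connection creation
    rate back within `quota`, capped at `max_delay_ms` (the one-second cap)."""
    if observed_rate <= quota:
        return 0
    # Time needed to "pay back" the excess rate, measured over one quota window.
    delay_ms = (observed_rate - quota) / quota * window_ms
    return min(int(delay_ms), max_delay_ms)
```

For example, with a quota of 10 connections/sec, an observed rate of 12/sec yields a 200 ms delay, while a rate of 30/sec hits the one-second cap.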
> > > > > > > > You may be referring to the following sentence:
> > > > > > > >
> > > > > > > > "A new broker configuration option will be added to limit the rate at which connections will be accepted for each IP address. New connections for the IP will be dropped once the limit is reached."?
> > > > > > > >
> > > > > > > > 902. It may be interesting to capture the data with and without connection throttling under stress scenarios. You may have these data already. If you need a pair of hands to do some stress tests once you have a POC or a PR, I am happy to contribute :)
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Alexandre
> > > > > > > >
> > > > > > > > On Fri, May 15, 2020 at 12:22, Rajini Sivaram <rajinisiva...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Anna,
> > > > > > > > >
> > > > > > > > > Thanks for the KIP, looks good overall. A couple of comments about per-IP connection quotas:
> > > > > > > > >
> > > > > > > > > 1) Should we consider making the per-IP quota similar to other quotas? Configured as a dynamic quota for entity type IP, with per-IP limits as well as defaults? Perhaps that would fit better than configs?
> > > > > > > > >
> > > > > > > > > 2) The current proposal drops connections after accepting connections over the per-IP limit. We do this in other cases too, but in this case, should we throttle instead? My point is: what is the quota protecting? If we want to limit the rate of accepted connections, then accepting a connection and then dropping it doesn't really help, since that IP is going to reconnect. If we want to rate limit what happens next, i.e. authentication, then throttling the accepted connection so its processing is delayed would perhaps be better?
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > >
> > > > > > > > > Rajini
> > > > > > > > >
> > > > > > > > > On Thu, May 14, 2020 at 4:12 PM David Jacot <dja...@confluent.io> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Anna,
> > > > > > > > > >
> > > > > > > > > > Thanks for your answers and the updated KIP. Looks good to me!
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > David
> > > > > > > > > >
> > > > > > > > > > On Thu, May 14, 2020 at 12:54 AM Anna Povzner <a...@confluent.io> wrote:
> > > > > > > > > >
> > > > > > > > > > > I updated the KIP to add a new broker configuration to limit the connection creation rate per IP: max.connection.creation.rate.per.ip. Once the limit is reached for a particular IP address, the broker will reject connections from that IP (close the connection it accepted) and continue rejecting them until the rate is back within the limit.
> > > > > > > > > > >
> > > > > > > > > > > On Wed, May 13, 2020 at 11:46 AM Anna Povzner <a...@confluent.io> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi David and Alexandre,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks so much for your feedback! Here are my answers:
> > > > > > > > > > > >
> > > > > > > > > > > > 1. Yes, we have seen several cases of clients that create a new connection per produce/consume request. One hypothesis is that someone who is used to connection pooling may accidentally write a Kafka client that creates a new connection every time.
> > > > > > > > > > > >
> > > > > > > > > > > > 2 & 4. That's a good point I hadn't considered. I think it makes sense to provide the ability to limit connection creation per IP as well. This is not hard to implement -- the broker already keeps track of the number of connections per IP, and immediately closes a new connection if it comes from an IP that has reached the connection limit. So, we could additionally track the rate, and close connections from an IP that exceeds the rate. One slight concern is whether keeping track of per-IP rates and quotas adds overhead (CPU and memory). But perhaps it is not a problem if we use expiring sensors.
> > > > > > > > > > > >
> > > > > > > > > > > > It would still make sense to limit the overall connection creation rate for Kafka clusters which are shared among many different applications/clients, since they may spike at the same time, bringing the total rate too high.
> > > > > > > > > > > >
> > > > > > > > > > > > 3. Controlling connection queue sizes only controls the share of time network threads spend creating new connections (and accepting, on the Acceptor thread) vs. doing other work on each Processor iteration. It does not directly control how processing connection creations relates to other processing done by the broker, such as on request handler threads. So, while controlling queue size may mitigate the issue for some workloads, it does not guarantee that. Plus, if we want to limit how many connections are created per IP, the queue size approach would not work, unless we go with a "share" of the queue, which I think further obscures what that setting means (and what we would achieve as an end result). Does this answer the question?
> > > > > > > > > > > >
> > > > > > > > > > > > If there are no objections, I will update the KIP to add per-IP connection rate limits (config and enforcement).
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > >
> > > > > > > > > > > > Anna
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, May 12, 2020 at 11:25 AM Alexandre Dupriez <alexandre.dupr...@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hello,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thank you for the KIP.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I have experienced in the past genuine broker brownouts due to connection storms consuming most of the CPU available on the server, and I think it is useful to protect against them.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I tend to share the questions asked in points 2 and 4 from David. Is there still a risk of denial of service if the limit applies at the listener level, without differentiating between (an) "offending" client(s) and the others?
> > > > > > > > > > > > >
> > > > > > > > > > > > > To come back to point 3 - conceptually, one difference between capping the queue size and throttling as presented in the KIP would come from the time it takes to accept a connection and how that time evolves with the connection rate. Assuming that that time increases monotonically with resource utilization, the admissible rate of connections would decrease as the server becomes more loaded, if the limit were set on queue size.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Alexandre
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, May 12, 2020 at 08:49, David Jacot <dja...@confluent.io> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Anna,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for the KIP! I have a few questions:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. You mention that some clients may create a new connection for each request: "Another example is clients that create a new connection for each produce/consume request". I am curious here, but do we know of any clients behaving like this?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 2. I am a bit concerned by the impact of misbehaving clients on the other ones. Let's say that we define a quota of 10 connections/sec for a broker and that we have a misbehaving application constantly trying to create 20 connections on that broker. That application will constantly hit the quota and always have many pending connections in the queue waiting to be accepted. Regular clients trying to connect would need to wait until all the pending connections ahead of them in the queue are drained in the best case, or won't be able to connect at all in the worst case if the queue is full. Does it sound like a valid concern? How do you see this?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 3. As you mention in the KIP, we use bounded queues which already limit the maximum number of connections that can be accepted. I wonder if we could reach the same goal by making the size of the queue configurable.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 4. Did you consider doing something similar to the connections quota which limits the number of connections per IP? Instead of rate limiting all connection creation, we could perhaps rate limit the number of creations per IP as well. That could perhaps reduce the effect on the other clients. That may be harder to implement, though.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > David
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Mon, May 11, 2020 at 7:58 PM Anna Povzner <a...@confluent.io> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I just created KIP-612 to allow limiting the connection creation rate on brokers, and would like to start a discussion.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-612%3A+Ability+to+Limit+Connection+Creation+Rate+on+Brokers
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Feedback and suggestions are welcome!
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > Anna
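Taken together, the mechanism the thread converges on — track a per-IP connection creation rate and, once the quota is exceeded, delay the connection by up to one second rather than dropping it outright — can be sketched as follows. This is an illustrative model only: the class and method names are invented for the sketch, and the actual broker tracks rates with expiring quota sensors rather than per-IP timestamp queues.

```python
import time
from collections import defaultdict, deque


class PerIpConnectionRateLimiter:
    """Illustrative per-IP limiter: connections beyond the quota are delayed,
    up to max_delay_s, instead of being rejected immediately."""

    def __init__(self, quota_per_sec, window_s=1.0, max_delay_s=1.0):
        self.quota = quota_per_sec
        self.window = window_s
        self.max_delay = max_delay_s
        self.events = defaultdict(deque)  # ip -> timestamps of recent connections

    def on_connection(self, ip, now=None):
        """Record a connection attempt; return the delay (s) to apply, 0 if none."""
        now = time.monotonic() if now is None else now
        q = self.events[ip]
        # Expire events outside the window (a cheap stand-in for expiring sensors).
        while q and now - q[0] > self.window:
            q.popleft()
        q.append(now)
        rate = len(q) / self.window
        if rate <= self.quota:
            return 0.0
        # Delay proportional to the excess rate, capped at max_delay_s.
        return min((rate - self.quota) / self.quota * self.window, self.max_delay)
```

With a quota of 2 connections/sec, a client opening four connections within 300 ms would see the third delayed by 0.5 s and the fourth by the full 1 s cap, while other IPs remain unaffected.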