Hi Luke and Justine,

Are there any thoughts or updates on this? I would love to help with this, as we are hitting this issue more frequently now.
best,

On Mon, Oct 31, 2022 at 6:15 PM Omnia Ibrahim <o.g.h.ibra...@gmail.com> wrote:

> Hi Luke and Justine,
>
>> For (3), you said:
>> > - I have some concerns about the impact of this option on the transactional producers, for example, what will happen to an ongoing transaction associated with an expired PID? Would this leave the transactions in a "hanging" state?
>> >
>> > - How will we notify the client that the transaction can't continue due to an expired PID?
>> >
>> > - If PID got marked as `expired` this will mean that `admin.DescribeProducers` will not list them which will make *`kafka-transactions.sh --list`* a bit tricky as we can't identify if there are transactions linked to this expired PID or not. The same concern applies to *`kafka-transactions.sh --find-hanging`*.
>>
>> --> Yes, you're right. Those are also concerns for this solution.
>> Currently, there's no way to notify clients about the expiration.
>> Also, the ongoing transactions will be hanging. For the admin CLI, we've never thought about it. Good point.
>> In summary, to adopt this solution, there are many issues that need to be fixed.
>>
>
> Justine already clarified that a PID attached to an ongoing transaction will not expire, so identifying the transactions shouldn't be a concern anymore.
> The only concern here is that this solution will not solve the problem if the rogue producer is a transactional producer with hanging transactions.
> Anyone who faces this situation will need to abort the hanging transactions manually; only then can the solution to expire a PID work.
>
>> --> Yes, I mean KafkaPrincipal (sorry, I didn't make it clear)
>> Yes, we were thinking about throttling by KafkaPrincipal. Client ID is also workable.
>> It's just that these two attributes are not required.
>> That is, it's possible we treat all clients as the same one: {default KafkaPrincipal + default clientID}, and apply throttling on it.
>> Do you have any thoughts about it?
>> Maybe skip throttling for {default KafkaPrincipal + default clientID}?
>>
>
> Throttling for the default KafkaPrincipal and default clientID is useful when we need a hard limit on the whole cluster and whoever is running the cluster doesn't know the client IDs, or when a KafkaPrincipal is reused between different producer applications.
> I usually find it helpful to have a way to apply throttling only to the rogue clients once I identify them, without punishing everyone on the cluster. However, there are two problems with this:
> - There's no easy way at the moment to link PIDs to a clientId or KafkaPrincipal. This needs to be addressed first.
> - Justine's comment on throttling: it means we either block all requests or have to store the requests in memory, and both cases have downsides for the producer experience.
>
>> I recently had another discussion with my team and it does seem like there should be a way to make it more clear to the clients what is going on. A lot of this protocol is implicit. I'm wondering if maybe there is a way to improve the story for newer clients. (i.e., if we choose to expire based on a size limit, we should include a response indicating the ID has expired.)
>> We also discussed ways to redefine the guarantees so that users who have stronger idempotency requirements can ensure them (over availability/memory concerns). Let me know if you have any ideas here.
>>
>
> It may be easier to improve the experience for new clients. However, if we improve only the new clients, we may still need a way to help teams who run Kafka with rogue clients on old versions, by at least giving them an easy way to identify the clientId or KafkaPrincipal that generated these PIDs.
>
> For context, it's very tricky to even identify which clientId is creating all these PIDs that caused the OOM, which is a contributing part of the issue at the moment.
> So maybe one option here could be adding a new metric that tracks the number of generated PIDs per clientId. This will help the team who runs the Kafka cluster to
> - contact the owners of these rogue clients and ask them to fix their clients, or upgrade to a new client if the new client version has a better experience.
> - or, if we end up with a throttling solution, identify which clientId needs to be throttled.
>
> Maybe we can start with a solution for identifying the rogue clients first and keep looking for a solution to limit them. What do you think?
>
> Thanks
>
> On Tue, Oct 18, 2022 at 5:24 PM Justine Olshan <jols...@confluent.io.invalid> wrote:
>
>> Oops. I realized I just replied to Omnia 🤦♀️
>>
>> Here was my response for the mailing thread:
>>
>> Hey Omnia,
>> Sorry to hear this is a problem for you as well. :(
>> > * I have some concerns about the impact of this option on the transactional producers, for example, what will happen to an ongoing transaction associated with an expired PID? Would this leave the transactions in a "hanging" state?*
>> We currently check if a transaction is ongoing and do not expire the producer ID if it has an ongoing transaction. I suspect we will continue to do this with any solution we pick.
>>
>> My team members and I looked a bit into the throttling case and it can get a bit tricky since it means we need to throttle the produce request before it is processed. This means we either block all requests or have to store the request in memory (which is not great if we are trying to save memory).
>>
>> I recently had another discussion with my team and it does seem like there should be a way to make it more clear to the clients what is going on. A lot of this protocol is implicit. I'm wondering if maybe there is a way to improve the story for newer clients. (i.e., if we choose to expire based on a size limit, we should include a response indicating the ID has expired.)
>> We also discussed ways to redefine the guarantees so that users who have stronger idempotency requirements can ensure them (over availability/memory concerns). Let me know if you have any ideas here.
>>
>> Thanks again for commenting here, hopefully we can come to a good solution.
>>
>> On Tue, Oct 18, 2022 at 1:11 AM Luke Chen <show...@gmail.com> wrote:
>>
>> > Hi Omnia,
>> >
>> > Thanks for your reply.
>> >
>> > For (3), you said:
>> > > - I have some concerns about the impact of this option on the transactional producers, for example, what will happen to an ongoing transaction associated with an expired PID? Would this leave the transactions in a "hanging" state?
>> > >
>> > > - How will we notify the client that the transaction can't continue due to an expired PID?
>> > >
>> > > - If PID got marked as `expired` this will mean that `admin.DescribeProducers` will not list them which will make *`kafka-transactions.sh --list`* a bit tricky as we can't identify if there are transactions linked to this expired PID or not. The same concern applies to *`kafka-transactions.sh --find-hanging`*.
>> >
>> > --> Yes, you're right. Those are also concerns for this solution.
>> > Currently, there's no way to notify clients about the expiration.
>> > Also, the ongoing transactions will be hanging. For the admin CLI, we've never thought about it. Good point.
>> > In summary, to adopt this solution, there are many issues that need to be fixed.
>> >
>> >
>> > For (5), you said:
>> > > I am assuming you mean KafkaPrincipal here! If so is your concern here that those good clients that use the same principal as a rogue one will get throttled?
>> > >
>> > > If this is the case, then I believe it should be okay as other throttling in Kafka on *`/config/users/<user>`* has the same behaviour.
>> > >
>> > > What about applying limit/throttling to *`/config/users/<user>/clients/<client-id>`* similar to what we have with client quota? This should reduce the concern about throttling good clients, right?
>> >
>> > --> Yes, I mean KafkaPrincipal (sorry, I didn't make it clear)
>> > Yes, we were thinking about throttling by KafkaPrincipal. Client ID is also workable.
>> > It's just that these two attributes are not required.
>> > That is, it's possible we treat all clients as the same one: {default KafkaPrincipal + default clientID}, and apply throttling on it.
>> > Do you have any thoughts about it?
>> > Maybe skip throttling for {default KafkaPrincipal + default clientID}?
>> >
>> > Luke
>> >
>> >
>> >
>> > On Sat, Oct 15, 2022 at 2:33 AM Omnia Ibrahim <o.g.h.ibra...@gmail.com> wrote:
>> >
>> >> Hi Luke & Justine,
>> >> Thanks for looking into this issue, we have been experiencing this because of rogue clients as well.
>> >>
>> >> > 3. Having a limit to the number of active producer IDs (sort of like an LRU cache)
>> >> > -> The idea here is that if we hit a misconfigured client, we will expire the older entries. The concern here is that we risk losing idempotency guarantees, and currently, we don't have a way to notify clients about losing idempotency guarantees. Besides, the least recently used entries that get removed are not always from the "bad" clients.
>> >>
>> >> - I have some concerns about the impact of this option on the transactional producers, for example, what will happen to an ongoing transaction associated with an expired PID? Would this leave the transactions in a "hanging" state?
>> >>
>> >> - How will we notify the client that the transaction can't continue due to an expired PID?
>> >>
>> >> - If PID got marked as `expired` this will mean that `admin.DescribeProducers` will not list them which will make *`kafka-transactions.sh --list`* a bit tricky as we can't identify if there are transactions linked to this expired PID or not. The same concern applies to *`kafka-transactions.sh --find-hanging`*.
>> >>
>> >>
>> >> > 5. limit/throttling the producer id based on the principal
>> >> > -> Although we can limit the impact to a certain principal with this idea, the same concern still exists as with solutions #1 and #2.
>> >>
>> >> I am assuming you mean KafkaPrincipal here! If so is your concern here that those good clients that use the same principal as a rogue one will get throttled?
>> >>
>> >> If this is the case, then I believe it should be okay as other throttling in Kafka on *`/config/users/<user>`* has the same behaviour.
>> >>
>> >>
>> >> What about applying limit/throttling to *`/config/users/<user>/clients/<client-id>`* similar to what we have with client quota? This should reduce the concern about throttling good clients, right?
>> >>
>> >> best,
>> >>
>> >> Omnia
>> >>
>> >> On Tue, Oct 11, 2022 at 4:18 AM Luke Chen <show...@gmail.com> wrote:
>> >>
>> >> > Bump this thread to see if there are any comments/thoughts.
>> >> > Thanks.
>> >> >
>> >> > Luke
>> >> >
>> >> > On Mon, Sep 26, 2022 at 11:06 AM Luke Chen <show...@gmail.com> wrote:
>> >> >
>> >> > > Hi devs,
>> >> > >
>> >> > > As stated in the motivation section in KIP-854
>> >> > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-854+Separate+configuration+for+producer+ID+expiry>:
>> >> > >
>> >> > > With idempotent producers becoming the default in Kafka, this means that unless otherwise specified, all new producers will be given producer IDs.
>> >> > > Some (inefficient) applications may now create many non-transactional idempotent producers. Each of these producers will be assigned a producer ID and these IDs and their metadata are stored in the broker memory, which might cause brokers to run out of memory.
>> >> > >
>> >> > > Justine (in cc.) and I and some other team members are working on the solutions for this issue. But none of them solves it completely without side effects. Among them, we can't decide whether to sacrifice "availability" or "idempotency guarantees". Some of these solutions sacrifice availability of produce (1,2,5) and others sacrifice idempotency guarantees (3). It could be useful to know if people generally have a preference one way or the other. Or what other better solutions there might be.
>> >> > >
>> >> > > Here are the proposals we came up with:
>> >> > >
>> >> > > 1. Limit the total active producer ID allocation number.
>> >> > > -> This is the simplest solution. But since the OOM issue is usually caused by a rogue or misconfigured client, this solution might "punish" good clients by blocking them from sending messages.
>> >> > >
>> >> > > 2. Throttling the producer ID allocation rate
>> >> > > -> Same concern as the solution #1.
>> >> > >
>> >> > > 3. Having a limit to the number of active producer IDs (sort of like an LRU cache)
>> >> > > -> The idea here is that if we hit a misconfigured client, we will expire the older entries. The concern here is that we risk losing idempotency guarantees, and currently, we don't have a way to notify clients about losing idempotency guarantees. Besides, the least recently used entries that get removed are not always from the "bad" clients.
>> >> > >
>> >> > > 4. Allow clients to "close" the producer ID usage
>> >> > > -> We can provide a way for the producer to "close" producer ID usage. Currently, we only have the INIT_PRODUCER_ID request to allocate one. After that, we'll keep the producer ID metadata in the broker even if the producer is "closed". With a close API (e.g., END_PRODUCER_ID), we can remove the entry on the broker side. On the client side, we can send it when the producer closes. The concern is that old clients (including non-Java clients) will still suffer from the OOM issue.
>> >> > >
>> >> > > 5. limit/throttling the producer id based on the principal
>> >> > > -> Although we can limit the impact to a certain principal with this idea, the same concern still exists as with solutions #1 and #2.
>> >> > >
>> >> > > Any thoughts/feedback are welcome.
>> >> > >
>> >> > > Thank you.
>> >> > > Luke
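
To make proposal 3 a bit more concrete, here is a rough sketch of an LRU-bounded producer ID table that skips entries with an ongoing transaction, as Justine described. This is not Kafka code: the class `BoundedProducerIdTable`, its `Entry` type, and the `touch` method are all hypothetical names made up for illustration, and real broker state would hold far more than one flag per producer. Returning the evicted ID is only one possible hook for telling a client later that its ID has expired.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of proposal 3; none of these names exist in Kafka. */
final class BoundedProducerIdTable {

    /** Hypothetical per-producer state, reduced to the one flag this sketch needs. */
    static final class Entry {
        boolean hasOngoingTransaction;
    }

    private final int maxActiveIds;

    // accessOrder=true: iteration goes from least to most recently used.
    private final LinkedHashMap<Long, Entry> entries =
            new LinkedHashMap<>(16, 0.75f, true);

    BoundedProducerIdTable(int maxActiveIds) {
        this.maxActiveIds = maxActiveIds;
    }

    /**
     * Records activity for a producer ID. If the table is over its limit,
     * evicts the least recently used ID that has no ongoing transaction
     * and returns it; otherwise returns an empty Optional.
     */
    Optional<Long> touch(long producerId, boolean hasOngoingTransaction) {
        Entry entry = entries.get(producerId); // get() also refreshes the LRU position
        if (entry == null) {
            entry = new Entry();
            entries.put(producerId, entry);
        }
        entry.hasOngoingTransaction = hasOngoingTransaction;

        if (entries.size() <= maxActiveIds) {
            return Optional.empty();
        }
        Iterator<Map.Entry<Long, Entry>> it = entries.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, Entry> candidate = it.next();
            long candidateId = candidate.getKey();
            // Never evict the ID we just touched or one with an ongoing transaction.
            if (candidateId != producerId && !candidate.getValue().hasOngoingTransaction) {
                it.remove();
                return Optional.of(candidateId);
            }
        }
        // Every other entry is mid-transaction: the table stays over its limit.
        return Optional.empty();
    }

    boolean contains(long producerId) {
        return entries.containsKey(producerId);
    }

    int size() {
        return entries.size();
    }
}
```

The last branch shows the limitation raised earlier in the thread: if the rogue producers are transactional with hanging transactions, nothing can be evicted until those transactions are aborted.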