Hi Luke and Justine,
Are there any thoughts or updates on this? I would love to help with this
as we are hitting this more frequently now.

best,

On Mon, Oct 31, 2022 at 6:15 PM Omnia Ibrahim <o.g.h.ibra...@gmail.com>
wrote:

> Hi Luke and Justine,
>
>> For (3), you said:
>> > - I have some concerns about the impact of this option on the
>> transactional
>> producers, for example, what will happen to an ongoing transaction
>> associated with an expired PID? Would this leave the transactions in a
>> "hanging" state?
>>
>> - How will we notify the client that the transaction can't continue due to
>> an expired PID?
>>
>> - If PID got marked as `expired` this will mean that
>> `admin.DescribeProducers` will not list them which will make
>> *`kafka-transactions.sh
>> --list`* a bit tricky as we can't identify if there are transactions
>> linked
>> to this expired PID or not. The same concern applies to
>> *`kafka-transactions.sh
>> --find-hanging`*.
>>
>> --> Yes, you're right. Those are also concerns for this solution.
>> Currently, there's no way to notify clients about the expiration.
>> Also, the ongoing transactions will be hanging. For the admin cli, we've
>> never thought about it. Good point.
>> In summary, to adopt this solution, there are many issues needed to get
>> fixed.
>>
>
> Justine already clarified that if a PID is attached to an ongoing transaction
> it will not expire, so identifying the transactions shouldn't be a concern
> anymore. The only remaining concern is that this solution will not solve the
> problem if the rogue producer is a transactional producer with hanging
> transactions.
> Anyone who faces this situation will need to abort the hanging transactions
> manually before the solution to expire the PID can work.
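>
> For anyone who hits that case, a minimal sketch of the manual abort using the
> Java Admin client might look like the following (the partition, producer id,
> producer epoch and coordinator epoch below are placeholders that would come
> from `kafka-transactions.sh --find-hanging` or `admin.describeProducers`):
>
> import org.apache.kafka.clients.admin.AbortTransactionSpec;
> import org.apache.kafka.clients.admin.Admin;
> import org.apache.kafka.common.TopicPartition;
> import java.util.Properties;
>
> public class AbortHangingTransaction {
>     public static void main(String[] args) throws Exception {
>         Properties props = new Properties();
>         props.put("bootstrap.servers", "localhost:9092");
>         try (Admin admin = Admin.create(props)) {
>             // Placeholder values describing the hanging transaction to abort.
>             AbortTransactionSpec spec = new AbortTransactionSpec(
>                 new TopicPartition("my-topic", 0), // partition with the hanging txn
>                 12345L,                            // producer id
>                 (short) 0,                         // producer epoch
>                 7);                                // coordinator epoch
>             admin.abortTransaction(spec).all().get();
>         }
>     }
> }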
>
>> --> Yes, I mean KafkaPrinciple (sorry, I didn't make it clear)
>> Yes, We were thinking about throttling by KafkaPrinciple. Client Id is
>> also
>> workable.
>> It's just these 2 attributes are not required.
>> That is, it's possible we take all clients as the same one: {default
>> KafkaPrinciple + default clientID}, and apply throttling on it.
>> Do you have any thoughts about it?
>> Maybe skip throttling for {default KafkaPrinciple + default clientID}
>>
>
> Throttling for the default KafkaPrincipal and default clientId is useful when
> we need a hard limit on the whole cluster and whoever is running the cluster
> doesn't know the clientIds, or when a KafkaPrincipal is reused between
> different producer applications.
> I usually find it more helpful to apply throttling only to the rogue clients
> once I identify them, without punishing everyone else on the cluster.
> However, there are two problems with this:
> - There's no easy way at the moment to link PIDs to a clientId or
> KafkaPrincipal. This needs to be addressed first.
> - As Justine pointed out, throttling means we either block all requests or
> have to store the request in memory, and both options have downsides for the
> producer experience.
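>
> Just for comparison, today's per-user/per-clientId quotas (the
> `/config/users/<user>/clients/<client-id>` entity) are applied roughly as in
> the sketch below. This only covers existing quota keys such as
> producer_byte_rate; a "producer id" quota key doesn't exist yet, and the
> principal/clientId values are placeholders:
>
> import org.apache.kafka.clients.admin.Admin;
> import org.apache.kafka.common.quota.ClientQuotaAlteration;
> import org.apache.kafka.common.quota.ClientQuotaEntity;
> import java.util.List;
> import java.util.Map;
> import java.util.Properties;
>
> public class ThrottleByUserAndClientId {
>     public static void main(String[] args) throws Exception {
>         Properties props = new Properties();
>         props.put("bootstrap.servers", "localhost:9092");
>         try (Admin admin = Admin.create(props)) {
>             // Quota entity equivalent to /config/users/<user>/clients/<client-id>
>             ClientQuotaEntity entity = new ClientQuotaEntity(Map.of(
>                 ClientQuotaEntity.USER, "some-principal",
>                 ClientQuotaEntity.CLIENT_ID, "rogue-client"));
>             // Existing produce byte-rate quota; a PID quota would need a new key.
>             ClientQuotaAlteration.Op op =
>                 new ClientQuotaAlteration.Op("producer_byte_rate", 1048576.0);
>             admin.alterClientQuotas(
>                 List.of(new ClientQuotaAlteration(entity, List.of(op)))).all().get();
>         }
>     }
> }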
>
> I recently had another discussion with my team and it does seem like there
>> should be a way to make it more clear to the clients what is going on. A
>> lot of this protocol is implicit. I'm wondering if maybe there is a way to
>> improve the story for newer clients. (Ie if we choose to expire based on a
>> size limit, we should include a response indicating the ID has expired.)
>> We
>> also discussed ways to redefine the guarantees so that users who have
>> stronger idempotency requirements can ensure them (over
>> availability/memory
>> concerns). Let me know if you have any ideas here.
>>
>
> It may be easier to improve the experience for new clients. However, if we
> improve only the new clients, we may still need a way to help teams who run
> Kafka with rogue clients on old versions, at least by giving them an easy way
> to identify the clientId or KafkaPrincipal that generated these PIDs.
>
> For context, it's currently very tricky to even identify which clientId is
> creating all these PIDs that caused the OOM, which is a contributing part of
> the issue at the moment. So maybe one option here could be adding a new
> metric that tracks the number of generated PIDs per clientId (a rough sketch
> follows below). This would help the team who runs the Kafka cluster to
> - contact these rogue clients and ask them to fix their applications, or to
> upgrade to a newer client version if it has a better experience.
> - or, if we end up with a throttling solution, identify which clientId needs
> to be throttled.
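>
> A very rough sketch of what such a per-clientId counter could look like on
> the broker side (the class and method names here are made up for
> illustration; this is not an existing Kafka metric or API):
>
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.atomic.LongAdder;
>
> public class ProducerIdsPerClientId {
>     private final Map<String, LongAdder> countsByClientId = new ConcurrentHashMap<>();
>
>     // Called when the broker allocates a new PID for a given clientId,
>     // e.g. while handling an InitProducerId request.
>     public void recordNewProducerId(String clientId) {
>         countsByClientId.computeIfAbsent(clientId, id -> new LongAdder()).increment();
>     }
>
>     // Called when a PID owned by this clientId expires or is removed.
>     public void removeProducerId(String clientId) {
>         LongAdder counter = countsByClientId.get(clientId);
>         if (counter != null) {
>             counter.decrement();
>         }
>     }
>
>     // Exposed as a per-clientId gauge so operators can spot rogue clients.
>     public long producerIdCount(String clientId) {
>         LongAdder counter = countsByClientId.get(clientId);
>         return counter == null ? 0 : counter.sum();
>     }
> }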
>
> Maybe we can start with a solution for identifying the rogue clients first
> and keep looking for a solution to limit them. What do you think?
>
> Thanks
>
> On Tue, Oct 18, 2022 at 5:24 PM Justine Olshan
> <jols...@confluent.io.invalid> wrote:
>
>> Oops.  I realized I just replied to Omnia 🤦‍♀️
>>
>> Here was my response for the mailing thread:
>>
>> Hey Omnia,
>> Sorry to hear this is a problem for you as well. :(
>> > * I have some concerns about the impact of this option on the
>> transactional producers, for example, what will happen to an ongoing
>> transaction associated with an expired PID? Would this leave the
>> transactions in a "hanging" state?*
>> We currently check if a transaction is ongoing and do not expire the
>> producer ID if it has an ongoing transaction. I suspect we will continue
>> to
>> do this with any solution we pick.
>>
>> My team members and I looked a bit into the throttling case and it can get
>> a bit tricky since it means we need to throttle the produce request before
>> it is processed. This means we either block all requests or have to store
>> the request in memory (which is not great if we are trying to save
>> memory).
>>
>> I recently had another discussion with my team and it does seem like there
>> should be a way to make it more clear to the clients what is going on. A
>> lot of this protocol is implicit. I'm wondering if maybe there is a way to
>> improve the story for newer clients. (Ie if we choose to expire based on a
>> size limit, we should include a response indicating the ID has expired.)
>> We
>> also discussed ways to redefine the guarantees so that users who have
>> stronger idempotency requirements can ensure them (over
>> availability/memory
>> concerns). Let me know if you have any ideas here.
>>
>> Thanks again for commenting here, hopefully we can come to a good
>> solution.
>>
>> On Tue, Oct 18, 2022 at 1:11 AM Luke Chen <show...@gmail.com> wrote:
>>
>> > Hi Omnia,
>> >
>> > Thanks for your reply.
>> >
>> > For (3), you said:
>> > > - I have some concerns about the impact of this option on the
>> > transactional
>> > producers, for example, what will happen to an ongoing transaction
>> > associated with an expired PID? Would this leave the transactions in a
>> > "hanging" state?
>> >
>> > - How will we notify the client that the transaction can't continue due
>> to
>> > an expired PID?
>> >
>> > - If PID got marked as `expired` this will mean that
>> > `admin.DescribeProducers` will not list them which will make
>> > *`kafka-transactions.sh
>> > --list`* a bit tricky as we can't identify if there are transactions
>> linked
>> > to this expired PID or not. The same concern applies to
>> > *`kafka-transactions.sh
>> > --find-hanging`*.
>> >
>> > --> Yes, you're right. Those are also concerns for this solution.
>> > Currently, there's no way to notify clients about the expiration.
>> > Also, the ongoing transactions will be hanging. For the admin cli, we've
>> > never thought about it. Good point.
>> > In summary, to adopt this solution, there are many issues needed to get
>> > fixed.
>> >
>> >
>> > For (5), you said:
>> > > I am assuming you mean KafkaPrincipal here! If so is your concern here
>> > that
>> > those good clients that use the same principal as a rogue one will get
>> > throttled?
>> >
>> > If this is the case, then I believe it should be okay as other
>> throttling
>> > in Kafka on *`/config/users/<user>`* has the same behaviour.
>> >
>> >
>> > What about applying limit/throttling to
>> > *`/config/users/<user>/clients/<client-id>`
>> > *similar to what we have with client quota? This should reduce the
>> concern
>> > about throttling good clients, right?
>> >
>> > --> Yes, I mean KafkaPrinciple (sorry, I didn't make it clear)
>> > Yes, We were thinking about throttling by KafkaPrinciple. Client Id is
>> > also workable.
>> > It's just these 2 attributes are not required.
>> > That is, it's possible we take all clients as the same one: {default
>> > KafkaPrinciple + default clientID}, and apply throttling on it.
>> > Do you have any thoughts about it?
>> > Maybe skip throttling for {default KafkaPrinciple + default clientID} ?
>> >
>> > Luke
>> >
>> >
>> >
>> > On Sat, Oct 15, 2022 at 2:33 AM Omnia Ibrahim <o.g.h.ibra...@gmail.com>
>> > wrote:
>> >
>> >> Hi Luke & Justine,
>> >> Thanks for looking into this issue, we have been experiencing this
>> >> because of rogue clients as well.
>> >>
>> >> > 3. Having a limit to the number of active producer IDs (sort of like
>> an
>> >> LRU
>> >> >cache)
>> >> >-> The idea here is that if we hit a misconfigured client, we will expire
>> >> >the older entries. The concern here is that we risk losing idempotency
>> >> >guarantees, and currently we don't have a way to notify clients about
>> >> >losing idempotency guarantees. Besides, the least recently used entries
>> >> >that get removed are not always from the "bad" clients.
>> >>
>> >> - I have some concerns about the impact of this option on the
>> >> transactional
>> >> producers, for example, what will happen to an ongoing transaction
>> >> associated with an expired PID? Would this leave the transactions in a
>> >> "hanging" state?
>> >>
>> >> - How will we notify the client that the transaction can't continue
>> due to
>> >> an expired PID?
>> >>
>> >> - If PID got marked as `expired` this will mean that
>> >> `admin.DescribeProducers` will not list them which will make
>> >> *`kafka-transactions.sh
>> >> --list`* a bit tricky as we can't identify if there are transactions
>> >> linked
>> >> to this expired PID or not. The same concern applies to
>> >> *`kafka-transactions.sh
>> >> --find-hanging`*.
>> >>
>> >>
>> >> >5. limit/throttling the producer id based on the principle
>> >> >-> Although we can limit the impact to a certain principle with this
>> >> idea,
>> >> >same concern still exists as solution #1 #2.
>> >>
>> >> I am assuming you mean KafkaPrincipal here! If so is your concern here
>> >> that
>> >> those good clients that use the same principal as a rogue one will get
>> >> throttled?
>> >>
>> >> If this is the case, then I believe it should be okay as other
>> throttling
>> >> in Kafka on *`/config/users/<user>`* has the same behaviour.
>> >>
>> >>
>> >> What about applying limit/throttling to
>> >> *`/config/users/<user>/clients/<client-id>`
>> >> *similar to what we have with client quota? This should reduce the
>> concern
>> >> about throttling good clients, right?
>> >>
>> >> best,
>> >>
>> >> Omnia
>> >>
>> >> On Tue, Oct 11, 2022 at 4:18 AM Luke Chen <show...@gmail.com> wrote:
>> >>
>> >> > Bump this thread to see if there are any comments/thoughts.
>> >> > Thanks.
>> >> >
>> >> > Luke
>> >> >
>> >> > On Mon, Sep 26, 2022 at 11:06 AM Luke Chen <show...@gmail.com>
>> wrote:
>> >> >
>> >> > > Hi devs,
>> >> > >
>> >> > > As stated in the motivation section in KIP-854
>> >> > > <
>> >> >
>> >>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-854+Separate+configuration+for+producer+ID+expiry
>> >> > >:
>> >> > >
>> >> > > With idempotent producers becoming the default in Kafka, this means
>> >> that
>> >> > > unless otherwise specified, all new producers will be given
>> producer
>> >> IDs.
>> >> > > Some (inefficient) applications may now create many
>> non-transactional
>> >> > > idempotent producers. Each of these producers will be assigned a
>> >> producer
>> >> > > ID and these IDs and their metadata are stored in the broker memory,
>> >> > > which might cause brokers to run out of memory.
>> >> > >
>> >> > > Justine (in cc.) and I and some other team members are working on
>> >> > > solutions for this issue. But none of them solves it completely
>> >> > > without side effects. Among them, the trade-off between "availability"
>> >> > > and "idempotency guarantees" is the one we can't decide on. Some of
>> >> > > these solutions sacrifice availability of produce (1, 2, 5) and others
>> >> > > sacrifice idempotency guarantees (3). It could be useful to know if
>> >> > > people generally have a preference one way or the other, or what other
>> >> > > better solutions there might be.
>> >> > >
>> >> > > Here are the proposals we came up with:
>> >> > >
>> >> > > 1. Limit the total active producer ID allocation number.
>> >> > > -> This is the simplest solution. But since the OOM issue is usually
>> >> > > caused by a rogue or misconfigured client, this solution might
>> >> > > "punish" good clients and block them from sending messages.
>> >> > >
>> >> > > 2. Throttling the producer ID allocation rate
>> >> > > -> Same concern as the solution #1.
>> >> > >
>> >> > > 3. Having a limit to the number of active producer IDs (sort of
>> like
>> >> an
>> >> > > LRU cache)
>> >> > > -> The idea here is that if we hit a misconfigured client, we will
>> >> > > expire the older entries. The concern here is that we risk losing
>> >> > > idempotency guarantees, and currently we don't have a way to notify
>> >> > > clients about losing idempotency guarantees. Besides, the least
>> >> > > recently used entries that get removed are not always from the "bad"
>> >> > > clients. (A tiny sketch of the LRU idea follows after this list.)
>> >> > >
>> >> > > 4. allow clients to "close" the producer ID usage
>> >> > > -> We can provide a way for a producer to "close" its producer ID
>> >> > > usage. Currently, we only have the INIT_PRODUCER_ID request to
>> >> > > allocate one. After that, we'll keep the producer ID metadata in the
>> >> > > broker even if the producer is "closed". With a close API (ex:
>> >> > > END_PRODUCER_ID), we can remove the entry on the broker side. On the
>> >> > > client side, we can send it when the producer closes. The concern is
>> >> > > that old clients (including non-java clients) will still suffer from
>> >> > > the OOM issue.
>> >> > >
>> >> > > 5. limit/throttling the producer id based on the principle
>> >> > > -> Although we can limit the impact to a certain principle with
>> this
>> >> > idea,
>> >> > > same concern still exists as solution #1 #2.
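>> >> > >
>> >> > > A tiny illustration of the LRU idea from (3), assuming nothing about
>> >> > > the actual broker implementation: java.util's LinkedHashMap in access
>> >> > > order with a capped size, evicting the least recently used producer
>> >> > > ID entry:
>> >> > >
>> >> > > import java.util.LinkedHashMap;
>> >> > > import java.util.Map;
>> >> > >
>> >> > > public class ProducerIdLruCache<M> extends LinkedHashMap<Long, M> {
>> >> > >     private final int maxActiveProducerIds;
>> >> > >
>> >> > >     public ProducerIdLruCache(int maxActiveProducerIds) {
>> >> > >         // accessOrder = true so lookups refresh an entry's recency
>> >> > >         super(16, 0.75f, true);
>> >> > >         this.maxActiveProducerIds = maxActiveProducerIds;
>> >> > >     }
>> >> > >
>> >> > >     @Override
>> >> > >     protected boolean removeEldestEntry(Map.Entry<Long, M> eldest) {
>> >> > >         // Evict the least recently used PID once the limit is exceeded;
>> >> > >         // that producer silently loses its idempotency guarantee.
>> >> > >         return size() > maxActiveProducerIds;
>> >> > >     }
>> >> > > }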
>> >> > >
>> >> > > Any thoughts/feedback are welcomed.
>> >> > >
>> >> > > Thank you.
>> >> > > Luke
>> >> > >
>> >> >
>> >>
>> >
>>
>
