Hey all,
Quick update to this KIP. While working on the PR and tests I realized that
we have a hardcoded value for how often we clean up producer IDs. Currently
this value is 10 minutes and is defined in LogManager.
I thought for better testing and ease of use of the new configuration, we
should also be able to configure the cleanup interval.

Here is the new configuration I'm hoping to add. I also added it to the KIP.
name: producer.id.expiration.check.interval.ms
description: The interval at which to remove producer IDs that have expired
due to producer.id.expiration.ms passing
default: 600000 (10 minutes)
valid values: [1,...]
priority: low
update-mode: read-only

I left the default as the current hardcoded value to avoid disruptions. If
there are any issues with this change let me know.
Thanks,
Justine

On Fri, Aug 5, 2022 at 1:40 PM Justine Olshan <jols...@confluent.io> wrote:

> Awesome. Thanks Tom!
>
> I plan to open this KIP vote at the start of next week.
> Thanks all for the discussion! Let me know if there is anything else. :)
>
> Justine
>
> On Wed, Aug 3, 2022 at 11:32 AM Tom Bentley <tbent...@redhat.com> wrote:
>
>> Hi Justine,
>>
>> That all seems reasonable to me, thanks!
>>
>> On Wed, 3 Aug 2022 at 19:14, Justine Olshan <jols...@confluent.io.invalid
>> >
>> wrote:
>>
>> > Hi Tom and Ismael,
>> >
>> > 1. Yes, there are definitely many ways to improve this issue and I plan
>> to
>> > write followup KIPs to address some of the larger changes.
>> > Just wanted to get this simple fix in as a short term measure to prevent
>> > issues with too many producer IDs in the cache. Stay tuned :)
>> >
>> > 2. I did have some offline discussion about informing the client. I
>> think
>> > for this specific KIP the default behavior in practice should not change
>> > enough to require this information to go back to the client. In other
>> > words, a reasonable configuration should not regress behavior. However,
>> > with the further changes I mention in 1, perhaps this is something we
>> want
>> > to do. And yes -- unfortunately the current state of Kafka is no longer
>> > totally consistent with KIP-98. This is something we probably want to
>> > clarify in the future.
>> >
>> > 3. I will update the config to mention it is not dynamic. I think since
>> the
>> > transactional id configuration is read-only, this should be too.
>> >
>> > 4. I can update this wording.
>> >
>> > 5. I think there are definitely benefits to the name `
>> > idempotent.pid.expiration.ms` but there are other ways this could cause
>> > confusion. And to be clear -- the configuration can expire a producer ID
>> > for a transactional producer as long as there isn't an ongoing
>> transaction.
>> >
>> > Let me know if you have any questions and thanks for taking a look!
>> >
>> > Justine
>> >
>> > On Wed, Aug 3, 2022 at 9:30 AM Ismael Juma <ism...@juma.me.uk> wrote:
>> >
>> > > Regarding 1, more can certainly be done, but I think it would be
>> > > complementary. As such, I think this KIP stands on its own and
>> additional
>> > > improvements can be handled via future KIPs (unless Justine wants to
>> > > combine things, of course).
>> > >
>> > > Ismael
>> > >
>> > > On Wed, Aug 3, 2022 at 9:12 AM Tom Bentley <tbent...@redhat.com>
>> wrote:
>> > >
>> > > > Hi Justine,
>> > > >
>> > > > Thanks for the KIP! I can see that this is a pragmatic attempt to
>> > > address a
>> > > > nasty problem. I have a few questions:
>> > > >
>> > > > 1. The KIP makes the problem significantly harder to trigger, but
>> > doesn't
>> > > > eliminate it entirely. How confident are you that it will be
>> sufficient
>> > > in
>> > > > practice? We can point to applications which are creating idempotent
>> > > > producers at a high rate and say they're broken, but that doesn't do
>> > > > anything to defend the broker from an interaction pattern that
>> differs
>> > > only
>> > > > in rate from a "good application". Did you consider a new quota to
>> > limit
>> > > > the rate at which a (principal, clientId) can allocate new PIDs?
>> > > >
>> > > > 2. The KIP contains this sentence: "when an idempotent producer’s ID
>> > > > expires, it silently loses its idempotency guarantees." That's at
>> odds
>> > > with
>> > > > my reading of "PID expiration" in the KIP-98 design[1], but it does
>> > seem
>> > > > consistent with a (brief!) look at the code. I accept that the risk
>> > > should
>> > > > be minimal so long as the expiration time is > the producer's
>> delivery
>> > > > timeout, but it would still be nice if we could detect this
>> situation
>> > and
>> > > > return an error to the client. Is there a reason for the apparent
>> > > deviation
>> > > > from KIP-98 (or am I misreading the code?)
>> > > >
>> > > > 3. Could the KIP be explicit on whether the new config will be
>> > > dynamically
>> > > > changeable?
>> > > >
>> > > > 4. The description of producer.id.expiration.ms mentions the
>> > > > ProducerStateManager, which will mean nothing to a normal user. We
>> > could
>> > > > probably change it to "a topic partition leader" without loss of
>> > meaning.
>> > > >
>> > > > 5. The description also says "Producer IDs will not expire while a
>> > > > transaction associated to them is still ongoing." To me this
>> suggests
>> > > that
>> > > > a more intuitive name for this config (from the user PoV) would
>> include
>> > > > "idempotent", since it does not cover the transactional case. (I
>> would
>> > > > suggest "idempotent.pid.expiration.ms" (c.f.
>> > > > transactional.id.expiration.ms),
>> > > > but the distinction between "id" and "pid" is easily missed–even if
>> > it's
>> > > > technically correct–so I'm not sure it's better than what you're
>> > > > proposing). I only mention it in case it prompts someone else to
>> find a
>> > > > better name.
>> > > >
>> > > > Thanks again,
>> > > >
>> > > > Tom
>> > > >
>> > > > [1]:
>> > > >
>> > > >
>> > >
>> >
>> https://docs.google.com/document/d/11Jqy_GjUGtdXJK94XGsEIK7CP1SnQGdp2eF0wSw9ra8/edit#heading=h.loujdamc9ptj
>> > > >
>> > > > On Tue, 2 Aug 2022 at 22:00, Justine Olshan
>> > <jols...@confluent.io.invalid
>> > > >
>> > > > wrote:
>> > > >
>> > > > > I've updated the KIP to make the new minimum value 1 and remove
>> the
>> > -1
>> > > > > configuration.
>> > > > > I've also added the low priority to the configuration and edited
>> the
>> > > > > description as Ismael mentioned.
>> > > > >
>> > > > > I'm thinking about bringing this KIP to a vote soon! Let me know
>> if
>> > > there
>> > > > > are any other comments or questions.
>> > > > >
>> > > > > Thanks,
>> > > > > Justine
>> > > > >
>> > > > > On Tue, Aug 2, 2022 at 9:02 AM Jason Gustafson
>> > > > <ja...@confluent.io.invalid
>> > > > > >
>> > > > > wrote:
>> > > > >
>> > > > > > I agree with Ismael that we should remove -1. I think we tend to
>> > view
>> > > > the
>> > > > > > coupling of these behaviors into a single configuration as a
>> > mistake,
>> > > > so
>> > > > > > it's a little odd to keep it (even if in a weakened form).
>> > > > > >
>> > > > > > -Jason
>> > > > > >
>> > > > > > On Sat, Jul 30, 2022 at 7:37 AM Ismael Juma <ism...@juma.me.uk>
>> > > wrote:
>> > > > > >
>> > > > > > > I would remove -1 altogether. Two more comments:
>> > > > > > >
>> > > > > > > 1. The current description kind of leads people towards
>> aligning
>> > > the
>> > > > > > config
>> > > > > > > with delivery.timeout.ms. Is that what we want? We could say
>> it
>> > > > should
>> > > > > > be
>> > > > > > > higher than delivery.timeout.ms but indicate that the
>> default is
>> > > > > usually
>> > > > > > > fine. The main reason to reduce it would be to save memory, I
>> > > guess.
>> > > > > > > 2. Each config has a priority, we should specify it for this
>> one.
>> > > I'm
>> > > > > > > assuming it will be "low".
>> > > > > > >
>> > > > > > > Ismael
>> > > > > > >
>> > > > > > > On Fri, Jul 29, 2022 at 2:38 PM Justine Olshan
>> > > > > > > <jols...@confluent.io.invalid>
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > Hi all,
>> > > > > > > >
>> > > > > > > > I've updated the KIP to include the new default of 1 day and
>> > > > > > information
>> > > > > > > > about -1 in the description of the config.
>> > > > > > > > I wonder though if including -1 makes sense now that it is
>> not
>> > > the
>> > > > > > > default
>> > > > > > > > value. Is there a benefit for manually setting -1 vs
>> manually
>> > > > setting
>> > > > > > the
>> > > > > > > > value that transactional.id.expiration.ms has?
>> > > > > > > >
>> > > > > > > > Let me know your thoughts.
>> > > > > > > > Thanks,
>> > > > > > > > Justine
>> > > > > > > >
>> > > > > > > > On Fri, Jul 29, 2022 at 10:54 AM Ismael Juma <
>> > ism...@juma.me.uk>
>> > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > +1 for having 1 day as the default and for including this
>> > > change
>> > > > in
>> > > > > > the
>> > > > > > > > > release notes.
>> > > > > > > > >
>> > > > > > > > > Ismael
>> > > > > > > > >
>> > > > > > > > > On Wed, Jul 27, 2022 at 9:16 AM Jason Gustafson
>> > > > > > > > <ja...@confluent.io.invalid
>> > > > > > > > > >
>> > > > > > > > > wrote:
>> > > > > > > > >
>> > > > > > > > > > I don't think a major release is a requirement for a
>> > default
>> > > > > change
>> > > > > > > for
>> > > > > > > > > > what it's worth. I do appreciate that there is a
>> preference
>> > > for
>> > > > > not
>> > > > > > > > > rocking
>> > > > > > > > > > the boat though. For a little bit of background here,
>> the
>> > > > problem
>> > > > > > we
>> > > > > > > > > > have encountered in production since the idempotent
>> > producer
>> > > > > became
>> > > > > > > the
>> > > > > > > > > > default is OOM errors due to huge numbers of producerIds
>> > that
>> > > > had
>> > > > > > to
>> > > > > > > be
>> > > > > > > > > > retained in the cache for 7 days. It is hard to prevent
>> use
>> > > > cases
>> > > > > > > from
>> > > > > > > > > > emerging where producers are used and discarded
>> rapidly. We
>> > > > will
>> > > > > be
>> > > > > > > > > using a
>> > > > > > > > > > lower value for sure, but it would also be nice to
>> reduce
>> > the
>> > > > > > > > likelihood
>> > > > > > > > > > for the community to see this problem. The benefit of
>> the
>> > > > caching
>> > > > > > > > > > diminishes quickly over time since it is primarily
>> meant to
>> > > > > handle
>> > > > > > > > > producer
>> > > > > > > > > > retry windows. I do not think there is much difference
>> > > between
>> > > > 1
>> > > > > > days
>> > > > > > > > > and 7
>> > > > > > > > > > days from an application perspective, but it is a huge
>> > > > difference
>> > > > > > for
>> > > > > > > > the
>> > > > > > > > > > broker's memory usage.
>> > > > > > > > > >
>> > > > > > > > > > Best,
>> > > > > > > > > > Jason
>> > > > > > > > > >
>> > > > > > > > > > On Wed, Jul 27, 2022 at 2:57 AM Sagar <
>> > > > sagarmeansoc...@gmail.com
>> > > > > >
>> > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > > Thanks Justine for the KIP. I think it might be
>> better to
>> > > > > > document
>> > > > > > > > the
>> > > > > > > > > > > correlation between the new config and
>> > delivery.timeout.ms
>> > > > in
>> > > > > > the
>> > > > > > > > > Public
>> > > > > > > > > > > Interfaces Description.
>> > > > > > > > > > >
>> > > > > > > > > > > Also, I agree with Luke that for now setting a
>> default to
>> > > -1
>> > > > > > should
>> > > > > > > > be
>> > > > > > > > > > > good. We can look to switch to 1 day with major
>> release.
>> > > > > > > > > > >
>> > > > > > > > > > > Thanks!
>> > > > > > > > > > > Sagar.
>> > > > > > > > > > >
>> > > > > > > > > > > On Wed, Jul 27, 2022 at 9:05 AM Luke Chen <
>> > > show...@gmail.com
>> > > > >
>> > > > > > > wrote:
>> > > > > > > > > > >
>> > > > > > > > > > > > Hi Justine,
>> > > > > > > > > > > >
>> > > > > > > > > > > > Thanks for the KIP.
>> > > > > > > > > > > > I agree with you that we should try our best to keep
>> > > > backward
>> > > > > > > > > > > > compatibility, although our intention is to have
>> lower
>> > > > > producer
>> > > > > > > id
>> > > > > > > > > > > > expiration timeout.
>> > > > > > > > > > > > So, I think we should keep default to -1 IMO.
>> > > > > > > > > > > > Maybe we change the default to 1 day in next major
>> > > release
>> > > > > > (4.0)?
>> > > > > > > > > > > >
>> > > > > > > > > > > > Thank you.
>> > > > > > > > > > > > Luke
>> > > > > > > > > > > >
>> > > > > > > > > > > > On Wed, Jul 27, 2022 at 7:13 AM Justine Olshan
>> > > > > > > > > > > > <jols...@confluent.io.invalid>
>> > > > > > > > > > > > wrote:
>> > > > > > > > > > > >
>> > > > > > > > > > > > > Thanks for taking a look Jason!
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > I wondered if we wanted to have a smaller default
>> but
>> > > > > wasn't
>> > > > > > > sure
>> > > > > > > > > > about
>> > > > > > > > > > > > the
>> > > > > > > > > > > > > compatibility story -- especially since there is
>> the
>> > > > chance
>> > > > > > for
>> > > > > > > > > > > producer
>> > > > > > > > > > > > > IDs to expire silently.
>> > > > > > > > > > > > > I do think that 1 day is fairly reasonable. If I
>> > don't
>> > > > hear
>> > > > > > any
>> > > > > > > > > > > > conflicting
>> > > > > > > > > > > > > opinions I can go ahead and update the default.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Justine
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > On Tue, Jul 26, 2022 at 12:23 PM Jason Gustafson
>> > > > > > > > > > > > > <ja...@confluent.io.invalid>
>> > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > > Hi Justine,
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > Thanks for the KIP. Although I hate seeing new
>> > > > > > > configurations,
>> > > > > > > > I
>> > > > > > > > > > > think
>> > > > > > > > > > > > > this
>> > > > > > > > > > > > > > is a good change. Combining these timeout
>> behaviors
>> > > > into
>> > > > > a
>> > > > > > > > single
>> > > > > > > > > > > > > > configuration was definitely a mistake, but we
>> > didn't
>> > > > > > > > anticipate
>> > > > > > > > > > the
>> > > > > > > > > > > > > > problem with the producer id cache. I do wonder
>> if
>> > we
>> > > > can
>> > > > > > > make
>> > > > > > > > > the
>> > > > > > > > > > > > > default
>> > > > > > > > > > > > > > a bit lower to reduce the chances that users
>> will
>> > hit
>> > > > the
>> > > > > > > same
>> > > > > > > > > > memory
>> > > > > > > > > > > > > > issues we have seen. After decoupling this
>> > > > configuration
>> > > > > > from
>> > > > > > > > > > > > > > transactional.id.expiration.ms, the new timeout
>> > just
>> > > > > needs
>> > > > > > > to
>> > > > > > > > > > cover
>> > > > > > > > > > > > the
>> > > > > > > > > > > > > > longest duration that a producer might be
>> retrying
>> > > the
>> > > > > same
>> > > > > > > > > Produce
>> > > > > > > > > > > > > > request. 7 days seems too high. Although I
>> think it
>> > > > could
>> > > > > > go
>> > > > > > > a
>> > > > > > > > > fair
>> > > > > > > > > > > > even
>> > > > > > > > > > > > > > lower, perhaps 1 day is a reasonable place to
>> > start?
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > Thanks,
>> > > > > > > > > > > > > > Jason
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > On Mon, Jul 25, 2022 at 9:25 AM Justine Olshan
>> > > > > > > > > > > > > > <jols...@confluent.io.invalid>
>> > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Hey Bill,
>> > > > > > > > > > > > > > > Thanks! I was just going to say that hopefully
>> > > > > > > > > > > > > > > transactional.id.expiration.ms would also be
>> > over
>> > > > the
>> > > > > > > > delivery
>> > > > > > > > > > > > > timeout.
>> > > > > > > > > > > > > > :)
>> > > > > > > > > > > > > > > Thanks for the +1!
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Justine
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > On Mon, Jul 25, 2022 at 9:17 AM Bill Bejeck <
>> > > > > > > > bbej...@gmail.com
>> > > > > > > > > >
>> > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > Hi Justine,
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > I just took another look at the KIP, and I
>> > > realize
>> > > > my
>> > > > > > > > > > > > > > question/suggestion
>> > > > > > > > > > > > > > > > about default values has already been
>> addressed
>> > > in
>> > > > > the
>> > > > > > > > > > > > > `Compatibility`
>> > > > > > > > > > > > > > > > section.
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > I'm +1 on the KIP.
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > -Bill
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > On Thu, Jul 21, 2022 at 6:20 PM Bill Bejeck
>> <
>> > > > > > > > > bbej...@gmail.com
>> > > > > > > > > > >
>> > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Hi Justine,
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Thanks for the well written KIP, this
>> looks
>> > > like
>> > > > it
>> > > > > > > will
>> > > > > > > > > be a
>> > > > > > > > > > > > > useful
>> > > > > > > > > > > > > > > > > addition.
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Overall the KIP looks good to me, I have
>> one
>> > > > > > > > > > question/comment.
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > You mentioned that setting the `
>> > > > > > > > producer.id.expiration.ms`
>> > > > > > > > > > > less
>> > > > > > > > > > > > > than
>> > > > > > > > > > > > > > > the
>> > > > > > > > > > > > > > > > > delivery timeout could lead to duplicates,
>> > > which
>> > > > > > makes
>> > > > > > > > > sense.
>> > > > > > > > > > > To
>> > > > > > > > > > > > > > help
>> > > > > > > > > > > > > > > > > avoid this situation, do we want to
>> consider
>> > a
>> > > > > > default
>> > > > > > > > > value
>> > > > > > > > > > > that
>> > > > > > > > > > > > > is
>> > > > > > > > > > > > > > > the
>> > > > > > > > > > > > > > > > > same as the delivery timeout?
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Thanks again for the KIP.
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Bill
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > On Thu, Jul 21, 2022 at 4:54 PM Justine
>> > Olshan
>> > > > > > > > > > > > > > > > > <jols...@confluent.io.invalid> wrote:
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > >> Hey all!
>> > > > > > > > > > > > > > > > >>
>> > > > > > > > > > > > > > > > >> I'd like to start a discussion on my
>> > proposal
>> > > to
>> > > > > > > > separate
>> > > > > > > > > > > > > time-based
>> > > > > > > > > > > > > > > > >> producer ID expiration from
>> transactional ID
>> > > > > > > expiration
>> > > > > > > > by
>> > > > > > > > > > > > > > > introducing a
>> > > > > > > > > > > > > > > > >> new configuration.
>> > > > > > > > > > > > > > > > >>
>> > > > > > > > > > > > > > > > >> The KIP Is pretty small and simple, but
>> will
>> > > be
>> > > > > > > helpful
>> > > > > > > > in
>> > > > > > > > > > > > > > controlling
>> > > > > > > > > > > > > > > > >> memory usage in brokers -- especially now
>> > that
>> > > > by
>> > > > > > > > default
>> > > > > > > > > > > > > producers
>> > > > > > > > > > > > > > > are
>> > > > > > > > > > > > > > > > >> idempotent and create producer ID state.
>> > > > > > > > > > > > > > > > >>
>> > > > > > > > > > > > > > > > >> Please take a look and leave any comments
>> > you
>> > > > may
>> > > > > > > have!
>> > > > > > > > > > > > > > > > >>
>> > > > > > > > > > > > > > > > >> KIP:
>> > > > > > > > > > > > > > > > >>
>> > > > > > > > > > > > > > > > >>
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-854+Separate+configuration+for+producer+ID+expiry
>> > > > > > > > > > > > > > > > >> JIRA:
>> > > > > > > https://issues.apache.org/jira/browse/KAFKA-14097
>> > > > > > > > > > > > > > > > >>
>> > > > > > > > > > > > > > > > >> Thanks!
>> > > > > > > > > > > > > > > > >> Justine
>> > > > > > > > > > > > > > > > >>
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>

Reply via email to