My team has looked at it from a high level, but we haven't had the time to come up with a full proposal.
I'm not aware if others have worked on it.

Justine

On Mon, May 20, 2024 at 10:21 AM Omnia Ibrahim <o.g.h.ibra...@gmail.com> wrote:

Hi Justine, are you aware of anyone looking into such a new protocol at the moment?

On 20 May 2024, at 18:00, Justine Olshan <jols...@confluent.io.INVALID> wrote:

I would say I have first-hand knowledge of this issue as someone who responds to such incidents as part of my work at Confluent over the past couple of years. :)

> We only persist the information for the length of time we retain snapshots.

This seems a bit contradictory to me. We are going to persist (potentially) useless information if we have no signal as to whether the producer is still active. This is the problem we have with old clients. We are always going to have to draw the line between how long we allow a producer to have a gap in producing and how much we allow the cache to fill up with short-lived producers that risk OOM.

With an LRU cache, we run into the same problem, as we will expire all "well-behaved" infrequent producers that last produced before the burst of short-lived clients. The benefit is that we don't have a solid line in the sand and we only expire when we need to, but we will still risk expiring active producers.

I am willing to discuss some solutions that work with older clients, but my concern is spending too much time on a complicated solution and not encouraging movement to newer and better clients.

Justine

On Mon, May 20, 2024 at 9:35 AM Claude Warren <cla...@xenei.com> wrote:

> Why should we persist useless information for clients that are long gone and will never use it?

We are not. We only persist the information for the length of time we retain snapshots. The change here is to make the snapshots work as longer-term storage for infrequent producers and others who would be negatively affected by some of the solutions proposed.

Your changes require changes in the clients. Older clients will not be able to participate. My change does not require client changes. There are issues outside of the ones discussed. I was told of this late last week. I will endeavor to find someone with first-hand knowledge of the issue and have them report on this thread.

In addition, the use of an LRU amortizes the cache cleanup, so we don't need a thread to expire things. You still have the cache; the point is that it really is a cache, and there is storage behind it. Let the cache be a cache; let the snapshots be the storage backing behind the cache.

On Fri, May 17, 2024 at 5:26 PM Justine Olshan <jols...@confluent.io.invalid> wrote:

Respectfully, I don't agree. Why should we persist useless information for clients that are long gone and will never use it? This is why I'm suggesting we do something smarter when it comes to storing data and only store data we actually need and have a use for.

This is why I suggest the heartbeat. It gives us clear information (up to the heartbeat interval) about which producers are worth keeping and which are not. I'm not in favor of building a new and complicated system to try to guess which information is needed. In my mind, if we have a ton of legitimately active producers, we should scale up memory. If we don't, there is no reason to have high memory usage.

Fixing the client also allows us to fix some of the other issues we have with idempotent producers.

Justine
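[For concreteness, a minimal sketch of the heartbeat-driven expiry Justine describes: PIDs that miss their heartbeat window are dropped, so only demonstrably live producers are retained. All names here (HeartbeatPidTracker, etc.) are invented for illustration; this is not the actual broker code.]

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative sketch only: track the last heartbeat per producer ID
    // and expire any PID whose heartbeat is older than the timeout.
    public class HeartbeatPidTracker {
        private final Map<Long, Long> lastHeartbeatMs = new ConcurrentHashMap<>();
        private final long heartbeatTimeoutMs;

        public HeartbeatPidTracker(long heartbeatTimeoutMs) {
            this.heartbeatTimeoutMs = heartbeatTimeoutMs;
        }

        // Called when the broker sees a heartbeat (or a produce request) for a PID.
        public void recordHeartbeat(long producerId, long nowMs) {
            lastHeartbeatMs.put(producerId, nowMs);
        }

        // Drop every PID whose last heartbeat is outside the timeout window.
        public void expireStale(long nowMs) {
            lastHeartbeatMs.values().removeIf(last -> nowMs - last > heartbeatTimeoutMs);
        }
    }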
On Fri, May 17, 2024 at 12:46 AM Claude Warren <cla...@xenei.com> wrote:

I think that the point here is that a design that assumes you can keep all the PIDs in memory for all server configurations, all usages, and all client implementations is fraught with danger.

Yes, there are solutions already in place (KIP-854) that attempt to address this problem, and other proposed solutions have undesirable side effects (e.g. heartbeat interrupted by IP failure for a slow producer with a long delay between posts). KAFKA-16229 (Slow expiration of Producer IDs leading to high CPU usage) dealt with how to expire data from the cache so that there was minimal lag time.

But the net issue is still the underlying design/architecture.

There are a couple of salient points here:

- The state of a state machine is only a view on its transactions. This is the classic stream/table dichotomy.
- What the "cache" is trying to do is create that view.
- In some cases the size of the state exceeds the storage of the cache and the systems fail.
- The current solutions have attempted to place limits on the size of the state.
- Errors in implementation and/or configuration will eventually lead to "problem producers".
- Under the adopted fixes and the current slate of proposals, the "problem producer" solutions have cascading side effects on properly behaved producers (e.g. dropping long-running, slow-producing producers).

For decades (at least since the 1980s, and anecdotally since the 1960s) there has been a solution to processing state where the size of the state exceeds the memory available. It is the solution that drove the idea that you could have tables in Kafka. The idea that we can store the hot PIDs in memory using an LRU and write data to storage so that we can quickly find things not in the cache is not new. It has been proven.

I am arguing that we should not throw away state data because we are running out of memory. We should persist that data to disk and consider the disk the source of truth for state.

Claude

On Wed, May 15, 2024 at 7:42 PM Justine Olshan <jols...@confluent.io.invalid> wrote:

+1 to the comment.

> I still feel we are doing all of this only because of a few anti-pattern or misconfigured producers and not because we have "too many Producers". I believe that implementing a Producer heartbeat and removing short-lived PIDs from the cache if we didn't receive a heartbeat would be simpler and a step in the right direction to improve the idempotent logic, and maybe try to make the PID get reused between sessions, which would implement a real idempotent producer instead of an idempotent session. I admit this wouldn't help with old clients, but it will put us on the right path.

This issue is very complicated and I appreciate the attention on it. Hopefully we can find a good solution working together :)

Justine
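[For concreteness, a rough sketch of the cache-over-storage shape Claude argues for: an LRU of hot PID state whose evictions spill to a snapshot-backed store rather than being discarded, so a slow, infrequent producer's PID survives a burst of short-lived ones. The SnapshotStore interface and every name below are hypothetical illustrations, not Kafka's actual ProducerStateManager.]

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.Optional;

    // Hypothetical backing store: PID state persisted in snapshots on disk.
    interface SnapshotStore {
        void persist(long producerId, ProducerState state);
        Optional<ProducerState> load(long producerId);
    }

    // Placeholder for per-producer state (epoch, last sequence, etc.).
    record ProducerState(short epoch, int lastSequence) {}

    // Sketch of an LRU over snapshot storage: eviction spills state to disk
    // instead of dropping it, and a cache miss falls through to the store.
    class PidLruCache extends LinkedHashMap<Long, ProducerState> {
        private final int maxEntries;
        private final SnapshotStore store;

        PidLruCache(int maxEntries, SnapshotStore store) {
            super(16, 0.75f, true);   // accessOrder = true gives LRU behaviour
            this.maxEntries = maxEntries;
            this.store = store;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<Long, ProducerState> eldest) {
            if (size() > maxEntries) {
                store.persist(eldest.getKey(), eldest.getValue());  // spill, don't drop
                return true;
            }
            return false;
        }

        // Cache miss falls through to the snapshot store and re-warms the cache.
        ProducerState find(long producerId) {
            ProducerState state = get(producerId);
            if (state == null) {
                state = store.load(producerId).orElse(null);
                if (state != null) put(producerId, state);
            }
            return state;
        }
    }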
On Wed, May 15, 2024 at 8:36 AM Omnia Ibrahim <o.g.h.ibra...@gmail.com> wrote:

Also, in the rejected alternatives you listed an approved KIP, which is a bit confusing; can you move this to the motivation section instead?

On 15 May 2024, at 14:35, Claude Warren <cla...@apache.org> wrote:

This is a proposal that should solve the OOM problem on the servers without some of the other proposed KIPs being active.

Full details in
https://cwiki.apache.org/confluence/display/KAFKA/KIP-1044%3A+A+proposal+to+change+idempotent+producer+--+server+implementation

--
LinkedIn: http://www.linkedin.com/in/claudewarren