I would say I have first-hand knowledge of this issue, as someone who has
responded to such incidents as part of my work at Confluent over the past
couple of years. :)

> We only persist the information for the length of time we retain
> snapshots.

This seems a bit contradictory to me. We are going to persist (potentially)
useless information if we have no signal that the producer is still active.
This is the problem we have with old clients. We are always going to have
to draw a line between how long we allow a producer to have a gap in
producing and how long we let the cache fill with short-lived producers
that risk OOM.

With an LRU cache, we run into the same problem, as we will expire all
"well-behaved" infrequent producers that last produced before the burst of
short-lived clients. The benefit is that we don't have a solid line in the
sand and we only expire when we need to, but we will still risk expiring
active producers.
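
To make that failure mode concrete, here is a minimal sketch (illustrative
names and sizes, not the broker's actual producer-state code) of how an
access-ordered LRU behaves under a burst of short-lived producers:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class PidLruSketch {
        public static void main(String[] args) {
            final int capacity = 4;
            // Access-ordered LinkedHashMap used as an LRU keyed by PID.
            Map<Long, String> cache =
                new LinkedHashMap<Long, String>(16, 0.75f, true) {
                    @Override
                    protected boolean removeEldestEntry(
                            Map.Entry<Long, String> eldest) {
                        return size() > capacity; // evict least recently used
                    }
                };

            cache.put(1L, "well-behaved, infrequent producer");
            // A burst of short-lived producers fills the cache...
            for (long pid = 100; pid < 104; pid++) {
                cache.put(pid, "short-lived producer");
            }
            // ...and the quiet-but-still-active producer is gone.
            System.out.println(cache.containsKey(1L)); // prints: false
        }
    }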

I am willing to discuss some solutions that work with older clients, but my
concern is spending too much time on a complicated solution and not
encouraging movement to newer and better clients.

Justine

On Mon, May 20, 2024 at 9:35 AM Claude Warren <cla...@xenei.com> wrote:

> >
> >  Why should we persist useless information
> > for clients that are long gone and will never use it?
>
>
> We are not.  We only persist the information for the length of time we
> retain snapshots.  The change here is to make the snapshots work as
> longer-term storage for infrequent producers and others who would be
> negatively affected by some of the solutions proposed.
>
> Your changes require changes in the clients.  Older clients will not be
> able to participate.  My change does not require any client changes.
> There are also issues beyond the ones discussed here.  I was told of this
> late last week.  I will endeavor to find someone with first-hand knowledge
> of the issue and have them report on this thread.
>
> In addition, the use of an LRU amortizes the cache cleanup, so we don't
> need a thread to expire things.  You still have the cache; the point is
> that it really is a cache, with storage behind it.  Let the cache be a
> cache, and let the snapshots be the storage backing it.
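>
> To make the shape of that concrete, a rough sketch (hypothetical names,
> not actual broker code) of an LRU fronting snapshot-backed storage:
>
>     import java.util.LinkedHashMap;
>     import java.util.Map;
>     import java.util.function.LongFunction;
>
>     // Eviction is amortized into put() (no expiration thread), and a
>     // cache miss falls through to the snapshots, which stay the source
>     // of truth.
>     class SnapshotBackedPidCache<V> {
>         private final Map<Long, V> lru;
>         private final LongFunction<V> snapshotLookup; // backing storage
>
>         SnapshotBackedPidCache(int capacity, LongFunction<V> snapshotLookup) {
>             this.snapshotLookup = snapshotLookup;
>             this.lru = new LinkedHashMap<Long, V>(16, 0.75f, true) {
>                 @Override
>                 protected boolean removeEldestEntry(Map.Entry<Long, V> e) {
>                     return size() > capacity; // inline, amortized eviction
>                 }
>             };
>         }
>
>         V get(long pid) {
>             V state = lru.get(pid);
>             if (state == null) {
>                 state = snapshotLookup.apply(pid); // read from snapshots
>                 if (state != null) {
>                     lru.put(pid, state); // repopulate; may evict another
>                 }
>             }
>             return state;
>         }
>     }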
>
> On Fri, May 17, 2024 at 5:26 PM Justine Olshan
> <jols...@confluent.io.invalid>
> wrote:
>
> > Respectfully, I don't agree. Why should we persist useless information
> > for clients that are long gone and will never use it?
> > This is why I'm suggesting we do something smarter when it comes to
> > storing data, and only store data we actually need and have a use for.
> >
> > This is why I suggest the heartbeat. It gives us clear information (up
> > to the heartbeat interval) about which producers are worth keeping and
> > which are not.
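> >
> > As a sketch of what that check could look like (hypothetical method
> > name, not an actual protocol proposal), expiry becomes an unambiguous
> > comparison:
> >
> >     // Retain a PID only if we have seen a heartbeat recently;
> >     // tolerate a couple of missed intervals before dropping it.
> >     static boolean shouldRetain(long lastHeartbeatMs, long nowMs,
> >                                 long heartbeatIntervalMs) {
> >         return nowMs - lastHeartbeatMs <= 2 * heartbeatIntervalMs;
> >     }
> >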
> > I'm not in favor of building a new and complicated system to try to
> > guess which information is needed. In my mind, if we have a ton of
> > legitimately active producers, we should scale up memory. If we don't,
> > there is no reason to have high memory usage.
> >
> > Fixing the client also allows us to fix some of the other issues we have
> > with idempotent producers.
> >
> > Justine
> >
> > On Fri, May 17, 2024 at 12:46 AM Claude Warren <cla...@xenei.com> wrote:
> >
> > > I think the point here is that a design that assumes you can keep
> > > all the PIDs in memory for all server configurations, all usages,
> > > and all client implementations is fraught with danger.
> > >
> > > Yes, there are solutions already in place (KIP-854) that attempt to
> > > address this problem, and other proposed solutions that have
> > > undesirable side effects (e.g. a heartbeat interrupted by an IP
> > > failure for a slow producer with a long delay between posts).
> > > KAFKA-16229 (Slow expiration of Producer IDs leading to high CPU
> > > usage) dealt with how to expire data from the cache so that there
> > > was minimal lag time.
> > >
> > > But the net issue is still the underlying design/architecture.
> > >
> > > There are a couple of salient points here:
> > >
> > >    - The state of a state machine is only a view on its transactions.
> > >      This is the classic stream/table dichotomy.
> > >    - What the "cache" is trying to do is create that view.
> > >    - In some cases the size of the state exceeds the storage of the
> > >      cache and the systems fail.
> > >    - The current solutions have attempted to place limits on the size
> > >      of the state.
> > >    - Errors in implementation and/or configuration will eventually
> > >      lead to "problem producers".
> > >    - Under the adopted fixes and current slate of proposals, the
> > >      "problem producers" solutions have cascading side effects on
> > >      properly behaved producers (e.g. dropping long-running,
> > >      slow-producing producers).
> > >
> > > For decades (at least since the 1980's and anecdotally since the
> > > 1960's) there has been a solution to processing state where the size
> > > of the state exceeded the memory available.  It is the solution that
> > > drove the idea that you could have tables in Kafka.  The idea that
> > > we can store the hot PIDs in memory using an LRU and write data to
> > > storage so that we can quickly find things not in the cache is not
> > > new.  It has been proven.
> > >
> > > I am arguing that we should not throw away state data because we are
> > > running out of memory.  We should persist that data to disk and
> > > consider the disk as the source of truth for state.
> > >
> > > Claude
> > >
> > >
> > > On Wed, May 15, 2024 at 7:42 PM Justine Olshan
> > > <jols...@confluent.io.invalid>
> > > wrote:
> > >
> > > > +1 to the comment.
> > > >
> > > > > I still feel we are doing all of this only because of a few
> > > > > anti-pattern or misconfigured producers and not because we have
> > > > > “too many Producers”. I believe that implementing a producer
> > > > > heartbeat and removing short-lived PIDs from the cache if we
> > > > > don't receive a heartbeat would be simpler and a step in the
> > > > > right direction to improve the idempotent logic, and maybe try
> > > > > to make PIDs get reused between sessions, which would implement
> > > > > a real idempotent producer instead of an idempotent session. I
> > > > > admit this wouldn't help with old clients, but it will put us
> > > > > on the right path.
> > > >
> > > > This issue is very complicated and I appreciate the attention on it.
> > > > Hopefully we can find a good solution working together :)
> > > >
> > > > Justine
> > > >
> > > > On Wed, May 15, 2024 at 8:36 AM Omnia Ibrahim <o.g.h.ibra...@gmail.com>
> > > > wrote:
> > > >
> > > > > Also, in the rejected alternatives you listed an approved KIP,
> > > > > which is a bit confusing; can you move this to motivations
> > > > > instead?
> > > > >
> > > > > > On 15 May 2024, at 14:35, Claude Warren <cla...@apache.org> wrote:
> > > > > >
> > > > > > This is a proposal that should solve the OOM problem on the
> > > > > > servers without some of the other proposed KIPs being active.
> > > > > >
> > > > > > Full details in
> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1044%3A+A+proposal+to+change+idempotent+producer+--+server+implementation
> > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > LinkedIn: http://www.linkedin.com/in/claudewarren
> > >
> >
>
>
> --
> LinkedIn: http://www.linkedin.com/in/claudewarren
>
