I think the point here is that any design which assumes you can keep all
the PIDs in memory, across all server configurations, usage patterns, and
client implementations, is fraught with danger.

Yes, there are solutions already in place (KIP-854) that attempt to address
this problem, and other proposed solutions that have undesirable side
effects (e.g. a slow producer with a long delay between posts having its
heartbeat interrupted by an IP failure).  KAFKA-16229 (Slow expiration of
Producer IDs leading to high CPU usage) dealt with how to expire data from
the cache so that there was minimal lag time.

But the net issue is still the underlying design/architecture.

There are a couple of salient points here:

   - The state of a state machine is only a view of its transactions.  This
   is the classic stream / table dichotomy.
   - What the "cache" is trying to do is create that view.
   - In some cases the size of the state exceeds the storage of the cache,
   and the system fails.
   - The current solutions have attempted to place limits on the size of
   the state.
   - Errors in implementation and/or configuration will eventually lead to
   "problem producers".
   - Under the adopted fixes and the current slate of proposals, the
   solutions for "problem producers" have cascading side effects on properly
   behaved producers (e.g. dropping long-running, slow-producing producers).
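
The first bullet, the stream / table duality, can be shown in a few lines:
a table is just a fold over its stream of events.  This is an illustrative
sketch only; the PIDs and sequence numbers are made up.

```python
# The "table" (state) is a materialized view over the "stream" (transactions).
events = [("pid-1", 0), ("pid-2", 0), ("pid-1", 1), ("pid-1", 2)]

state = {}                      # latest sequence number per producer ID
for pid, seq in events:        # replaying the stream rebuilds the view
    state[pid] = seq

# state is now {"pid-1": 2, "pid-2": 0}
```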

For decades (at least since the 1980s, and anecdotally since the 1960s)
there has been a solution to processing state whose size exceeds the
available memory.  It is the solution that drove the idea that you could
have tables in Kafka.  The idea that we can keep the hot PIDs in memory in
an LRU cache and write data to storage so that we can quickly find entries
not in the cache is not new.  It has been proven.

I am arguing that we should not throw away state data because we are
running out of memory.  We should persist that data to disk and treat the
disk as the source of truth for state.
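
A minimal sketch of that disk-as-source-of-truth idea, hypothetical and not
the KIP-1044 design itself: hot PIDs sit in a bounded in-memory LRU, every
write also goes to a persistent store, and a cache miss falls back to disk.
The class and a plain dict standing in for the on-disk index are my own
illustration.

```python
from collections import OrderedDict

class DiskBackedPidCache:
    """Hot PIDs in a bounded in-memory LRU; evicted entries survive in a
    persistent store, which remains the source of truth for state."""

    def __init__(self, capacity, disk_store=None):
        self.capacity = capacity
        self.hot = OrderedDict()   # PID -> producer state, in recency order
        # A real implementation would use an on-disk index; a dict stands in.
        self.disk = disk_store if disk_store is not None else {}

    def put(self, pid, state):
        self.disk[pid] = state           # disk is always written first
        self.hot[pid] = state
        self.hot.move_to_end(pid)        # mark as most recently used
        if len(self.hot) > self.capacity:
            self.hot.popitem(last=False) # evict coldest; data stays on disk

    def get(self, pid):
        if pid in self.hot:
            self.hot.move_to_end(pid)    # refresh recency on a hit
            return self.hot[pid]
        state = self.disk.get(pid)       # cache miss: fall back to disk
        if state is not None:
            self.put(pid, state)         # promote back into the hot set
        return state
```

Evicting from the LRU then only sheds memory; it never loses producer
state, so a slow producer that returns after a long pause is still found.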

Claude


On Wed, May 15, 2024 at 7:42 PM Justine Olshan <jols...@confluent.io.invalid>
wrote:

> +1 to the comment.
>
> > I still feel we are doing all of this only because of a few anti-pattern
> or misconfigured producers and not because we have "too many producers".  I
> believe that implementing a producer heartbeat and removing short-lived
> PIDs from the cache if we don't receive a heartbeat would be simpler and a
> step in the right direction toward improving the idempotent logic, and
> maybe we could try to make PIDs get reused between sessions, which would
> implement a truly idempotent producer instead of an idempotent session.  I
> admit this wouldn't help with old clients, but it would put us on the
> right path.
>
> This issue is very complicated and I appreciate the attention on it.
> Hopefully we can find a good solution working together :)
>
> Justine
>
> On Wed, May 15, 2024 at 8:36 AM Omnia Ibrahim <o.g.h.ibra...@gmail.com>
> wrote:
>
> > Also, in the rejected alternatives you listed an approved KIP, which is
> > a bit confusing; can you move this to the motivation section instead?
> >
> > > On 15 May 2024, at 14:35, Claude Warren <cla...@apache.org> wrote:
> > >
> > > This is a proposal that should solve the OOM problem on the servers
> > without
> > > some of the other proposed KIPs being active.
> > >
> > > Full details in
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1044%3A+A+proposal+to+change+idempotent+producer+--+server+implementation
> >
> >
>


-- 
LinkedIn: http://www.linkedin.com/in/claudewarren
