Hi Justine, are you aware of anyone looking into such a new protocol at the moment?
> On 20 May 2024, at 18:00, Justine Olshan <jols...@confluent.io.INVALID> wrote:
>
> I would say I have first-hand knowledge of this issue as someone who
> responds to such incidents as part of my work at Confluent over the past
> couple of years. :)
>
>> We only persist the information for the length of time we retain
>> snapshots.
>
> This seems a bit contradictory to me. We are going to persist (potentially)
> useless information if we have no signal as to whether the producer is
> still active. This is the problem we have with old clients. We are always
> going to have to draw the line between how long we allow a producer to have
> a gap in producing and how long we allow filling up with short-lived
> producers that risk OOM.
>
> With an LRU cache, we run into the same problem, as we will expire all
> "well-behaved" infrequent producers that last produced before the burst of
> short-lived clients. The benefit is that we don't have a solid line in the
> sand and we only expire when we need to, but we will still risk expiring
> active producers.
>
> I am willing to discuss some solutions that work with older clients, but my
> concern is spending too much time on a complicated solution and not
> encouraging movement to newer and better clients.
>
> Justine
>
> On Mon, May 20, 2024 at 9:35 AM Claude Warren <cla...@xenei.com> wrote:
>
>>> Why should we persist useless information
>>> for clients that are long gone and will never use it?
>>
>> We are not. We only persist the information for the length of time we
>> retain snapshots. The change here is to make the snapshots work as
>> longer-term storage for infrequent producers and others who would be
>> negatively affected by some of the solutions proposed.
>>
>> Your changes require changes in the clients. Older clients will not be
>> able to participate. My change does not require client changes.
>> There are issues beyond the ones discussed. I was told of this late
>> last week. I will endeavor to find someone with first-hand knowledge of
>> the issue and have them report on this thread.
>>
>> In addition, the use of an LRU amortizes the cache cleanup, so we don't
>> need a thread to expire things. You still have the cache; the point is
>> that it really is a cache, with storage behind it. Let the cache be a
>> cache, and let the snapshots be the storage backing the cache.
>>
>> On Fri, May 17, 2024 at 5:26 PM Justine Olshan
>> <jols...@confluent.io.invalid> wrote:
>>
>>> Respectfully, I don't agree. Why should we persist useless information
>>> for clients that are long gone and will never use it?
>>> This is why I'm suggesting we do something smarter when it comes to
>>> storing data, and only store data we actually need and have a use for.
>>>
>>> This is why I suggest the heartbeat. It gives us clear information (up
>>> to the heartbeat interval) about which producers are worth keeping and
>>> which are not.
>>> I'm not in favor of building a new and complicated system to try to
>>> guess which information is needed. In my mind, if we have a ton of
>>> legitimately active producers, we should scale up memory. If we don't,
>>> there is no reason to have high memory usage.
>>>
>>> Fixing the client also allows us to fix some of the other issues we
>>> have with idempotent producers.
>>>
>>> Justine
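To make the heartbeat idea concrete, here is a rough sketch of what the
broker-side bookkeeping might look like. All names here are made up for
illustration; there is no producer heartbeat RPC in the protocol today, so
this assumes a new client-to-broker heartbeat (or treats any produce
request as one):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical broker-side tracker: a PID is only expired once it
    // has clearly missed its heartbeat window, rather than after a fixed
    // producer.id.expiration.ms regardless of client liveness.
    public class HeartbeatPidTracker {
        private final Map<Long, Long> lastSeenMs = new ConcurrentHashMap<>();
        private final long heartbeatTimeoutMs;

        public HeartbeatPidTracker(long heartbeatTimeoutMs) {
            this.heartbeatTimeoutMs = heartbeatTimeoutMs;
        }

        // Called on each produce request or (hypothetical) heartbeat.
        public void recordActivity(long producerId, long nowMs) {
            lastSeenMs.put(producerId, nowMs);
        }

        // Expire only the PIDs that have stopped heartbeating.
        public void expireStale(long nowMs) {
            lastSeenMs.entrySet()
                      .removeIf(e -> nowMs - e.getValue() > heartbeatTimeoutMs);
        }
    }

With something like this, a slow but live producer keeps its state by
heartbeating, while short-lived producers fall out of the map one timeout
after they disappear.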
>>> On Fri, May 17, 2024 at 12:46 AM Claude Warren <cla...@xenei.com> wrote:
>>>
>>>> I think the point here is that a design which assumes you can keep all
>>>> the PIDs in memory for all server configurations, all usages, and all
>>>> client implementations is fraught with danger.
>>>>
>>>> Yes, there are solutions already in place (KIP-854) that attempt to
>>>> address this problem, and other proposed solutions have undesirable
>>>> side effects (e.g. a heartbeat interrupted by IP failure for a slow
>>>> producer with a long delay between posts). KAFKA-16229 (Slow expiration
>>>> of Producer IDs leading to high CPU usage) dealt with how to expire
>>>> data from the cache so that there was minimal lag time.
>>>>
>>>> But the net issue is still the underlying design/architecture.
>>>>
>>>> There are a couple of salient points here:
>>>>
>>>>    - The state of a state machine is only a view on its transactions.
>>>>    This is the classic stream/table dichotomy.
>>>>    - What the "cache" is trying to do is create that view.
>>>>    - In some cases the size of the state exceeds the storage of the
>>>>    cache and the systems fail.
>>>>    - The current solutions have attempted to place limits on the size
>>>>    of the state.
>>>>    - Errors in implementation and/or configuration will eventually
>>>>    lead to "problem producers".
>>>>    - Under the adopted fixes and the current slate of proposals, the
>>>>    "problem producer" solutions have cascading side effects on
>>>>    properly behaved producers (e.g. dropping long-running,
>>>>    slow-producing producers).
>>>>
>>>> For decades (at least since the 1980s, and anecdotally since the 1960s)
>>>> there has been a solution for processing state whose size exceeds the
>>>> memory available. It is the solution that drove the idea that you could
>>>> have tables in Kafka. The idea that we can store the hot PIDs in memory
>>>> using an LRU and write data to storage so that we can quickly find
>>>> things not in the cache is not new. It has been proven.
>>>>
>>>> I am arguing that we should not throw away state data because we are
>>>> running out of memory. We should persist that data to disk and consider
>>>> the disk to be the source of truth for state.
>>>>
>>>> Claude
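For what it's worth, here is a minimal sketch of the "cache over snapshots"
shape Claude describes, assuming a SnapshotStore interface standing in for
the snapshot files (none of these names come from KIP-1044 or the Kafka
code base):

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.Optional;

    // Bounded LRU of hot producer state; evicted entries are spilled to
    // the backing store instead of dropped, and a cache miss falls back
    // to the store, so the disk remains the source of truth.
    public class PidLruCache<S> {
        public interface SnapshotStore<T> {
            Optional<T> read(long producerId);
            void write(long producerId, T state);
        }

        private final SnapshotStore<S> store;
        private final LinkedHashMap<Long, S> hot;

        public PidLruCache(int maxHotEntries, SnapshotStore<S> store) {
            this.store = store;
            // accessOrder=true keeps least-recently-used entries first.
            this.hot = new LinkedHashMap<Long, S>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Long, S> eldest) {
                    if (size() > maxHotEntries) {
                        // Spill to disk rather than forgetting the producer.
                        store.write(eldest.getKey(), eldest.getValue());
                        return true;
                    }
                    return false;
                }
            };
        }

        public synchronized Optional<S> get(long producerId) {
            S s = hot.get(producerId);
            if (s != null) return Optional.of(s);
            // Miss: an evicted producer is slower to look up, but it is
            // never treated as unknown while a snapshot still has it.
            Optional<S> fromDisk = store.read(producerId);
            fromDisk.ifPresent(v -> hot.put(producerId, v));
            return fromDisk;
        }

        public synchronized void put(long producerId, S state) {
            hot.put(producerId, state);
        }
    }

Eviction here is amortized across normal puts (no background expiration
thread), which matches the LRU argument above. The open question from
Justine's side still applies: a burst of short-lived PIDs can push an
infrequent-but-live producer out of the hot set, leaving only the slower
disk path.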
>>>> On Wed, May 15, 2024 at 7:42 PM Justine Olshan
>>>> <jols...@confluent.io.invalid> wrote:
>>>>
>>>>> +1 to the comment.
>>>>>
>>>>>> I still feel we are doing all of this only because of a few
>>>>>> anti-pattern or misconfigured producers, and not because we have
>>>>>> "too many Producers". I believe that implementing a Producer
>>>>>> heartbeat and removing short-lived PIDs from the cache if we don't
>>>>>> receive a heartbeat would be simpler and a step in the right
>>>>>> direction to improve the idempotent logic, and maybe we could try
>>>>>> to make PIDs get reused between sessions, which would implement a
>>>>>> real idempotent producer instead of an idempotent session. I admit
>>>>>> this wouldn't help with old clients, but it would put us on the
>>>>>> right path.
>>>>>
>>>>> This issue is very complicated and I appreciate the attention on it.
>>>>> Hopefully we can find a good solution working together :)
>>>>>
>>>>> Justine
>>>>>
>>>>> On Wed, May 15, 2024 at 8:36 AM Omnia Ibrahim <o.g.h.ibra...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Also, in the rejected alternatives you listed an approved KIP, which
>>>>>> is a bit confusing. Can you move this to the motivation instead?
>>>>>>
>>>>>>> On 15 May 2024, at 14:35, Claude Warren <cla...@apache.org> wrote:
>>>>>>>
>>>>>>> This is a proposal that should solve the OOM problem on the servers
>>>>>>> without some of the other proposed KIPs being active.
>>>>>>>
>>>>>>> Full details in
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1044%3A+A+proposal+to+change+idempotent+producer+--+server+implementation
>>>>
>>>> --
>>>> LinkedIn: http://www.linkedin.com/in/claudewarren
>>
>> --
>> LinkedIn: http://www.linkedin.com/in/claudewarren
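Coming back to the "PIDs reused between sessions" idea quoted in Justine's
+1 above: purely as an illustration (for idempotent, non-transactional
producers, InitProducerId carries no stable client identity today, and
fencing and cleanup would need real design work), the broker-side piece
could be as small as:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical registry: a client that presents a stable identity
    // gets its previous producer id back on reconnect, so idempotence
    // spans sessions instead of ending when the session does.
    public class PidRegistry {
        private final Map<String, Long> pidByIdentity = new ConcurrentHashMap<>();
        private final AtomicLong nextPid = new AtomicLong();

        // Reuse the previous PID for this identity, or allocate a new one.
        public long initProducerId(String clientIdentity) {
            return pidByIdentity.computeIfAbsent(
                clientIdentity, id -> nextPid.getAndIncrement());
        }
    }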