Hi Justine, are you aware of anyone looking into such a new protocol at the moment?
> On 20 May 2024, at 18:00, Justine Olshan <jols...@confluent.io.INVALID> wrote:
>
> I would say I have first-hand knowledge of this issue as someone who
> responds to such incidents as part of my work at Confluent over the past
> couple of years. :)
>
>> We only persist the information for the length of time we retain
>> snapshots.
>
> This seems a bit contradictory to me. We are going to persist (potentially)
> useless information if we have no signal as to whether the producer is
> still active. This is the problem we have with old clients. We are always
> going to have to draw the line between how long we allow a producer to have
> a gap in producing and how long we allow filling up with short-lived
> producers that risk OOM.
>
> With an LRU cache, we run into the same problem, as we will expire all
> "well-behaved" infrequent producers that last produced before the burst of
> short-lived clients. The benefit is that we don't have a solid line in the
> sand and we only expire when we need to, but we will still risk expiring
> active producers.
>
> I am willing to discuss some solutions that work with older clients, but my
> concern is spending too much time on a complicated solution and not
> encouraging movement to newer and better clients.
>
> Justine
>
> On Mon, May 20, 2024 at 9:35 AM Claude Warren <cla...@xenei.com> wrote:
>
>>> Why should we persist useless information
>>> for clients that are long gone and will never use it?
>>
>> We are not. We only persist the information for the length of time we
>> retain snapshots. The change here is to make the snapshots work as
>> longer-term storage for infrequent producers and others who would be
>> negatively affected by some of the solutions proposed.
>>
>> Your changes require changes in the clients. Older clients will not be
>> able to participate. My change does not require client changes.
>> There are issues beyond the ones discussed. I was told of this late
>> last week. I will endeavor to find someone with first-hand knowledge of
>> the issue and have them report on this thread.
>>
>> In addition, the use of an LRU amortizes the cache cleanup, so we don't
>> need a thread to expire things. You still have the cache; the point is
>> that it really is a cache, with storage behind it. Let the cache be a
>> cache, and let the snapshots be the storage backing the cache.
>>
>> On Fri, May 17, 2024 at 5:26 PM Justine Olshan
>> <jols...@confluent.io.invalid> wrote:
>>
>>> Respectfully, I don't agree. Why should we persist useless information
>>> for clients that are long gone and will never use it?
>>> This is why I'm suggesting we do something smarter when it comes to
>>> storing data, and only store data we actually need and have a use for.
>>>
>>> This is why I suggest the heartbeat. It gives us clear information (up
>>> to the heartbeat interval) about which producers are worth keeping and
>>> which are not.
>>> I'm not in favor of building a new and complicated system to try to
>>> guess which information is needed. In my mind, if we have a ton of
>>> legitimately active producers, we should scale up memory. If we don't,
>>> there is no reason to have high memory usage.
>>>
>>> Fixing the client also allows us to fix some of the other issues we
>>> have with idempotent producers.
>>>
>>> Justine
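To make the heartbeat idea concrete, here is a rough sketch of what the
broker-side bookkeeping might look like. All names here are made up for
illustration; there is no producer heartbeat RPC in the protocol today, so
this assumes a new client-to-broker heartbeat (or treats any produce
request as one):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical broker-side tracker: a PID is only expired once it
    // has clearly missed its heartbeat window, rather than after a fixed
    // producer.id.expiration.ms regardless of client liveness.
    public class HeartbeatPidTracker {
        private final Map<Long, Long> lastSeenMs = new ConcurrentHashMap<>();
        private final long heartbeatTimeoutMs;

        public HeartbeatPidTracker(long heartbeatTimeoutMs) {
            this.heartbeatTimeoutMs = heartbeatTimeoutMs;
        }

        // Called on each produce request or (hypothetical) heartbeat.
        public void recordActivity(long producerId, long nowMs) {
            lastSeenMs.put(producerId, nowMs);
        }

        // Expire only the PIDs that have stopped heartbeating.
        public void expireStale(long nowMs) {
            lastSeenMs.entrySet()
                      .removeIf(e -> nowMs - e.getValue() > heartbeatTimeoutMs);
        }
    }

With something like this, a slow but live producer keeps its state by
heartbeating, while short-lived producers fall out of the map one timeout
after they disappear.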
>>> On Fri, May 17, 2024 at 12:46 AM Claude Warren <cla...@xenei.com> wrote:
>>>
>>>> I think the point here is that a design which assumes you can keep all
>>>> the PIDs in memory for all server configurations, all usages, and all
>>>> client implementations is fraught with danger.
>>>>
>>>> Yes, there are solutions already in place (KIP-854) that attempt to
>>>> address this problem, and other proposed solutions have undesirable
>>>> side effects (e.g. a heartbeat interrupted by IP failure for a slow
>>>> producer with a long delay between posts). KAFKA-16229 (Slow expiration
>>>> of Producer IDs leading to high CPU usage) dealt with how to expire
>>>> data from the cache so that there was minimal lag time.
>>>>
>>>> But the net issue is still the underlying design/architecture.
>>>>
>>>> There are a couple of salient points here:
>>>>
>>>>    - The state of a state machine is only a view on its transactions.
>>>>    This is the classic stream/table dichotomy.
>>>>    - What the "cache" is trying to do is create that view.
>>>>    - In some cases the size of the state exceeds the storage of the
>>>>    cache and the systems fail.
>>>>    - The current solutions have attempted to place limits on the size
>>>>    of the state.
>>>>    - Errors in implementation and/or configuration will eventually
>>>>    lead to "problem producers".
>>>>    - Under the adopted fixes and the current slate of proposals, the
>>>>    "problem producer" solutions have cascading side effects on
>>>>    properly behaved producers (e.g. dropping long-running,
>>>>    slow-producing producers).
>>>>
>>>> For decades (at least since the 1980s, and anecdotally since the 1960s)
>>>> there has been a solution for processing state whose size exceeds the
>>>> memory available. It is the solution that drove the idea that you could
>>>> have tables in Kafka. The idea that we can store the hot PIDs in memory
>>>> using an LRU and write data to storage so that we can quickly find
>>>> things not in the cache is not new. It has been proven.
>>>>
>>>> I am arguing that we should not throw away state data because we are
>>>> running out of memory. We should persist that data to disk and consider
>>>> the disk to be the source of truth for state.
>>>>
>>>> Claude
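For what it's worth, here is a minimal sketch of the "cache over snapshots"
shape Claude describes, assuming a SnapshotStore interface standing in for
the snapshot files (none of these names come from KIP-1044 or the Kafka
code base):

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.Optional;

    // Bounded LRU of hot producer state; evicted entries are spilled to
    // the backing store instead of dropped, and a cache miss falls back
    // to the store, so the disk remains the source of truth.
    public class PidLruCache<S> {
        public interface SnapshotStore<T> {
            Optional<T> read(long producerId);
            void write(long producerId, T state);
        }

        private final SnapshotStore<S> store;
        private final LinkedHashMap<Long, S> hot;

        public PidLruCache(int maxHotEntries, SnapshotStore<S> store) {
            this.store = store;
            // accessOrder=true keeps least-recently-used entries first.
            this.hot = new LinkedHashMap<Long, S>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Long, S> eldest) {
                    if (size() > maxHotEntries) {
                        // Spill to disk rather than forgetting the producer.
                        store.write(eldest.getKey(), eldest.getValue());
                        return true;
                    }
                    return false;
                }
            };
        }

        public synchronized Optional<S> get(long producerId) {
            S s = hot.get(producerId);
            if (s != null) return Optional.of(s);
            // Miss: an evicted producer is slower to look up, but it is
            // never treated as unknown while a snapshot still has it.
            Optional<S> fromDisk = store.read(producerId);
            fromDisk.ifPresent(v -> hot.put(producerId, v));
            return fromDisk;
        }

        public synchronized void put(long producerId, S state) {
            hot.put(producerId, state);
        }
    }

Eviction here is amortized across normal puts (no background expiration
thread), which matches the LRU argument above. The open question from
Justine's side still applies: a burst of short-lived PIDs can push an
infrequent-but-live producer out of the hot set, leaving only the slower
disk path.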
>>>> On Wed, May 15, 2024 at 7:42 PM Justine Olshan
>>>> <jols...@confluent.io.invalid> wrote:
>>>>
>>>>> +1 to the comment.
>>>>>
>>>>>> I still feel we are doing all of this only because of a few
>>>>>> anti-pattern or misconfigured producers, and not because we have
>>>>>> "too many Producers". I believe that implementing a Producer
>>>>>> heartbeat and removing short-lived PIDs from the cache if we don't
>>>>>> receive a heartbeat would be simpler and a step in the right
>>>>>> direction to improve the idempotent logic, and maybe we could try
>>>>>> to make PIDs get reused between sessions, which would implement a
>>>>>> real idempotent producer instead of an idempotent session. I admit
>>>>>> this wouldn't help with old clients, but it would put us on the
>>>>>> right path.
>>>>>
>>>>> This issue is very complicated and I appreciate the attention on it.
>>>>> Hopefully we can find a good solution working together :)
>>>>>
>>>>> Justine
>>>>>
>>>>> On Wed, May 15, 2024 at 8:36 AM Omnia Ibrahim <o.g.h.ibra...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Also, in the rejected alternatives you listed an approved KIP, which
>>>>>> is a bit confusing. Can you move this to the motivation instead?
>>>>>>
>>>>>>> On 15 May 2024, at 14:35, Claude Warren <cla...@apache.org> wrote:
>>>>>>>
>>>>>>> This is a proposal that should solve the OOM problem on the servers
>>>>>>> without some of the other proposed KIPs being active.
>>>>>>>
>>>>>>> Full details in
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1044%3A+A+proposal+to+change+idempotent+producer+--+server+implementation
>>>>
>>>> --
>>>> LinkedIn: http://www.linkedin.com/in/claudewarren
>>
>> --
>> LinkedIn: http://www.linkedin.com/in/claudewarren
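Coming back to the "PIDs reused between sessions" idea quoted in Justine's
+1 above: purely as an illustration (for idempotent, non-transactional
producers, InitProducerId carries no stable client identity today, and
fencing and cleanup would need real design work), the broker-side piece
could be as small as:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical registry: a client that presents a stable identity
    // gets its previous producer id back on reconnect, so idempotence
    // spans sessions instead of ending when the session does.
    public class PidRegistry {
        private final Map<String, Long> pidByIdentity = new ConcurrentHashMap<>();
        private final AtomicLong nextPid = new AtomicLong();

        // Reuse the previous PID for this identity, or allocate a new one.
        public long initProducerId(String clientIdentity) {
            return pidByIdentity.computeIfAbsent(
                clientIdentity, id -> nextPid.getAndIncrement());
        }
    }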