My team has looked at it from a high level, but we haven't had the time to come up with a full proposal.
I'm not aware if others have worked on it.

Justine

On Mon, May 20, 2024 at 10:21 AM Omnia Ibrahim <o.g.h.ibra...@gmail.com> wrote:

Hi Justine, are you aware of anyone looking into such a new protocol at the moment?

On 20 May 2024, at 18:00, Justine Olshan <jols...@confluent.io.INVALID> wrote:

I would say I have first-hand knowledge of this issue as someone who responds to such incidents as part of my work at Confluent over the past couple of years. :)

> We only persist the information for the length of time we retain snapshots.

This seems a bit contradictory to me. We are going to persist (potentially) useless information if we have no signal as to whether the producer is still active. This is the problem we have with old clients. We are always going to have to draw the line between how long we allow a producer to have a gap in producing and how much we allow the cache to fill up with short-lived producers that risk OOM.

With an LRU cache, we run into the same problem, as we will expire all "well-behaved" infrequent producers that last produced before the burst of short-lived clients. The benefit is that we don't have a solid line in the sand and we only expire when we need to, but we will still risk expiring active producers.

I am willing to discuss some solutions that work with older clients, but my concern is spending too much time on a complicated solution and not encouraging movement to newer and better clients.

Justine

On Mon, May 20, 2024 at 9:35 AM Claude Warren <cla...@xenei.com> wrote:

> Why should we persist useless information for clients that are long gone and will never use it?

We are not. We only persist the information for the length of time we retain snapshots. The change here is to make the snapshots work as longer-term storage for infrequent producers and others who would be negatively affected by some of the solutions proposed.

Your changes require changes in the clients. Older clients will not be able to participate. My change does not require client changes. There are issues outside of the ones discussed. I was told of this late last week. I will endeavor to find someone with first-hand knowledge of the issue and have them report on this thread.

In addition, the use of an LRU amortizes the cache cleanup, so we don't need a thread to expire things. You still have the cache; the point is that it really is a cache, and there is storage behind it. Let the cache be a cache; let the snapshots be the storage backing behind the cache.

On Fri, May 17, 2024 at 5:26 PM Justine Olshan <jols...@confluent.io.invalid> wrote:

Respectfully, I don't agree. Why should we persist useless information for clients that are long gone and will never use it? This is why I'm suggesting we do something smarter when it comes to storing data and only store data we actually need and have a use for.

This is why I suggest the heartbeat. It gives us clear information (up to the heartbeat interval) about which producers are worth keeping and which are not. I'm not in favor of building a new and complicated system to try to guess which information is needed. In my mind, if we have a ton of legitimately active producers, we should scale up memory. If we don't, there is no reason to have high memory usage.

Fixing the client also allows us to fix some of the other issues we have with idempotent producers.

Justine
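[For concreteness, a minimal sketch of the heartbeat-driven expiry Justine describes: PIDs that miss their heartbeat window are dropped, so only demonstrably live producers are retained. All names here (HeartbeatPidTracker, etc.) are invented for illustration; this is not the actual broker code.]

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative sketch only: track the last heartbeat per producer ID
    // and expire any PID whose heartbeat is older than the timeout.
    public class HeartbeatPidTracker {
        private final Map<Long, Long> lastHeartbeatMs = new ConcurrentHashMap<>();
        private final long heartbeatTimeoutMs;

        public HeartbeatPidTracker(long heartbeatTimeoutMs) {
            this.heartbeatTimeoutMs = heartbeatTimeoutMs;
        }

        // Called when the broker sees a heartbeat (or a produce request) for a PID.
        public void recordHeartbeat(long producerId, long nowMs) {
            lastHeartbeatMs.put(producerId, nowMs);
        }

        // Drop every PID whose last heartbeat is outside the timeout window.
        public void expireStale(long nowMs) {
            lastHeartbeatMs.values().removeIf(last -> nowMs - last > heartbeatTimeoutMs);
        }
    }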
On Fri, May 17, 2024 at 12:46 AM Claude Warren <cla...@xenei.com> wrote:

I think that the point here is that a design that assumes you can keep all the PIDs in memory for all server configurations, all usages, and all client implementations is fraught with danger.

Yes, there are solutions already in place (KIP-854) that attempt to address this problem, and other proposed solutions have undesirable side effects (e.g. heartbeat interrupted by IP failure for a slow producer with a long delay between posts). KAFKA-16229 (Slow expiration of Producer IDs leading to high CPU usage) dealt with how to expire data from the cache so that there was minimal lag time.

But the net issue is still the underlying design/architecture.

There are a couple of salient points here:

- The state of a state machine is only a view on its transactions. This is the classic stream/table dichotomy.
- What the "cache" is trying to do is create that view.
- In some cases the size of the state exceeds the storage of the cache and the systems fail.
- The current solutions have attempted to place limits on the size of the state.
- Errors in implementation and/or configuration will eventually lead to "problem producers".
- Under the adopted fixes and the current slate of proposals, the "problem producer" solutions have cascading side effects on properly behaved producers (e.g. dropping long-running, slow-producing producers).

For decades (at least since the 1980s, and anecdotally since the 1960s) there has been a solution to processing state where the size of the state exceeds the memory available. It is the solution that drove the idea that you could have tables in Kafka. The idea that we can store the hot PIDs in memory using an LRU and write data to storage so that we can quickly find things not in the cache is not new. It has been proven.

I am arguing that we should not throw away state data because we are running out of memory. We should persist that data to disk and consider the disk the source of truth for state.

Claude

On Wed, May 15, 2024 at 7:42 PM Justine Olshan <jols...@confluent.io.invalid> wrote:

+1 to the comment.

> I still feel we are doing all of this only because of a few anti-pattern or misconfigured producers and not because we have "too many Producers". I believe that implementing a Producer heartbeat and removing short-lived PIDs from the cache if we didn't receive a heartbeat would be simpler and a step in the right direction to improve the idempotent logic, and maybe try to make the PID get reused between sessions, which would implement a real idempotent producer instead of an idempotent session. I admit this wouldn't help with old clients, but it will put us on the right path.

This issue is very complicated and I appreciate the attention on it. Hopefully we can find a good solution working together :)

Justine
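[For concreteness, a rough sketch of the cache-over-storage shape Claude argues for: an LRU of hot PID state whose evictions spill to a snapshot-backed store rather than being discarded, so a slow, infrequent producer's PID survives a burst of short-lived ones. The SnapshotStore interface and every name below are hypothetical illustrations, not Kafka's actual ProducerStateManager.]

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.Optional;

    // Hypothetical backing store: PID state persisted in snapshots on disk.
    interface SnapshotStore {
        void persist(long producerId, ProducerState state);
        Optional<ProducerState> load(long producerId);
    }

    // Placeholder for per-producer state (epoch, last sequence, etc.).
    record ProducerState(short epoch, int lastSequence) {}

    // Sketch of an LRU over snapshot storage: eviction spills state to disk
    // instead of dropping it, and a cache miss falls through to the store.
    class PidLruCache extends LinkedHashMap<Long, ProducerState> {
        private final int maxEntries;
        private final SnapshotStore store;

        PidLruCache(int maxEntries, SnapshotStore store) {
            super(16, 0.75f, true);   // accessOrder = true gives LRU behaviour
            this.maxEntries = maxEntries;
            this.store = store;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<Long, ProducerState> eldest) {
            if (size() > maxEntries) {
                store.persist(eldest.getKey(), eldest.getValue());  // spill, don't drop
                return true;
            }
            return false;
        }

        // Cache miss falls through to the snapshot store and re-warms the cache.
        ProducerState find(long producerId) {
            ProducerState state = get(producerId);
            if (state == null) {
                state = store.load(producerId).orElse(null);
                if (state != null) put(producerId, state);
            }
            return state;
        }
    }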
On Wed, May 15, 2024 at 8:36 AM Omnia Ibrahim <o.g.h.ibra...@gmail.com> wrote:

Also, in the rejected alternatives you listed an approved KIP, which is a bit confusing; can you move this to the motivation section instead?

On 15 May 2024, at 14:35, Claude Warren <cla...@apache.org> wrote:

This is a proposal that should solve the OOM problem on the servers without some of the other proposed KIPs being active.

Full details in
https://cwiki.apache.org/confluence/display/KAFKA/KIP-1044%3A+A+proposal+to+change+idempotent+producer+--+server+implementation

--
LinkedIn: http://www.linkedin.com/in/claudewarren