Can you clarify the intended behavior? If we encounter a producer ID we've not seen before, we are supposed to read from disk and try to find it? I see the proposal mentions bloom filters, but it seems like it would not be cheap to search for the producer ID. I would expect the typical case to be that there is a new producer and we don't need to search state.
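
To make the question concrete, here is a minimal sketch of the lookup path as I read the proposal: check the in-memory LRU first, and go to the snapshot store only if a cheap membership check says the PID might be there. Every name below (PidCacheSketch, SnapshotStore, ProducerStateCache, mightContain) is hypothetical and not taken from Kafka or the KIP. The Bloom filter is replaced by an exact in-memory membership check, and new PIDs are written through to the store immediately; both are simplifying assumptions.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch only: none of these types exist in Kafka or KIP-1044.
final class PidCacheSketch {

    /** Stand-in for the per-producer state a broker would track. */
    static final class ProducerState {
        final long producerId;
        final int lastSequence;

        ProducerState(long producerId, int lastSequence) {
            this.producerId = producerId;
            this.lastSequence = lastSequence;
        }
    }

    /** Stand-in for the snapshot files on disk; a map plays the role of the disk. */
    static final class SnapshotStore {
        private final Map<Long, ProducerState> persisted = new HashMap<>();
        int diskReads = 0; // counts simulated disk lookups

        void persist(ProducerState state) {
            persisted.put(state.producerId, state);
        }

        // Assumption: stands in for a Bloom filter. A real filter could return
        // false positives (never false negatives); this check is exact.
        boolean mightContain(long producerId) {
            return persisted.containsKey(producerId);
        }

        ProducerState load(long producerId) {
            diskReads++;
            return persisted.get(producerId);
        }
    }

    /** LRU cache in front of the store; evicted entries remain in the store. */
    static final class ProducerStateCache {
        private final SnapshotStore store;
        private final LinkedHashMap<Long, ProducerState> lru;

        ProducerStateCache(final int capacity, SnapshotStore store) {
            this.store = store;
            // accessOrder = true keeps the least recently used entry eldest.
            this.lru = new LinkedHashMap<Long, ProducerState>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Long, ProducerState> eldest) {
                    return size() > capacity;
                }
            };
        }

        ProducerState lookup(long producerId) {
            ProducerState state = lru.get(producerId);
            if (state != null) {
                return state; // cache hit
            }
            if (store.mightContain(producerId)) {
                state = store.load(producerId); // cache miss, reload from storage
            }
            if (state == null) {
                // Brand-new producer: no disk search, write through to storage.
                state = new ProducerState(producerId, -1);
                store.persist(state);
            }
            lru.put(producerId, state); // may evict the least recently used PID
            return state;
        }
    }

    public static void main(String[] args) {
        SnapshotStore store = new SnapshotStore();
        ProducerStateCache cache = new ProducerStateCache(2, store);
        cache.lookup(1L);
        cache.lookup(2L);
        cache.lookup(3L); // capacity is 2, so PID 1 leaves the cache
        cache.lookup(1L); // PID 1 is reloaded from the store
        System.out.println("disk reads: " + store.diskReads);
    }
}
```

In this sketch the common case (a new producer) never touches the store, but that only holds if the membership check is cheap and mostly accurate. Nothing here ever deletes from the store, which is the retention question.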
And we intend to keep all producers we've ever seen on the cluster? I didn't see a mechanism to delete any of the information in the snapshots. Currently the snapshot logic is decoupled from the log retention as of KIP-360.

Justine

On Mon, May 20, 2024 at 11:20 PM Claude Warren <cla...@xenei.com> wrote:

> The LRU cache is just that: a cache, so yes things expire from the cache
> but they are not gone. As long as a snapshot containing the PID is
> available the PID can be found and reloaded into the cache (which is
> exactly what I would expect it to do).
>
> The question of how long a PID is resolvable then becomes a question of how
> long are snapshots retained.
>
> There are, in my mind, several advantages:
>
> 1. The in-memory cache can be smaller, reducing the memory footprint.
>    This is not required but is possible.
> 2. PIDs are never discarded because they are produced by slow
>    producers. They are discarded when the snapshots containing them
>    expire.
> 3. The length of time between when a PID is received by the server and
>    when it is recorded to a snapshot is significantly reduced.
>    Significantly reducing the window where PIDs can be lost.
> 4. Throttling and other changes you wish to make to the cache are still
>    possible.
>
>
> On Mon, May 20, 2024 at 7:32 PM Justine Olshan
> <jols...@confluent.io.invalid> wrote:
>
> > My team has looked at it from a high level, but we haven't had the time to
> > come up with a full proposal.
> >
> > I'm not aware if others have worked on it.
> >
> > Justine
> >
> > On Mon, May 20, 2024 at 10:21 AM Omnia Ibrahim <o.g.h.ibra...@gmail.com>
> > wrote:
> >
> > > Hi Justine are you aware of anyone looking into such new protocol at the
> > > moment?
> > >
> > > > On 20 May 2024, at 18:00, Justine Olshan <jols...@confluent.io.INVALID> wrote:
> > > >
> > > > I would say I have first hand knowledge of this issue as someone who
> > > > responds to such incidents as part of my work at Confluent over the past
> > > > couple years. :)
> > > >
> > > >> We only persist the information for the length of time we retain
> > > >> snapshots.
> > > >
> > > > This seems a bit contradictory to me. We are going to persist (potentially)
> > > > useless information if we have no signal if the producer is still active.
> > > > This is the problem we have with old clients. We are always going to have
> > > > to draw the line for how long we allow a producer to have a gap in
> > > > producing vs how long we allow filling up with short-lived producers that
> > > > risk OOM.
> > > >
> > > > With an LRU cache, we run into the same problem, as we will expire all
> > > > "well-behaved" infrequent producers that last produced before the burst of
> > > > short-lived clients. The benefit is that we don't have a solid line in the
> > > > sand and we only expire when we need to, but we will still risk expiring
> > > > active producers.
> > > >
> > > > I am willing to discuss some solutions that work with older clients, but my
> > > > concern is spending too much time on a complicated solution and not
> > > > encouraging movement to newer and better clients.
> > > >
> > > > Justine
> > > >
> > > > On Mon, May 20, 2024 at 9:35 AM Claude Warren <cla...@xenei.com> wrote:
> > > >
> > > >>>
> > > >>> Why should we persist useless information
> > > >>> for clients that are long gone and will never use it?
> > > >>
> > > >>
> > > >> We are not. We only persist the information for the length of time we
> > > >> retain snapshots.
> > > >> The change here is to make the snapshots work as longer
> > > >> term storage for infrequent producers and others who would be negatively
> > > >> affected by some of the solutions proposed.
> > > >>
> > > >> Your changes require changes in the clients. Older clients will not be
> > > >> able to participate. My change does not require client change.
> > > >> There are issues outside of the ones discussed. I was told of this late
> > > >> last week. I will endeavor to find someone with first hand knowledge of
> > > >> the issue and have them report on this thread.
> > > >>
> > > >> In addition, the use of an LRU amortizes the cache cleanup so we don't need
> > > >> a thread to expire things. You still have the cache, the point is that it
> > > >> really is a cache, there is storage behind it. Let the cache be a cache,
> > > >> let the snapshots be the storage backing behind the cache.
> > > >>
> > > >> On Fri, May 17, 2024 at 5:26 PM Justine Olshan
> > > >> <jols...@confluent.io.invalid> wrote:
> > > >>
> > > >>> Respectfully, I don't agree. Why should we persist useless information
> > > >>> for clients that are long gone and will never use it?
> > > >>> This is why I'm suggesting we do something smarter when it comes to storing
> > > >>> data and only store data we actually need and have a use for.
> > > >>>
> > > >>> This is why I suggest the heartbeat. It gives us clear information (up to
> > > >>> the heartbeat interval) of which producers are worth keeping and which
> > > >>> are not.
> > > >>> I'm not in favor of building a new and complicated system to try to guess
> > > >>> which information is needed. In my mind, if we have a ton of legitimately
> > > >>> active producers, we should scale up memory. If we don't there is no reason
> > > >>> to have high memory usage.
> > > >>>
> > > >>> Fixing the client also allows us to fix some of the other issues we have
> > > >>> with idempotent producers.
> > > >>>
> > > >>> Justine
> > > >>>
> > > >>> On Fri, May 17, 2024 at 12:46 AM Claude Warren <cla...@xenei.com> wrote:
> > > >>>
> > > >>>> I think that the point here is that the design that assumes that you can
> > > >>>> keep all the PIDs in memory for all server configurations and all usages
> > > >>>> and all client implementations is fraught with danger.
> > > >>>>
> > > >>>> Yes, there are solutions already in place (KIP-854) that attempt to address
> > > >>>> this problem, and other proposed solutions to remove that have undesirable
> > > >>>> side effects (e.g. Heartbeat interrupted by IP failure for a slow producer
> > > >>>> with a long delay between posts). KAFKA-16229 (Slow expiration of Producer
> > > >>>> IDs leading to high CPU usage) dealt with how to expire data from the cache
> > > >>>> so that there was minimal lag time.
> > > >>>>
> > > >>>> But the net issue is still the underlying design/architecture.
> > > >>>>
> > > >>>> There are a couple of salient points here:
> > > >>>>
> > > >>>> - The state of a state machine is only a view on its transactions. This
> > > >>>>   is the classic stream / table dichotomy.
> > > >>>> - What the "cache" is trying to do is create that view.
> > > >>>> - In some cases the size of the state exceeds the storage of the cache
> > > >>>>   and the systems fail.
> > > >>>> - The current solutions have attempted to place limits on the size of
> > > >>>>   the state.
> > > >>>> - Errors in implementation and/or configuration will eventually lead to
> > > >>>>   "problem producers"
> > > >>>> - Under the adopted fixes and current slate of proposals, the "problem
> > > >>>>   producers" solutions have cascading side effects on properly behaved
> > > >>>>   producers. (e.g. dropping long running, slow producing producers)
> > > >>>>
> > > >>>> For decades (at least since the 1980's and anecdotally since the 1960's)
> > > >>>> there has been a solution to processing state where the size of the state
> > > >>>> exceeded the memory available. It is the solution that drove the idea that
> > > >>>> you could have tables in Kafka. The idea that we can store the hot PIDs in
> > > >>>> memory using an LRU and write data to storage so that we can quickly find
> > > >>>> things not in the cache is not new. It has been proven.
> > > >>>>
> > > >>>> I am arguing that we should not throw away state data because we are
> > > >>>> running out of memory. We should persist that data to disk and consider
> > > >>>> the disk as the source of truth for state.
> > > >>>>
> > > >>>> Claude
> > > >>>>
> > > >>>>
> > > >>>> On Wed, May 15, 2024 at 7:42 PM Justine Olshan
> > > >>>> <jols...@confluent.io.invalid> wrote:
> > > >>>>
> > > >>>>> +1 to the comment.
> > > >>>>>
> > > >>>>>> I still feel we are doing all of this only because of a few anti-pattern
> > > >>>>>> or misconfigured producers and not because we have “too many Producer”.
> > > >>>>>> I believe that implementing Producer heartbeat and removing short-lived PIDs
> > > >>>>>> from the cache if we didn’t receive a heartbeat will be simpler and a step
> > > >>>>>> in the right direction to improve idempotent logic, and maybe try to make PIDs
> > > >>>>>> get reused between sessions, which will implement a real idempotent producer
> > > >>>>>> instead of an idempotent session. I admit this wouldn’t help with old clients
> > > >>>>>> but it will put us on the right path.
> > > >>>>>
> > > >>>>> This issue is very complicated and I appreciate the attention on it.
> > > >>>>> Hopefully we can find a good solution working together :)
> > > >>>>>
> > > >>>>> Justine
> > > >>>>>
> > > >>>>> On Wed, May 15, 2024 at 8:36 AM Omnia Ibrahim <o.g.h.ibra...@gmail.com> wrote:
> > > >>>>>
> > > >>>>>> Also, in the rejected alternatives you listed an approved KIP, which is a
> > > >>>>>> bit confusing. Can you move this to motivations instead?
> > > >>>>>>
> > > >>>>>>> On 15 May 2024, at 14:35, Claude Warren <cla...@apache.org> wrote:
> > > >>>>>>>
> > > >>>>>>> This is a proposal that should solve the OOM problem on the servers without
> > > >>>>>>> some of the other proposed KIPs being active.
> > > >>>>>>>
> > > >>>>>>> Full details in
> > > >>>>>>>
> > > >>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1044%3A+A+proposal+to+change+idempotent+producer+--+server+implementation
> > > >>>>>>
> > > >>>>
> > > >>>> --
> > > >>>> LinkedIn: http://www.linkedin.com/in/claudewarren
> > > >>
> > > >> --
> > > >> LinkedIn: http://www.linkedin.com/in/claudewarren
>
> --
> LinkedIn: http://www.linkedin.com/in/claudewarren