Hi Claude,

Thanks for raising this KIP. It is an interesting idea. I had a quick review of the KIP and I have a few notes.

10.
> The issue is that the number of PIDs that need to be tracked has exploded and
> has resulted in OOM failures that cause the entire cluster to crash. There
> are multiple efforts underway to mitigate the OOM problem through cache
> cleanup and throttling of clients.
I think we should clarify here that this has only happened when the cluster has an abusive or misconfigured client that initialises too many PIDs. For example, I have seen this issue 3 times:

1. The first was caused by an application that kept re-initialising the producer on every single error message it received from Kafka instead of retrying or skipping the records. This one took longer to fill the memory, ~24hr (this was before the 24hr expiration), but eventually it did.
2. One producer deployment was stuck in a crash loop, which created >500,000 PIDs in a few hours due to a misconfiguration that led the application to crash after sending the first batch.
3. Another encounter was a producer initialising a PID on each record, which is an anti-pattern and led to the creation of 1M PIDs in a few hours.

So technically this OOM has only happened when we get a small number of misconfigured producers or anti-pattern designs.

11. Another thing that may be worth pointing out here is KIP-936, as throttling is the other option we are weighing against in this KIP.

12. I feel the motivation isn’t clear enough for people who aren’t familiar with this OOM issue, especially since not a lot of people have experienced it.

13. I still feel we are doing all of this only because of a few anti-pattern or misconfigured producers, and not because we have “too many producers”. I believe that implementing a producer heartbeat, and removing short-lived PIDs from the cache if we don’t receive a heartbeat, would be simpler and a step in the right direction for improving the idempotent logic. We could maybe also try to make PIDs get reused between sessions, which would implement a real idempotent producer instead of an idempotent session. I admit this wouldn’t help with old clients, but it would put us on the right path.

Omnia

> On 15 May 2024, at 14:35, Claude Warren <cla...@apache.org> wrote:
>
> This is a proposal that should solve the OOM problem on the servers without
> some of the other proposed KIPs being active.
>
> Full details in
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1044%3A+A+proposal+to+change+idempotent+producer+--+server+implementation