I agree with Justine, especially considering that the number of producers on a
Kafka cluster is usually very limited.
This makes me think we are focusing on the symptom in my KIP-936 and this KIP,
which is the memory issue, instead of addressing the root cause.
The root cause is that the idempotent session protocol tolerates and allows
short-lived session metadata (PID and state) to crowd the cluster.
So, similar to Justine, I am more in favour of considering a new protocol that
offers true idempotent producer capabilities across sessions, that can also
truly identify the producer client.
However, if we can address the memory issue in the meantime with a simple
solution that protects against anti-pattern or misconfigured uses of the
idempotent session protocol, this would be a win until we come up with a new
protocol.
So far, both KIP-936 and this KIP propose somewhat complicated solutions to
this symptom.
Maybe we should divide the focus into:
• Finding a simple way to protect against OOM caused by short-lived
idempotent sessions. This might be a bit complicated as identifying short-lived
producers is tricky with the current protocol.
For instance, we could revisit the rejected alternative in KIP-936 of
throttling INIT_PID requests. It is not perfect, but it is the simplest.
• Developing Idempotent Protocol V2 that addresses this issue and the
client-side issues with idempotent producers. @Justine, are you aware of anyone
looking into such a protocol in detail at the moment?
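
To make the throttling idea a bit more concrete, here is a minimal sketch of a
per-principal token bucket in front of INIT_PID handling. It is only an
illustration and not the design described in KIP-936: the class names
(TokenBucket, InitPidThrottle), the rate and burst parameters, and keying by a
principal string are all assumptions made for the example, and a real broker
change would be written in Java against the existing quota framework.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class TokenBucket:
    """Allows `rate` requests per second on average, with bursts up to `burst`."""
    rate: float
    burst: float
    tokens: float
    last_refill: float

    def try_acquire(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class InitPidThrottle:
    """Hypothetical per-principal limiter for INIT_PID requests."""

    def __init__(self, rate: float, burst: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock  # injectable so the behaviour can be tested
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, principal: str) -> bool:
        # True means the request may allocate a new PID; False means the
        # caller should throttle or reject it.
        now = self.clock()
        bucket = self.buckets.get(principal)
        if bucket is None:
            # A new principal starts with a full bucket.
            bucket = TokenBucket(self.rate, self.burst, self.burst, now)
            self.buckets[principal] = bucket
        return bucket.try_acquire(now)
```

The sketch keeps one small bucket per principal rather than per PID, so its own
memory use grows with the number of clients and not with the number of
short-lived sessions. It also shows the weakness of the approach: it limits how
fast PIDs are created, but cannot tell a misbehaving producer from a legitimate
one that restarts often.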
Omnia
> On 17 May 2024, at 16:26, Justine Olshan <[email protected]> wrote:
>
> Respectfully, I don't agree. Why should we persist useless information
> for clients that are long gone and will never use it?
> This is why I'm suggesting we do something smarter when it comes to storing
> data and only store data we actually need and have a use for.
>
> This is why I suggest the heartbeat. It gives us clear information (up to
> the heartbeat interval) of which producers are worth keeping and which are
> not.
> I'm not in favor of building a new and complicated system to try to guess
> which information is needed. In my mind, if we have a ton of legitimately
> active producers, we should scale up memory. If we don't, there is no reason
> to have high memory usage.
>
> Fixing the client also allows us to fix some of the other issues we have
> with idempotent producers.
>
> Justine
>
> On Fri, May 17, 2024 at 12:46 AM Claude Warren <[email protected]> wrote:
>
>> I think that the point here is that the design that assumes that you can
>> keep all the PIDs in memory for all server configurations and all usages
>> and all client implementations is fraught with danger.
>>
>> Yes, there are solutions already in place (KIP-854) that attempt to address
>> this problem, and other proposed solutions to remove that have undesirable
>> side effects (e.g. Heartbeat interrupted by IP failure for a slow producer
>> with a long delay between posts). KAFKA-16229 (Slow expiration of Producer
>> IDs leading to high CPU usage) dealt with how to expire data from the cache
>> so that there was minimal lag time.
>>
>> But the net issue is still the underlying design/architecture.
>>
>> There are a couple of salient points here:
>>
>> - The state of a state machine is only a view on its transactions. This
>> is the classic stream / table dichotomy.
>> - What the "cache" is trying to do is create that view.
>> - In some cases the size of the state exceeds the storage of the cache
>> and the systems fail.
>> - The current solutions have attempted to place limits on the size of
>> the state.
>> - Errors in implementation and/or configuration will eventually lead to
>> "problem producers".
>> - Under the adopted fixes and current slate of proposals, the "problem
>> producers" solutions have cascading side effects on properly behaved
>> producers. (e.g. dropping long running, slow producing producers)
>>
>> For decades (at least since the 1980's and anecdotally since the 1960's)
>> there has been a solution to processing state where the size of the state
>> exceeded the memory available. It is the solution that drove the idea that
>> you could have tables in Kafka. The idea that we can store the hot PIDs in
>> memory using an LRU and write data to storage so that we can quickly find
>> things not in the cache is not new. It has been proven.
>>
>> I am arguing that we should not throw away state data because we are
>> running out of memory. We should persist that data to disk and consider
>> the disk as the source of truth for state.
>>
>> Claude
>>
>>
>> On Wed, May 15, 2024 at 7:42 PM Justine Olshan <[email protected]> wrote:
>>
>>> +1 to the comment.
>>>
>>>> I still feel we are doing all of this only because of a few anti-pattern
>>>> or misconfigured producers and not because we have "too many producers". I
>>>> believe that implementing a producer heartbeat and removing short-lived
>>>> PIDs from the cache if we don't receive a heartbeat would be simpler and a
>>>> step in the right direction to improve the idempotent logic, and maybe we
>>>> could try to make PIDs get reused between sessions, which would implement
>>>> a real idempotent producer instead of an idempotent session. I admit this
>>>> wouldn't help with old clients, but it would put us on the right path.
>>>
>>> This issue is very complicated and I appreciate the attention on it.
>>> Hopefully we can find a good solution working together :)
>>>
>>> Justine
>>>
>>> On Wed, May 15, 2024 at 8:36 AM Omnia Ibrahim <[email protected]>
>>> wrote:
>>>
>>>> Also, in the rejected alternatives you listed an approved KIP, which is a
>>>> bit confusing. Can you move this to the motivation instead?
>>>>
>>>>> On 15 May 2024, at 14:35, Claude Warren <[email protected]> wrote:
>>>>>
>>>>> This is a proposal that should solve the OOM problem on the servers without
>>>>> some of the other proposed KIPs being active.
>>>>>
>>>>> Full details in
>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1044%3A+A+proposal+to+change+idempotent+producer+--+server+implementation
>>>>
>>>>
>>>
>>
>>
>> --
>> LinkedIn: http://www.linkedin.com/in/claudewarren
>>