[ 
https://issues.apache.org/jira/browse/KAFKA-17862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17895448#comment-17895448
 ] 

Chia-Ping Tsai commented on KAFKA-17862:
----------------------------------------

[~bharathv] sorry for the late response.

{quote}
Race is probably a mischaracterization, it is exactly as you described but IIUC 
it may take multiple poll()s for the data to be actually written?  Say the 
sequence is
{quote}

You're correct that we only remove the expired batch from the sender, so the 
network client can still send it out. However, this case should be rare: the 
default delivery.timeout.ms is 120s and request.timeout.ms is 30s, so the 
in-flight request will usually expire before the batch does.
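
For reference, a minimal sketch of the two settings involved, using the real ProducerConfig keys and the documented default values (the broker address is a placeholder):

{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class DefaultTimeouts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        // Documented defaults: the in-flight request times out (30s) long before
        // the batch's delivery deadline (120s), so an expired batch is rarely
        // still referenced by an in-flight produce request.
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30_000);
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000);
    }
}
{code}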

{quote}
This issue can be easily reproduced in a simulated environment that triggers 
frequent timeouts (eg: lower timeouts) and then use a producer with high-ish 
throughput that can cause longer queues (hence higher chances of expiration) 
and frequent buffer reuse from the pool (deadly combination )
{quote}

Yes, this bug can happen if both timeouts are equal and small.
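
For reproduction, settings along these lines should widen the window considerably (the values are illustrative, not a tested recipe): equal, short timeouts, no lingering, and a small buffer pool so freed buffers are reused quickly under sustained throughput.

{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ReproTimeouts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        // Equal, short timeouts: a batch can hit delivery.timeout.ms while its
        // produce request is still in flight.
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 5_000);
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 5_000);
        props.put(ProducerConfig.LINGER_MS_CONFIG, 0);
        // Small pool so deallocated buffers are handed out again right away.
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 1024 * 1024);
    }
}
{code}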

It’s challenging to expire a request based on a single batch, since a produce 
request can contain multiple batches. Perhaps we should avoid expiring in-flight 
batches altogether. The side effect is that batch expiration may not strictly 
honor delivery.timeout.ms, but I believe corrupted data is even less acceptable.
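
To make the interleaving concrete, here is a toy, self-contained model of the sequence from the description (an ArrayDeque stands in for BufferPool, plain threads stand in for the network send and a competing appender); it is not Kafka's actual code path:

{code:java}
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.TimeUnit;

public class BufferReuseRace {
    // Stand-in for BufferPool: freed buffers are handed back out verbatim.
    private static final Deque<ByteBuffer> pool = new ArrayDeque<>();

    static synchronized ByteBuffer allocate(int size) {
        ByteBuffer b = pool.pollFirst();
        return b != null ? b : ByteBuffer.allocate(size);
    }

    static synchronized void deallocate(ByteBuffer b) {
        b.clear();
        pool.addFirst(b); // buffer is now eligible for reuse
    }

    public static void main(String[] args) throws Exception {
        ByteBuffer batch = allocate(16);
        batch.put("batch-A-payload!".getBytes());

        // "Network" thread: the produce request referencing this buffer is
        // still queued and only gets written out a little later.
        Thread network = new Thread(() -> {
            try {
                TimeUnit.MILLISECONDS.sleep(100);
            } catch (InterruptedException ignored) { }
            byte[] wire = new byte[16];
            ByteBuffer view = batch.duplicate();
            view.rewind();
            view.get(wire);
            System.out.println("bytes sent on the wire: " + new String(wire));
        });
        network.start();

        // Expiration path: the batch is declared expired and its buffer is
        // returned to the pool while the request above is still pending.
        deallocate(batch);

        // A competing append grabs the same buffer and overwrites it.
        ByteBuffer reused = allocate(16);
        reused.put("batch-B-payload!".getBytes());

        network.join();
        // Prints "batch-B-payload!": the pending request no longer carries the
        // bytes its CRC was computed over, hence the broker-side CRC mismatch.
    }
}
{code}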


> [buffer pool] corruption during buffer reuse from the pool
> ----------------------------------------------------------
>
>                 Key: KAFKA-17862
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17862
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, core, producer 
>    Affects Versions: 3.7.1
>            Reporter: Bharath Vissapragada
>            Priority: Major
>         Attachments: client-config.txt
>
>
> We noticed malformed batches from the Kafka Java client + Redpanda under 
> certain conditions that caused excessive client retries and we narrowed it 
> down to a client bug related to corruption of buffers reused from the buffer 
> pool. We were able to reproduce it with Kafka brokers too, so we are fairly 
> certain the bug is on the client.
> (Attached the full client config, fwiw)
> We narrowed it down to a race condition between produce requests and failed 
> batch expiration. If the network flush of a produce request races with the 
> expiration, the produce batch that the request uses is corrupted, so a 
> malformed batch is sent to the broker.
> The expiration is triggered by a timeout 
> [https://github.com/apache/kafka/blob/2c6fb6c54472e90ae17439e62540ef3cb0426fe3/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L392C13-L392C22]
> that eventually deallocates the batch
> [https://github.com/apache/kafka/blob/2c6fb6c54472e90ae17439e62540ef3cb0426fe3/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L773]
> adding it back to the buffer pool
> [https://github.com/apache/kafka/blob/661bed242e8d7269f134ea2f6a24272ce9b720e9/clients/src/main/java/org/apache/kafka/clients/producer/internals/RecordAccumulator.java#L1054]
> Now it is probably all zeroed out, or there is a competing producer that 
> requests a new append that reuses this freed-up buffer and starts writing to 
> it, corrupting its contents.
> If there is a racing network flush of a produce batch backed by this buffer, a 
> corrupt batch is sent to the broker, resulting in a CRC mismatch. 
> This issue can be easily reproduced in a simulated environment that triggers 
> frequent timeouts (eg: lower timeouts) and then use a producer with high-ish 
> throughput that can cause longer queues (hence higher chances of expiration) 
> and frequent buffer reuse from the pool (deadly combination :))


