[
https://issues.apache.org/jira/browse/KAFKA-17862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kirk True reassigned KAFKA-17862:
---------------------------------
Assignee: xuanzhang gong (was: Kirk True)
> [buffer pool] corruption during buffer reuse from the pool
> ----------------------------------------------------------
>
> Key: KAFKA-17862
> URL: https://issues.apache.org/jira/browse/KAFKA-17862
> Project: Kafka
> Issue Type: Bug
> Components: clients, core, producer
> Affects Versions: 3.7.1
> Reporter: Bharath Vissapragada
> Assignee: xuanzhang gong
> Priority: Blocker
> Attachments: client-config.txt
>
>
> We noticed malformed batches from the Kafka Java client against Redpanda
> under certain conditions, which caused excessive client retries. We narrowed
> it down to a client bug: buffers reused from the buffer pool get corrupted.
> We were able to reproduce it with Kafka brokers too, so we are fairly
> certain the bug is on the client.
> (Attached the full client config, fwiw)
> We narrowed it down to a race condition between produce requests and failed
> batch expiration. If the network flush of a produce request races with the
> expiration, the produce batch that the request uses is corrupted, so a
> malformed batch is sent to the broker.
> The expiration is triggered by a timeout
> [https://github.com/apache/kafka/blob/2c6fb6c54472e90ae17439e62540ef3cb0426fe3/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L392C13-L392C22]
> that eventually deallocates the batch
> [https://github.com/apache/kafka/blob/2c6fb6c54472e90ae17439e62540ef3cb0426fe3/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L773]
> adding it back to the buffer pool
> [https://github.com/apache/kafka/blob/661bed242e8d7269f134ea2f6a24272ce9b720e9/clients/src/main/java/org/apache/kafka/clients/producer/internals/RecordAccumulator.java#L1054]
> At that point the buffer has probably been zeroed out, or a competing
> producer append has reused the freed buffer and started writing to it,
> corrupting its contents.
> If there is a racing network flush of a produce batch backed by this buffer, a
> corrupt batch is sent to the broker resulting in a CRC mismatch.
> This issue can be easily reproduced in a simulated environment that triggers
> frequent timeouts (e.g. by lowering timeouts), combined with a producer with
> high-ish throughput that causes longer queues (hence a higher chance of
> expiration) and frequent buffer reuse from the pool (a deadly combination :)).
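
The interleaving described above can be sketched as a small, single-threaded Java simulation. This is only an illustration under stated assumptions, not the Kafka client code: `TinyPool`, `crcOf`, and `crcSurvivesReuse` are hypothetical names, the pool is a simplified stand-in for the producer's buffer pool, and the race is replayed as a fixed sequence of steps rather than with real threads.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.zip.CRC32;

public class BufferReuseSketch {

    /** Hypothetical fixed-size pool; a simplified stand-in, NOT Kafka's buffer pool. */
    static final class TinyPool {
        private final Deque<ByteBuffer> free = new ArrayDeque<>();
        private final int size;

        TinyPool(int size) {
            this.size = size;
        }

        /** Hands out a previously freed buffer if one exists, else a new one. */
        ByteBuffer allocate() {
            ByteBuffer b = free.pollFirst();
            return b != null ? b : ByteBuffer.allocate(size);
        }

        /** Returns the buffer to the pool; any other holder of it still sees the same memory. */
        void deallocate(ByteBuffer b) {
            b.clear();
            free.addFirst(b);
        }
    }

    /** CRC32 over the first {@code length} bytes of a heap buffer's backing array. */
    static long crcOf(ByteBuffer buf, int length) {
        CRC32 crc = new CRC32();
        crc.update(buf.array(), 0, length);
        return crc.getValue();
    }

    /** Replays the interleaving; returns true only if the batch is still intact at flush time. */
    static boolean crcSurvivesReuse() {
        TinyPool pool = new TinyPool(64);

        // 1. A batch is written and its CRC is recorded when the batch is closed.
        ByteBuffer batch = pool.allocate();
        byte[] payload = "batch-A records".getBytes(StandardCharsets.UTF_8);
        batch.put(payload);
        long crcAtClose = crcOf(batch, payload.length);

        // 2. The in-flight produce request still refers to the same memory.
        ByteBuffer inFlight = batch.duplicate();

        // 3. The batch expires and its buffer goes back to the pool.
        pool.deallocate(batch);

        // 4. A competing append gets the same buffer and overwrites it.
        ByteBuffer reused = pool.allocate();
        reused.put("batch-B records".getBytes(StandardCharsets.UTF_8));

        // 5. The network flush now reads the overwritten bytes.
        long crcAtFlush = crcOf(inFlight, payload.length);
        return crcAtClose == crcAtFlush;
    }

    public static void main(String[] args) {
        System.out.println("CRC intact after reuse: " + crcSurvivesReuse());
    }
}
```

In this sketch the CRC computed at flush time no longer matches the one recorded when the batch was closed, which mirrors the CRC mismatch reported by the broker. The point of the illustration is that returning a buffer to the pool while a send still references it lets a later append overwrite bytes that have not yet been flushed.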