Bharath Vissapragada created KAFKA-17862:
--------------------------------------------
Summary: [buffer pool] corruption during buffer reuse from the pool
Key: KAFKA-17862
URL: https://issues.apache.org/jira/browse/KAFKA-17862
Project: Kafka
Issue Type: Bug
Components: core
Affects Versions: 3.7.1
Reporter: Bharath Vissapragada
Attachments: client-config.txt
We noticed malformed batches sent by the Kafka Java client to Redpanda under
certain conditions, which caused excessive client retries. We narrowed it down
to a client bug: buffers reused from the buffer pool get corrupted.
We were able to reproduce it with Kafka brokers too, so we are fairly certain
the bug is in the client.
(Attached the full client config, fwiw)
We traced it to a race condition between produce requests and failed
batch expiration. If the network flush of a produce request races with the
expiration, the produce batch that the request uses is corrupted, so a
malformed batch is sent to the broker.
The expiration is triggered by a timeout:
[https://github.com/apache/kafka/blob/2c6fb6c54472e90ae17439e62540ef3cb0426fe3/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L392C13-L392C22]
This eventually deallocates the batch:
[https://github.com/apache/kafka/blob/2c6fb6c54472e90ae17439e62540ef3cb0426fe3/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L773]
The deallocation adds the batch's buffer back to the buffer pool:
[https://github.com/apache/kafka/blob/661bed242e8d7269f134ea2f6a24272ce9b720e9/clients/src/main/java/org/apache/kafka/clients/producer/internals/RecordAccumulator.java#L1054]
At that point the buffer is probably all zeroed out, or a competing producer
requests a new append that reuses this freed-up buffer and starts writing to
it, corrupting its contents.
If the network flush of a produce batch backed by this buffer races with that
reuse, a corrupt batch is sent to the broker, resulting in a CRC mismatch.
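
To make the interleaving concrete, here is a minimal, single-threaded sketch
that replays the suspected order of events. It is not the Kafka client code:
TinyPool, crcOf and replayRace are hypothetical names made up for this
illustration, and the CRC here is plain java.util.zip.CRC32 over the payload
rather than the real record batch checksum.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.zip.CRC32;

class BufferReuseSketch {

    /** Hypothetical, much simplified stand-in for a producer buffer pool. */
    static final class TinyPool {
        private final Deque<ByteBuffer> free = new ArrayDeque<>();
        private final int size;

        TinyPool(int size) {
            this.size = size;
        }

        /** Hands out a previously freed buffer if one exists, else a new one. */
        ByteBuffer allocate() {
            ByteBuffer b = free.pollFirst();
            return b != null ? b : ByteBuffer.allocate(size);
        }

        /** Returns a buffer to the pool; nothing checks who else still reads it. */
        void deallocate(ByteBuffer b) {
            b.clear();
            free.addFirst(b);
        }
    }

    /** CRC32 over the first 'length' bytes of the buffer's backing array. */
    static long crcOf(ByteBuffer buf, int length) {
        CRC32 crc = new CRC32();
        crc.update(buf.array(), 0, length);
        return crc.getValue();
    }

    /**
     * Replays the suspected interleaving sequentially and returns true if the
     * bytes read at "flush" time no longer match the CRC taken at "close" time.
     */
    static boolean replayRace() {
        TinyPool pool = new TinyPool(64);

        // 1. A batch is built in a pooled buffer and its checksum is fixed.
        ByteBuffer batch = pool.allocate();
        byte[] payload = "records-of-batch-A".getBytes(StandardCharsets.UTF_8);
        batch.put(payload);
        long crcAtClose = crcOf(batch, payload.length);

        // 2. The in-flight request keeps a view over the same memory.
        ByteBuffer inFlightView = batch.duplicate();
        inFlightView.flip();

        // 3. Expiration fires and the buffer goes back to the pool.
        pool.deallocate(batch);

        // 4. A competing append gets the same buffer and overwrites it.
        ByteBuffer reused = pool.allocate();
        reused.put("records-of-batch-B".getBytes(StandardCharsets.UTF_8));

        // 5. The delayed network flush now reads the overwritten bytes.
        long crcAtFlush = crcOf(inFlightView, payload.length);
        return crcAtFlush != crcAtClose;
    }

    public static void main(String[] args) {
        System.out.println("corrupted on flush: " + replayRace());
    }
}
```

In the real client steps 3 to 5 would run on different threads, so the outcome
depends on timing; the sketch fixes the order only to show why a buffer that is
still referenced by an unsent request must not be returned to the pool.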
This issue can be easily reproduced in a simulated environment that triggers
frequent timeouts (e.g. by lowering the timeouts) while running a producer
with high-ish throughput.
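
As a rough starting point for such a reproduction, the sketch below builds
producer properties with aggressively low timeouts. The report does not say
which timeouts were lowered or to what values, so the choice of
delivery.timeout.ms, request.timeout.ms and linger.ms and the numbers used are
assumptions, and the attached client-config.txt is not reproduced here. The
class and method names are hypothetical. The only constraint encoded is the
producer's documented requirement that delivery.timeout.ms be at least
linger.ms + request.timeout.ms.

```java
import java.util.Properties;

class LowTimeoutProducerConfig {

    /**
     * Builds producer properties with low timeouts so that batches expire
     * often. All values are illustrative assumptions, not taken from the report.
     * Serializers and any other required settings are left to the caller.
     */
    static Properties lowTimeoutProps(String bootstrapServers) {
        Properties p = new Properties();
        p.setProperty("bootstrap.servers", bootstrapServers);
        p.setProperty("linger.ms", "0");
        p.setProperty("request.timeout.ms", "200");
        // Must be >= linger.ms + request.timeout.ms, otherwise the producer
        // rejects the configuration at construction time.
        p.setProperty("delivery.timeout.ms", "300");
        return p;
    }

    /** Checks delivery.timeout.ms >= linger.ms + request.timeout.ms. */
    static boolean satisfiesDeliveryConstraint(Properties p) {
        long linger = Long.parseLong(p.getProperty("linger.ms"));
        long request = Long.parseLong(p.getProperty("request.timeout.ms"));
        long delivery = Long.parseLong(p.getProperty("delivery.timeout.ms"));
        return delivery >= linger + request;
    }

    public static void main(String[] args) {
        Properties p = lowTimeoutProps("localhost:9092");
        System.out.println("constraint ok: " + satisfiesDeliveryConstraint(p));
    }
}
```

These properties would then be passed to a producer that sends continuously
from several threads, ideally against a broker or network path slow enough
that requests are still unsent when their batches expire.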
--
This message was sent by Atlassian Jira
(v8.20.10#820010)