[ 
https://issues.apache.org/jira/browse/KAFKA-19012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18042477#comment-18042477
 ] 

Donny Nadolny edited comment on KAFKA-19012 at 12/3/25 12:27 PM:
-----------------------------------------------------------------

I can reproduce this issue, I've found what's causing it, and I have a fix!

I've added a summary at the bottom so feel free to skip ahead, but this took so 
long to do that I have to tell the story.

 

(First, sorry for not replying earlier. I had tried to use that branch with 
extra instrumentation, but it isn't built against the client version we're 
using; I think it matched what one of the other reporters of this issue used. 
And when I went to use it, I found that because of complexity in our build 
system it's a lot tougher than it should be to run a fork, so I was hesitant to 
ask for it to be rebased onto our client version until I knew I'd be able to 
use it.)
h2. The reproduction

Broadly the approach is to publish messages where the content of the message is 
the name of the topic it should be published to, run a consumer that checks all 
messages, and then induce failures in the brokers.

I set up a 5-broker cluster running 3.8.1 and clients running 3.3.1, which are 
the versions we've been using. I made a producer that produces randomly to 300 
different topics, sometimes one-off messages, sometimes small bursts, with many 
threads sharing one KafkaProducer. The message content is always the name of 
the topic, and it adds a header with a unique number. Then there's a consumer 
that consumes from all these topics, verifies that the topic name in each 
message matches the topic it arrived on, and prints out any mismatch.
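
For concreteness, here's a minimal sketch of that kind of harness (illustrative only - the broker address, topic names, thread count, linger value and the "seq" header key are placeholders, not my actual test code):

{code:java}
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.Random;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class TopicMisroutingRepro {

    // Placeholder topics; the real test spread load across ~300 of them.
    static final List<String> TOPICS = List.of("repro-0", "repro-1", "repro-2");

    public static void main(String[] args) {
        Properties pp = new Properties();
        pp.put("bootstrap.servers", "localhost:9092");
        pp.put("key.serializer", StringSerializer.class.getName());
        pp.put("value.serializer", StringSerializer.class.getName());
        pp.put("linger.ms", "50"); // misrouting was only ever seen with non-zero linger

        KafkaProducer<String, String> producer = new KafkaProducer<>(pp);
        AtomicLong seq = new AtomicLong();
        Random random = new Random();

        // Many threads share one KafkaProducer; each message's value is its intended topic.
        for (int t = 0; t < 16; t++) {
            new Thread(() -> {
                while (true) {
                    String topic = TOPICS.get(random.nextInt(TOPICS.size()));
                    ProducerRecord<String, String> record = new ProducerRecord<>(topic, null, topic);
                    record.headers().add("seq", Long.toString(seq.incrementAndGet()).getBytes());
                    producer.send(record);
                }
            }).start();
        }

        Properties cp = new Properties();
        cp.put("bootstrap.servers", "localhost:9092");
        cp.put("group.id", "misrouting-checker");
        cp.put("key.deserializer", StringDeserializer.class.getName());
        cp.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp)) {
            consumer.subscribe(TOPICS);
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                    // The value should always equal the topic the record arrived on.
                    if (!rec.topic().equals(rec.value()))
                        System.out.printf("MISROUTED: consumed from %s but value says %s (offset %d)%n",
                                rec.topic(), rec.value(), rec.offset());
                }
            }
        }
    }
}
{code}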

I had tried with 1 broker first, never got any mismatches, expanded to 5, still 
none, started semi-randomly burning CPU on the machines running the producer as 
well as adding scripted restarts of the brokers, and finally I was able to 
reproduce it: messages clearly ending up on the wrong topic, verified by the 
live consumer as well as with the console consumer.
h2. Narrowing it down

I had this cluster where it was happening with some regularity - sometimes a 
few hours would go by with no repro, and then there would be a few in close 
succession, but it was happening steadily. So I started testing different 
client versions to try to narrow it down, and I found it happens with the 
client version 2.8.0 or later, but not 2.7.2 or earlier.

Unfortunately there are hundreds of commits between those two client versions. 
So I switched from the released clients to a checkout of the code so I could 
run different commits (and fixed some build issues so that these old commits 
could still build). Binary searching them is inefficient because, as I said, 
sometimes hours could go by with no repro even though one might come later. I'm 
sure there's a cool stats technique to calculate the optimal way of doing it, 
but I just tried a few commits spread widely apart, started from the oldest one 
that had a repro, and during the day skipped small batches and waited for a 
repro, then during the night skipped way ahead. Eventually, I found the 
culprit: commit 
[30bc21ca35b165f04c472b4ce794893843809ccc|https://github.com/apache/kafka/commit/30bc21ca35b165f04c472b4ce794893843809ccc],
 which is a change to the client to "Replace Produce request/response with 
automated protocol".

Next problem: it's a 1000-line diff. First I read through it to see if I could 
spot the problem; no luck. 

I started modifying the files in the commit, slowly undoing the changes to try 
to get the minimal change that still repro'd. 

In that commit there's a change to the protocol file ProduceRequest.json which 
switches from holding the records as bytes to holding them as record objects. I 
only narrowed it down to around a 125-line diff because, to make that change, 
there are a whole bunch of other changes that have to happen in ProduceRequest 
and Sender to get it to compile and run. Even after poring over the generated 
serialization code (including a suspicious cache), there didn't seem to be any 
bug.

During this time, I set up 2 more clusters to try to make it easier to 
investigate, but it didn't work: several days in, neither new cluster had a 
repro. I was left with one special cluster that would repro often, a 2nd 
cluster that ended up with 1 repro in about a week, and a 3rd cluster that 
never had any repro.

I wasn't getting anywhere with the code, so I captured network traffic during 
one of the repros. With that, what I saw was the producer sending the right 
data to the broker, that broker replicating the right data out to followers and 
sending it to the consumer, but then also a broker sending out that same 
message batch but on the wrong topic to followers and the consumer (even though 
the client never sent bad data). So now it seemed like, even though this bug 
only happened with some client versions and not others, the bug was on the 
broker itself and the client was merely triggering it through different 
behavior. Around this time there were some 
infrastructure changes and my test machines were terminated (they were on a 
sandbox/test account), which was very discouraging since I lost the special 
cluster which would repro the problem often. Plus all that time I spent 
studying client code seemed to be wasted. Then I also got busy at work, being 
pulled in to help with other things.
h2. Trying again

Fast forward to now, I'm switching jobs and don't know when I'll have time to 
investigate more, so I had to post the process I used to repro it and share 
what I know before I turn in my laptop. I couldn't help but look over my notes 
again and take one more shot at figuring it out. I set up another cluster, and 
I had no repros for 10 days despite trying some more tricks like adding packet 
loss / network latency, but suddenly it started reproing often. And in the time 
when it wasn't happening, I dug into the packet dump I had from before and 
found something: I had been using Wireshark's built-in Kafka message parsing 
when searching through the traffic, but when I switched to plain "tcp 
contains ..." searches instead, I found a packet from the client to the broker 
containing the bad publish to the wrong topic. It was malformed/corrupted, 
which was why it didn't parse properly, so I deserialized it by hand. In it I 
found a 
produce request to one topic, several batches of messages to be produced to 
that topic, but at the start of the message data overwriting some of those 
correct produces was a batch of messages destined for a different topic. This 
showed definitively that it was the client sending bad data. From the place in 
the message that this bad data was overwriting the good data, it seemed likely 
that it was an entire MemoryRecords object that was being serialized to the 
wrong place.
h2. The problem

Now that I had an idea where the problem was happening, with some more code 
analysis and trying things out, I found the problem was in the usage of 
BufferPool - one of the (many) places Kafka does manual memory management. This 
BufferPool is used on the client side to allocate ByteBuffers, and as an 
optimization, if you request ByteBuffers that are the size of a batch it will 
reuse them.
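
To make the reuse behavior concrete, here's a heavily simplified sketch of what the pool does conceptually (this is not the real BufferPool class - the real one also enforces a total memory limit and blocks allocating threads when memory is exhausted - just the shape of the behavior that matters here):

{code:java}
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified sketch: buffers of exactly poolableSize (i.e. batch.size) are cached and handed
// back out on the next allocate(); other sizes are freshly allocated and left to the GC.
class SimplifiedBufferPool {
    private final int poolableSize;
    private final Deque<ByteBuffer> free = new ArrayDeque<>();

    SimplifiedBufferPool(int poolableSize) {
        this.poolableSize = poolableSize;
    }

    synchronized ByteBuffer allocate(int size) {
        if (size == poolableSize && !free.isEmpty())
            return free.pollFirst(); // reuse: the same underlying memory a previous batch lived in
        return ByteBuffer.allocate(size);
    }

    synchronized void deallocate(ByteBuffer buffer) {
        if (buffer.capacity() == poolableSize) {
            buffer.clear();
            free.addFirst(buffer); // nothing checks whether someone still holds a duplicate() of it
        }
    }
}
{code}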

In the normal path, what happens is we grab a buffer from the pool, serialize 
some records to it, the sender thread duplicates the ByteBuffer (which just 
gives another pointer to the same underlying memory), writes those bytes over 
the wire to the broker, and then we return the buffer to the pool.

I'm still not 100% sure of all the ways this can happen (racing against time 
due to the job switch, normally I'd like to understand even more first), but 
what happens is that we grab a ByteBuffer from the pool, write our records to 
it, dupe it, start to send it over the wire, and then <something happens - in 
the cases I’ve seen it’s the client calling failBatch and ultimately failing 
the original correct publish with “Expiring xx record(s) for 
topic-whatever-0:yyyy ms has passed since batch creation”, while the result for 
the publish that gets overwritten is “The server disconnected before a response 
was received.” - but there could be other ways>. The key is that the buffer 
gets returned to the pool even though the sender thread still has a reference 
to the underlying memory via the duplicate() call. Then another thread doing a 
publish gets handed the same ByteBuffer by the pool and starts to serialize its 
records into it, and then the original write over the network from earlier goes 
out - but now with some of its data overwritten, and that's how we get this 
misrouted publish.
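
The JDK detail that makes this possible is that ByteBuffer.duplicate() shares the backing memory with the original (only position/limit/mark are independent), so a later write through the "freed" buffer is visible through the duplicate the sender is still draining to the socket. A tiny standalone demo of that aliasing (plain JDK, no Kafka code):

{code:java}
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class DuplicateAliasingDemo {
    public static void main(String[] args) {
        ByteBuffer pooled = ByteBuffer.allocate(32);
        pooled.put("records for topicA".getBytes(StandardCharsets.UTF_8));
        pooled.flip();

        // What the sender thread holds while writing to the socket: same memory, independent position.
        ByteBuffer inFlight = pooled.duplicate();

        // The buffer gets "freed" and handed to another producing thread, which writes new records into it.
        pooled.clear();
        pooled.put("records for topicB".getBytes(StandardCharsets.UTF_8));

        // The in-flight duplicate now drains topicB's records even though it was built from topicA's batch.
        byte[] wire = new byte[inFlight.remaining()];
        inFlight.get(wire);
        System.out.println(new String(wire, StandardCharsets.UTF_8)); // prints "records for topicB"
    }
}
{code}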

Two other things to note are (1) that because the message checksum doesn't 
include the topic of the message, we're not protected by the checksum here. And 
(2) that when we serialize the data, we write out the topic name and partition 
in one chunk in the request header, and then the record data in another buffer. 
It's this buffer for the records that can get overwritten with records that 
were destined for another topic.

Here's the flow, with the before/after difference that explains why this bug 
only occurs with that change:

[same] in RecordAccumulator, called by KafkaProducer.send (called by the end 
user), we get a ByteBuffer from the pool and pass it to MemoryRecordsBuilder

[same] in the Sender thread, in Sender.runOnce/sendProducerData we call 
RecordAccumulator.drain, which creates a MemoryRecords instance using that 
pooled ByteBuffer

[same] in the Sender thread, later on in NetworkClient.doSend we call 
ProduceRequest.toSend to turn it into a Send instance that we can write over 
the network

[before] in the old way, ProduceRequest.toSend called the parent class 
AbstractRequest.toSend, which first calls RequestUtils.serialize, which 
allocates a fresh buffer with ByteBuffer.allocate and serializes the whole 
request into it. It is this fresh buffer that ends up being written over the 
network via a ByteBufferSend, and the original pooled ByteBuffer in 
MemoryRecords is no longer used.

[after] in the new way, ProduceRequest.toSend has its own implementation 
(instead of using the parent class one), which calls 
SendBuilder.buildRequestSend and ends up creating a MultiRecordsSend that 
contains a DefaultRecordsSend that contains the MemoryRecords, whose data is 
stored in a pooled ByteBuffer. This MultiRecordsSend is what gets told to write 
out to the network, which ends up writing that pooled ByteBuffer, and because 
the buffer can be reused before that write occurs, this is where the corruption 
happens.
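
In very rough sketch form, the shape of that difference (heavily simplified, with made-up method signatures - not the real AbstractRequest/SendBuilder code):

{code:java}
import java.nio.ByteBuffer;

// Heavily simplified sketch of the before/after difference; names and signatures are illustrative only.
class ToSendShapes {

    // [before] the whole request, records included, was copied into a freshly allocated buffer,
    // and that fresh buffer is what got written to the socket (via something like ByteBufferSend).
    // The pooled records buffer could then be freed and reused safely, since its bytes were already copied.
    static ByteBuffer oldStyle(ByteBuffer header, ByteBuffer pooledRecords) {
        ByteBuffer fresh = ByteBuffer.allocate(header.remaining() + pooledRecords.remaining());
        fresh.put(header.duplicate());
        fresh.put(pooledRecords.duplicate()); // the copy happens here, before any possible reuse
        fresh.flip();
        return fresh;
    }

    // [after] only the header is serialized into a new buffer; the records go out by reference,
    // straight from the pooled buffer (via something like MultiRecordsSend/DefaultRecordsSend).
    // If the pool hands that buffer to another batch before the socket write completes,
    // the bytes on the wire are whatever the new owner wrote.
    static ByteBuffer[] newStyle(ByteBuffer header, ByteBuffer pooledRecords) {
        return new ByteBuffer[] { header, pooledRecords.duplicate() };
    }
}
{code}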
h2. What happens with the overwritten messages?

Because this only happens (as far as I know) in cases where the original 
publish had some failure, the client would have been given that failure result 
so we are not losing acknowledged writes. Instead, we end up with these 
misrouted publishes (which are actually duplicated) with one publish on the 
intended topic and one on a random recently used topic.
h2. Fixing it

There are a variety of fixes that could work. The simplest is a one-line 
change: in the BufferPool constructor, set poolableSize to 0 to essentially 
disable pooling of ByteBuffers so we get a fresh one each time. I've tested 
this extensively and it works - with that one-line change the problem never 
occurs, and I can swap back and forth between disabling the pool (no repros for 
hours/days) and enabling it (repros come back), over and over. This is 
absolutely the problem, though this fix does have a performance impact: the 
vast majority of allocations the BufferPool handles are for the poolable buffer 
size, and it is significantly faster to reuse a ByteBuffer than to allocate a 
fresh one, though the impact on overall performance is lower than 
microbenchmarks suggest.

Another approach, which is my current suggestion, is to change Sender so that 
when we deallocate/free the pooled ByteBuffer for a failed batch, we do not 
return it to the pool and instead let the JVM handle the lifecycle of that 
ByteBuffer and GC it once it's no longer in use. We still need to tell the 
BufferPool that we are done with the ByteBuffer so it can track memory 
usage/limits. (We're loosening the guarantees of that limit a bit - we're 
saying we're done with the memory when a reference is actually still held and 
it might still be going out over the network - but that window ought to be very 
short-lived, so I don't think it's a significant concern; if it is, I'd still 
likely suggest this approach but with some short delay before letting 
BufferPool count on that memory again.) With this approach, we keep the very 
fast allocation speed of pooled ByteBuffers and only pay the penalty of not 
being able to reuse them in fairly rare error cases.
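
As a sketch of that suggestion, building on the simplified pool above (again illustrative - the real BufferPool's free list, available-memory accounting and waiting allocators are more involved, and releaseWithoutReuse is a made-up name):

{code:java}
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the suggested fix: a failed/expired batch releases its memory accounting but its
// ByteBuffer is never put back on the free list, so an in-flight duplicate() that is still being
// written to the socket can't be handed to (and overwritten by) another batch.
class PoolWithSafeRelease {
    private final int poolableSize;
    private long availableMemory; // very loose stand-in for the real pool's accounting
    private final Deque<ByteBuffer> free = new ArrayDeque<>();

    PoolWithSafeRelease(long totalMemory, int poolableSize) {
        this.availableMemory = totalMemory;
        this.poolableSize = poolableSize;
    }

    synchronized ByteBuffer allocate(int size) {
        availableMemory -= size;
        if (size == poolableSize && !free.isEmpty())
            return free.pollFirst();
        return ByteBuffer.allocate(size);
    }

    // Normal completion: the send finished, so the buffer is safe to reuse.
    synchronized void deallocate(ByteBuffer buffer) {
        availableMemory += buffer.capacity();
        if (buffer.capacity() == poolableSize) {
            buffer.clear();
            free.addFirst(buffer);
        }
    }

    // Failed/expired batch: the sender may still hold a duplicate of this buffer, so only give the
    // memory back for accounting purposes and let the GC reclaim the buffer once the send is done.
    synchronized void releaseWithoutReuse(int size) {
        availableMemory += size;
    }
}
{code}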
h2. Raw notes

I’ve attached a text file with some very rough notes on the process for 
reproducing it as well as (again, very rough) notes taken while debugging 
including the corrupted message I captured.
h2. Summary

Client versions 2.8.0 and later are affected by a 
[change|https://github.com/apache/kafka/commit/30bc21ca35b165f04c472b4ce794893843809ccc]
 that exposes a latent bug in how BufferPool is used. (BufferPool is a class 
used on the client side to allocate memory in ByteBuffers; for performance it 
reuses them, with the caller doing manual memory management by calling free 
when it is done with the memory.) The bug is that a pooled ByteBuffer can be 
freed while it is still in use by the network sending thread - this early 
freeing can happen when batches expire or brokers disconnect from clients. The 
bug has existed for more than a decade (since Kafka 0.x, it seems), but never 
manifested because prior to 2.8.0 the pooled ByteBuffer (which contained record 
data, i.e. your publishes) was copied into a freshly allocated ByteBuffer 
before any potential reuse, and that fresh ByteBuffer was what got written over 
the network to the broker. With a change included in 2.8.0, the pooled 
ByteBuffer remains as-is inside a MemoryRecords instance, and this pooled 
ByteBuffer (which in some cases can be reused and overwritten with other data) 
is written over the network. Two contributing factors: the checksum for Kafka 
records only covers the key/value/headers/etc and not the topic, so there is no 
protection there; and, also newly in the commit that exposed the bug, the 
produce request header (which includes the topic and partition of a group of 
message batches) is serialized in a separate buffer from the messages 
themselves (the latter is what sits in the pooled ByteBuffer), which is what 
allows messages to be misrouted to a random recently used topic rather than 
just showing up as duplicates on their intended topic. I've fixed this by 
avoiding reuse of the pooled ByteBuffer in the cases where it can be 
deallocated while still in use, to keep the performance benefit of pooling in 
the normal code path, though an alternate solution would be to disable the 
pooling of ByteBuffers entirely.



> Messages ending up on the wrong topic
> -------------------------------------
>
>                 Key: KAFKA-19012
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19012
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, producer 
>    Affects Versions: 3.2.3, 3.8.1
>            Reporter: Donny Nadolny
>            Assignee: Kirk True
>            Priority: Blocker
>         Attachments: image-2025-08-06-13-34-30-830.png, rawnotes.txt
>
>
> We're experiencing messages very occasionally ending up on a different topic 
> than what they were published to. That is, we publish a message to topicA and 
> consumers of topicB see it and fail to parse it because the message contents 
> are meant for topicA. This has happened for various topics. 
> We've begun adding a header with the intended topic (which we get just by 
> reading the topic from the record that we're about to pass to the OSS client) 
> right before we call producer.send, this header shows the correct topic 
> (which also matches up with the message contents itself). Similarly we're 
> able to use this header and compare it to the actual topic to prevent 
> consuming these misrouted messages, but this is still concerning.
> Some details:
>  - This happens rarely: it happened approximately once per 10 trillion 
> messages for a few months, though there was a period of a week or so where it 
> happened more frequently (once per 1 trillion messages or so)
>  - It often happens in a small burst, eg 2 or 3 messages very close in time 
> (but from different hosts) will be misrouted
>  - It often but not always coincides with some sort of event in the cluster 
> (a broker restarting or being replaced, network issues causing errors, etc). 
> Also these cluster events happen quite often with no misrouted messages
>  - We run many clusters, it has happened for several of them
>  - There is no pattern between intended and actual topic, other than the 
> intended topic tends to be higher volume ones (but I'd attribute that to 
> there being more messages published -> more occurrences affecting it rather 
> than it being more likely per-message)
>  - It only occurs with clients that are using a non-zero linger
>  - Once it happened with two sequential messages, both were intended for 
> topicA but both ended up on topicB, published by the same host (presumably 
> within the same linger batch)
>  - Most of our clients are 3.2.3 and it has only affected those, most of our 
> brokers are 3.2.3 but it has also happened with a cluster that's running 
> 3.8.1 (but I suspect a client rather than broker problem because of it never 
> happening with clients that use 0 linger)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
