Donny Nadolny created KAFKA-19012:
-------------------------------------

             Summary: Messages ending up on the wrong topic
                 Key: KAFKA-19012
                 URL: https://issues.apache.org/jira/browse/KAFKA-19012
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 3.8.1, 3.2.3
            Reporter: Donny Nadolny


We're experiencing messages very occasionally ending up on a different topic 
than what they were published to. That is, we publish a message to topicA and 
consumers of topicB see it and fail to parse it because the message contents 
are meant for topicA. This has happened for various topics. 

We've begun adding a header with the intended topic (which we get just by 
reading the topic from the record that we're about to pass to the OSS client) 
right before we call producer.send. This header shows the correct topic (which 
also matches the message contents). On the consumer side we compare this 
header to the actual topic so that we can skip these misrouted messages, but 
this is still concerning.
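
A minimal sketch of the header check described above, kept to plain Java so it 
runs without a broker. The header name "intended-topic" and the class and 
method names are hypothetical, not what we actually use; the comments show 
where the calls would sit relative to producer.send and the consumer loop.

```java
import java.nio.charset.StandardCharsets;

public class IntendedTopicHeader {
    // Hypothetical header key; the real key is not stated in this report.
    public static final String HEADER = "intended-topic";

    // Producer side: value to attach right before producer.send(record), e.g.
    //   record.headers().add(HEADER, encode(record.topic()));
    public static byte[] encode(String topic) {
        return topic.getBytes(StandardCharsets.UTF_8);
    }

    // Consumer side: true if the header is present and disagrees with the
    // topic the record was actually read from, e.g.
    //   Header h = record.headers().lastHeader(HEADER);
    //   isMisrouted(record.topic(), h == null ? null : h.value());
    public static boolean isMisrouted(String actualTopic, byte[] headerValue) {
        if (headerValue == null) {
            return false; // no header to compare against (assumed policy)
        }
        String intended = new String(headerValue, StandardCharsets.UTF_8);
        return !intended.equals(actualTopic);
    }

    public static void main(String[] args) {
        byte[] header = encode("topicA");
        System.out.println(isMisrouted("topicA", header)); // false
        System.out.println(isMisrouted("topicB", header)); // true
    }
}
```

Treating a missing header as "not misrouted" is an assumption made here so that 
records from producers without the header are still consumed.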

Some details:
 - This happens rarely: it happened approximately once per 10 trillion messages 
for a few months, though there was a period of a week or so where it happened 
more frequently (once per 1 trillion messages or so)
 - It often happens in a small burst, e.g. 2 or 3 messages very close in time 
(but from different hosts) will be misrouted
 - It often but not always coincides with some sort of event in the cluster (a 
broker restarting or being replaced, network issues causing errors, etc). 
These cluster events also happen quite often with no misrouted messages
 - We run many clusters, and it has happened on several of them
 - There is no pattern between intended and actual topic, other than that the 
intended topic tends to be a higher volume one (but I'd attribute that to there 
being more messages published -> more occurrences affecting it rather than it 
being more likely per-message)
 - It only occurs with clients that are using a non-zero linger (linger.ms)
 - Once it happened with two sequential messages: both were intended for topicA 
but ended up on topicB, and both were published by the same host (presumably 
within the same linger batch)
 - Most of our clients are 3.2.3 and it has only affected those. Most of our 
brokers are 3.2.3, but it has also happened with a cluster that's running 3.8.1 
(I suspect a client rather than broker problem, because it has never happened 
with clients that use 0 linger)
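
For reference, the two kinds of client mentioned above differ only in the 
producer's linger.ms setting. The sketch below builds both configurations with 
plain java.util.Properties. The broker address and the non-zero value of 5 ms 
are placeholders, not our actual settings, and the class name is hypothetical.

```java
import java.util.Properties;

public class LingerConfigs {
    // Builds producer properties that differ only in linger.ms.
    // A real producer would also need key.serializer and value.serializer.
    public static Properties producerProps(long lingerMs) {
        Properties p = new Properties();
        p.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
        p.setProperty("linger.ms", Long.toString(lingerMs));
        return p;
    }

    public static void main(String[] args) {
        Properties unaffected = producerProps(0); // never seen misrouting
        Properties affected = producerProps(5);   // placeholder non-zero linger
        System.out.println(unaffected.getProperty("linger.ms"));
        System.out.println(affected.getProperty("linger.ms"));
    }
}
```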



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
