[ https://issues.apache.org/jira/browse/KAFKA-7626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687330#comment-16687330 ]
Noam Berman commented on KAFKA-7626:
------------------------------------

I set the Kafka consumer/producer logs to TRACE so we can get a better understanding of the issue. Another run of the test hit a similar problem (same configs for all consumers/producers/brokers as stated in the bug description). The attached file partial.log is an excerpt from the attached full log; it shows the relevant log entries for this scenario:
* consumer-1 is assigned to all partitions at the beginning. We start by bringing down one broker. The message "3" is sent to topic0 (key="3", value="3", offset=4).
* consumer-1 starts a transaction with test-producer-1, produces a message to topic1 (just echoing the message, same key/value), and then gets stuck for 15 seconds on the AddOffsetsToTxnRequest. Meanwhile the broker finishes going down and comes back up.
* The consumers are configured with *max.poll.interval.ms=5000*, so the group rebalances after 5 seconds while consumer-1 is stuck without polling.
* consumer-4 now gets partition 3, and it receives the message on offset 4, since the earlier transaction was never completed. It sends the message to topic1 and commits the offset and the transaction using test-producer-2.
* At this point offset 4 has been handled once.
* A few seconds later, test-producer-1 receives an error AddOffsetsToTxnResponse, followed by a successful AddOffsetsToTxnResponse on retry.
* Since no exception is raised, it proceeds to complete the initial transaction, and succeeds. At this point [topic0 partition 3 offset 4] has been handled twice, breaking exactly-once semantics.
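The per-message flow in the bullets above is the standard consume-transform-produce transaction pattern. A minimal sketch with a stub producer (hypothetical names; the real Java client uses the same call sequence: beginTransaction, send, sendOffsetsToTransaction, commitTransaction):

```python
# Sketch of the consume-transform-produce pattern, using a stub producer
# (hypothetical class; not the Kafka client API, but the same call order).

class StubProducer:
    def __init__(self):
        self.calls = []  # records the order of transactional calls

    def begin_transaction(self):
        self.calls.append("begin")

    def send(self, topic, key, value):
        self.calls.append(("send", topic, key, value))

    def send_offsets_to_transaction(self, offsets, group_id):
        # Offsets are committed inside the transaction, not via the consumer,
        # so the echoed message and the offset commit are atomic together.
        self.calls.append(("offsets", offsets, group_id))

    def commit_transaction(self):
        self.calls.append("commit")

def process(record, producer, group_id):
    producer.begin_transaction()
    # Echo the record from topic0 to topic1 inside the transaction.
    producer.send("topic1", record["key"], record["value"])
    # Add offset o+1 for the source partition to the same transaction
    # (this is the step that issues AddOffsetsToTxnRequest).
    producer.send_offsets_to_transaction(
        {("topic0", record["partition"]): record["offset"] + 1}, group_id)
    # Commit; only now does the echo become visible to read_committed readers.
    producer.commit_transaction()

p = StubProducer()
process({"key": "3", "value": "3", "partition": 3, "offset": 4}, p, "test-group")
print(p.calls[0], p.calls[-1])  # -> begin commit
```

The window for the bug opens between `send_offsets_to_transaction` and `commit_transaction`: if the producer stalls there past the poll deadline, the group can rebalance while the transaction is still open.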
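The timeline itself can be replayed as a toy simulation (plain Python, hypothetical model; producer names taken from the log, nothing here is the Kafka client API):

```python
# Toy replay of the race above: count how many times the record at
# topic0 / partition 3 / offset 4 is echoed into topic1.

from collections import Counter

echoes = Counter()

def echo(producer, offset):
    """A consumer echoes the record at `offset` to topic1 in a transaction."""
    echoes[offset] += 1

# t=0s:  consumer-1 begins a transaction via test-producer-1, echoes
#        offset 4, then its AddOffsetsToTxnRequest hangs during the
#        broker bounce.
echo("test-producer-1", 4)
txn1_committed = False  # stuck: neither committed nor aborted yet

# t=5s:  consumer-1 misses its poll deadline and the group rebalances;
#        consumer-4 re-reads offset 4 (the first transaction never
#        committed) and commits it cleanly via test-producer-2.
echo("test-producer-2", 4)

# t=15s: the stuck request errors, is retried, and succeeds; since no
#        exception reaches the application, consumer-1 commits its
#        original transaction as well.
txn1_committed = True

assert echoes[4] == 2  # offset 4 delivered twice: exactly-once is broken
```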
> Possible duplicate message delivery with exactly-once semantics
> ---------------------------------------------------------------
>
>                 Key: KAFKA-7626
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7626
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.1.0
>         Environment: google cloud build docker; all brokers, consumers and producers running on the same container (this is a test, not production).
>            Reporter: Noam Berman
>            Priority: Major
>              Labels: exactly-once
>         Attachments: full.tar.gz, partial.log
>
> Hello,
> I've come across an issue with exactly-once processing (running kafka 1.1.0):
> In my test I bring up 3 brokers, and I start sending messages to `topicX`. While I'm sending messages, I bring up a few consumers on `topicX` one at a time (all with the same group id) - and they produce the same message to `topicY`. At some point I bring one broker down and up again, to check resiliency to failures. Eventually I assert that `topicY` contains exactly the messages sent to `topicX`.
> This usually works as expected, but when running the same test 1000s of times to check for flakiness, some runs act as follows (in this order):
> 1. Consumer `C1` owns partition `p`.
> 1a. A consumer rebalance occurs (because one of the new consumers is starting).
> 1b. Consumer `C1` is revoked and then re-assigned partition `p`.
> 2. One of the 3 brokers starts controlled shutdown.
> 3. Consumer `C1` uses a transactional producer to send a message on offset `o`.
> 4. Consumer `C1` sends offset `o+1` for partition `p` to the transaction.
> 5. Consumer `C1` successfully commits the message.
> 6. Broker controlled shutdown finishes successfully.
> ... a few seconds later ...
> 7. Another consumer rebalance occurs; Consumer `C2` is assigned partition `p`.
> 8. Consumer `C2` polls the message on offset `o` for partition `p`.
> This means the message on offset `o` is processed twice, violating exactly-once semantics.
> So it looks like during the broker restart, a commit from the transactional producer gets lost - and because a rebalance happens before another commit, we actually poll the same message again, although it was previously committed.
> The brokers are configured with:
> `transaction.state.log.min.isr=2`
> `transaction.state.log.replication.factor=3`
> `offsets.topic.replication.factor=3`
> The consumer is configured with:
> `isolation.level=read_committed`
> The original producer to `topicX` has transactional semantics, and the test shows that it didn't send duplicate messages (using the idempotent producer config).
>
> Thanks!

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)