[ https://issues.apache.org/jira/browse/KAFKA-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034878#comment-15034878 ]
Jason Gustafson commented on KAFKA-2908:
----------------------------------------

After inspecting the broker data logs, I found that there are two kinds of gaps in the consumer's output. The first are gaps from messages which were not successfully written to the server. All of the remaining gaps occur at the tail of partition 1 of the test topic. This suggests that the consumer stopped sending fetch requests for that partition for a prolonged period of time which extended to the end of the test. This could happen either because the current connection did not close cleanly (which would cause a delay as long as the request timeout), or because the consumer was unable to find the leader for the partition. I'm trying to get as much information out of the data here, but we may need to reproduce the error with more logging enabled to narrow the problem down further. The good news is that I have not found any cases where the consumer actually skipped over messages.

> Another, possibly different, Gap in Consumption after Restart
> -------------------------------------------------------------
>
>                 Key: KAFKA-2908
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2908
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: consumer
>            Reporter: Ben Stopford
>            Assignee: Jason Gustafson
>         Attachments: 2015-11-28--001.tar.gz
>
>
> *Context:*
> Instance of the rolling upgrade test. 10s sleeps have been put around node
> restarts to ensure stability when subsequent nodes go down. Consumer timeout
> has been set to 60s. Producer has been throttled to 100 messages per second.
> Failure is rare: it occurred once in 60 executions (6 executions per run #276
> -> #285 of the system_test_branch_builder).
> *Reported Failure:*
> At least one acked message did not appear in the consumed messages.
> acked_minus_consumed: 16385, 16388, 16391, 16394, 16397, 16400, 16403, 16406,
> 16409, 16412, 16415, 16418, 16421, 16424, 16427, 16430, 16433, 16436, 16439,
> 16442, ...plus 1669 more
> *Immediate Observations:*
> * The messages not consumed are all in partition 1.
> * Production and consumption continue throughout the test (there is no
> complete write or read failure as we have seen elsewhere).
> * The messages ARE present in the data files:
> e.g.
> {quote}
> Examining missing value 16385. This was written to p1, offset 5453
> => there is an entry in all data files for partition 1 for this (presumably
> meaning it was replicated correctly)
> kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files
> worker10/kafka-data-logs/test_topic-1/00000000000000000000.log | grep
> 'offset: 5453'
> worker10,9,8:
> offset: 5453 position: 165346 isvalid: true payloadsize: 5 magic: 0
> compresscodec: NoCompressionCodec crc: 1502953075 payload: 16385
> => ID 16385 is definitely present in the Kafka logs, suggesting a problem
> consumer-side
> {quote}
> These entries definitely do not appear in the consumer stdout file either.
> *Timeline:*
> {quote}
> 15:27:02,232 - Producer sends first message
> ...servers 1, 2 are restarted (clean shutdown)
> 15:29:42,718 - Server 3 shutdown complete
> 15:29:42,712 - (Controller fails over): Broker 2 starting become controller
> state transition (kafka.controller.KafkaController)
> 15:29:42,743 - New leader is 2 (LeaderChangeListener)
> 15:29:43,239 - WARN Broker 2 ignoring LeaderAndIsr request from controller 2
> with correlation id 0 epoch 7 for partition (test_topic,1) since its
> associated leader epoch 8 is old.
> Current leader epoch is 8
> (state.change.logger)
> 15:29:45,642 - Producer starts writing messages that are never consumed
> 15:30:10,804 - Last message sent by producer
> 15:31:10,983 - Consumer times out after 60s wait without messages
> {quote}
> Logs for this run are attached



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
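A side note on the WARN in the timeline: the broker ignores a LeaderAndIsr request whose leader epoch is not strictly newer than the epoch it already holds for the partition, which is why "leader epoch 8" is rejected when the current epoch is also 8. A minimal sketch of that guard (simplified for illustration; this is not Kafka's actual implementation, and the function name is made up):

```python
# Simplified illustration of the stale-epoch guard behind the WARN in the
# timeline above. NOT Kafka's real code: the broker only applies a
# LeaderAndIsr request when its leader epoch is newer than the epoch the
# broker currently knows for that partition.
def should_apply_leader_and_isr(request_epoch: int, current_epoch: int) -> bool:
    # Equal or older epochs are treated as stale and the request is ignored.
    return request_epoch > current_epoch

# Matches the log line: request carries leader epoch 8, current epoch is 8.
print(should_apply_leader_and_isr(8, 8))  # False -> request ignored
print(should_apply_leader_and_isr(9, 8))  # True  -> request applied
```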
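Also worth noting: the acked_minus_consumed IDs above (16385, 16388, 16391, ...) step by exactly 3, so they all fall in one residue class mod 3. If the test spreads messages over 3 partitions with a simple modulo/round-robin assignment (an assumption about the test harness, not confirmed in the ticket), that is exactly the pattern you'd see when only partition 1 stops being consumed. A quick check:

```python
# Quick sanity check (not from the ticket): the first few reported
# acked_minus_consumed IDs all share the same residue mod 3, consistent
# with every missing message landing in a single partition under a
# modulo/round-robin assignment over 3 partitions (an assumption about
# the test's partitioning scheme).
missing_head = [16385, 16388, 16391, 16394, 16397, 16400, 16403, 16406]

residues = {m % 3 for m in missing_head}
print(residues)  # a single residue class
```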