[ https://issues.apache.org/jira/browse/KAFKA-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13531525#comment-13531525 ]
Jay Kreps commented on KAFKA-669: --------------------------------- To close the loop on this there are multiple issues here: 1. We are not cleaning up .index files present in the directory. This is caused because of the nature of Neha's test script 2. We are not fixing this problem in recovery because, although the recovery logic is reasonably well tested, the variable passed into the log by log manager was reversed. As a result we were running recovery whenever there was a clean shutdown and not running recovery if there was an unclean shutdown. d'oh. Patch on KAFKA-673. 3. The final question was why doesn't the truncateTo logic reset the size of the corrupt index to 0? I looked into this and here is the sequence of events in truncateTo(0). First we need to translate 0 to a file position, because the log is empty this translates to null which indicates 0 is earlier than the latest offset in the log (this is true). As a result there is no truncation to do on the log. It is assumed that the index can't have entries not in the log. This assumption is violated by our having an arbitrarily corrupted index. I think this is actually a valid assumption--if there is nothing in the log then there should be nothing in the index and we can skip it. The problem was that we had a corrupt index which KAFKA-670 and KAFKA-673 fix. > Irrecoverable error on leader while rolling to a new segment > ------------------------------------------------------------ > > Key: KAFKA-669 > URL: https://issues.apache.org/jira/browse/KAFKA-669 > Project: Kafka > Issue Type: Bug > Affects Versions: 0.8 > Reporter: Neha Narkhede > Priority: Blocker > Labels: bugs > Fix For: 0.8 > > > I setup a test cluster of 5 brokers and setup migration tool to send data for > ~1000 partitions to this cluster. I see a bunch of > FailedToSendMessageExceptions on the producer. After tracing some of these > failed requests, I found that the broker sends a ProducerResponseStatus(-1, > -1) to the producer. The impact is the producer marks the messages for those > partitions as failed, retries and eventually gives up. This seems like a bug > in the truncation logic. I traced the state changes topic SpamReportEvent and > seems like first it was the follower, so it truncated its log. Then it became > the leader and got into an irrecoverable error state due to log rolling. > [2012-12-12 01:47:01,951] INFO [Log Manager on Broker 3] Loading log > 'SpamReportEvent-0' (kafka.log.LogManager) > [2012-12-12 01:47:01,951] INFO Creating log segment > /mnt/u001/kafka_08_long_running_test/kafka-logs/SpamReportEvent-0/00000000000000000000.log > (kafka.log.FileMessageSet) > [2012-12-12 01:47:01,951] INFO Created index file > /mnt/u001/kafka_08_long_running_test/kafka-logs/SpamReportEvent-0/00000000000000000000.index > with maxEntries = 1310720, maxIndexSize = 10485760, entries = 1310720, > lastOffset = 0 (kafka.log.OffsetIndex) > [2012-12-12 01:47:02,949] INFO Replica Manager on Broker 3: Starting the > follower state transition to follow leader 4 for topic SpamReportEvent > partition 0 (kafka.server.ReplicaManager) > [2012-12-12 01:47:02,950] INFO [Kafka Log on Broker 3], Truncated log segment > /mnt/u001/kafka_08_long_running_test/kafka-logs/SpamReportEvent-0/00000000000000000000.log > to target offset 0 (kafka.log.Log) > [2012-12-12 01:47:34,745] INFO Replica Manager on Broker 3: Becoming Leader > for topic [SpamReportEvent] partition [0] (kafka.server.ReplicaManager) > [2012-12-12 01:47:34,745] INFO Replica Manager on Broker 3: Completed the > leader state transition for topic SpamReportEvent partition 0 > (kafka.server.ReplicaManager) > [2012-12-12 01:47:36,071] INFO Replica Manager on Broker 3: Becoming Leader > for topic [SpamReportEvent] partition [0] (kafka.server.ReplicaManager) > [2012-12-12 01:47:36,071] INFO Replica Manager on Broker 3: Completed the > leader state transition for topic SpamReportEvent partition 0 > (kafka.server.ReplicaManager) > [2012-12-12 01:48:10,545] INFO Replica Manager on Broker 3: Becoming Leader > for topic [SpamReportEvent] partition [0] (kafka.server.ReplicaManager) > [2012-12-12 01:48:10,545] INFO Replica Manager on Broker 3: Completed the > leader state transition for topic SpamReportEvent partition 0 > (kafka.server.ReplicaManager) > [2012-12-12 02:02:08,148] ERROR [KafkaApi-3] Error processing ProducerRequest > on SpamReportEvent:0 (kafka.server.Kafka > Apis) > [2012-12-12 02:06:06,081] ERROR [KafkaApi-3] Error processing ProducerRequest > on SpamReportEvent:0 (kafka.server.KafkaApis) > kafka.common.KafkaException: Trying to roll a new log segment for topic > partition SpamReportEvent-0 with start offset 0 while it already exists > at kafka.log.Log.rollToOffset(Log.scala:465) > at kafka.log.Log.roll(Log.scala:443) > at kafka.log.Log.maybeRoll(Log.scala:431) > at kafka.log.Log.append(Log.scala:258) > at > kafka.server.KafkaApis$$anonfun$appendToLocalLog$2.apply(KafkaApis.scala:246) > at > kafka.server.KafkaApis$$anonfun$appendToLocalLog$2.apply(KafkaApis.scala:239) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206) > at scala.collection.immutable.Map$Map1.foreach(Map.scala:105) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:206) > at scala.collection.immutable.Map$Map1.map(Map.scala:93) > at kafka.server.KafkaApis.appendToLocalLog(KafkaApis.scala:239) > at kafka.server.KafkaApis.handleProducerRequest(KafkaApis.scala:179) > at kafka.server.KafkaApis.handle(KafkaApis.scala:56) > at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:41) > at java.lang.Thread.run(Thread.java:619) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira