[jira] [Created] (KAFKA-514) Replication with Leader Failure Test: Log segment files checksum mismatch

John Fung (JIRA) Fri, 14 Sep 2012 12:39:11 -0700

John Fung created KAFKA-514:
-------------------------------

             Summary: Replication with Leader Failure Test: Log segment files checksum mismatch
                 Key: KAFKA-514
                 URL: https://issues.apache.org/jira/browse/KAFKA-514
             Project: Kafka
          Issue Type: Bug
            Reporter: John Fung



Test Description:

   1. Produce and consume messages to 1 topic with 3 partitions.
   2. This test sends 10 messages every 2 sec to 3 replicas.
   3. At the end, it verifies the log size and contents, and uses a consumer 
to verify that there is no message loss.
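
The message-loss check in step 3 can be sketched roughly as follows. This is a minimal illustration, not the regression test's actual code: it assumes each message carries a unique ID that can be extracted from the producer and consumer logs, and the function name find_message_loss is hypothetical.

```python
from collections import Counter


def find_message_loss(produced_ids, consumed_ids):
    """Compare produced and consumed message IDs (hypothetical helper).

    Returns a (missing, duplicated) pair of sorted lists:
    missing    = IDs produced but never consumed (message loss)
    duplicated = IDs consumed more often than they were produced
    """
    produced = Counter(produced_ids)
    consumed = Counter(consumed_ids)
    # Counter subtraction keeps only positive counts.
    missing = sorted((produced - consumed).elements())
    duplicated = sorted((consumed - produced).elements())
    return missing, duplicated


# Made-up example data: message 8 is lost, message 10 is delivered twice.
produced = list(range(1, 11))
consumed = [1, 2, 3, 4, 5, 6, 7, 9, 10, 10]
missing, duplicated = find_message_loss(produced, consumed)
print("missing:", missing)
print("duplicated:", duplicated)
```

An empty "missing" list corresponds to the "no message loss" condition described above.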

The issue:
When the leader is terminated by a controlled failure (kill -15), the resulting 
log segment file sizes do not all match. The size mismatch occurs in one of 
the partitions of the terminated broker. This is consistently reproducible in 
the system regression test for replication with the following configuration:

    * zookeeper: 1-node (local)
    * brokers: 3-node cluster (all local)
    * replication factor: 3
    * no. of topics: 1
    * no. of partitions: 2
    * iterations of leader failure: 1

Remarks:

    * It is rarely reproducible if the no. of partitions is 1.
    * Even though the file checksums do not match, the no. of messages in the 
producer & consumer logs is equal.


Test result (shown with log file checksum):

broker-1 :
test_1-0/00000000000000000000.kafka => 1690639555
test_1-1/00000000000000000000.kafka => 4068655384    <<<< not matching across all replicas

broker-2 :
test_1-0/00000000000000000000.kafka => 1690639555
test_1-1/00000000000000000000.kafka => 4068655384    <<<< not matching across all replicas

broker-3 :
test_1-0/00000000000000000000.kafka => 1690639555
test_1-1/00000000000000000000.kafka => 3530842923    <<<< not matching across all replicas
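
For illustration, a comparison like the one above could be scripted as follows. This is only a sketch: the helper names crc32_of_file and mismatched_segments are hypothetical, and zlib.crc32 is an assumption that may not be the checksum algorithm the regression test uses. The example data is copied from the test result above.

```python
import zlib
from collections import defaultdict


def crc32_of_file(path, chunk_size=1 << 20):
    """CRC-32 of a file, read in chunks (assumed checksum; the test's may differ)."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def mismatched_segments(checksums_by_broker):
    """Return {segment: {broker: checksum}} for segments whose checksums differ."""
    by_segment = defaultdict(dict)
    for broker, segments in checksums_by_broker.items():
        for segment, checksum in segments.items():
            by_segment[segment][broker] = checksum
    return {
        segment: per_broker
        for segment, per_broker in by_segment.items()
        if len(set(per_broker.values())) > 1
    }


# Checksums as reported in the test result above.
reported = {
    "broker-1": {"test_1-0/00000000000000000000.kafka": 1690639555,
                 "test_1-1/00000000000000000000.kafka": 4068655384},
    "broker-2": {"test_1-0/00000000000000000000.kafka": 1690639555,
                 "test_1-1/00000000000000000000.kafka": 4068655384},
    "broker-3": {"test_1-0/00000000000000000000.kafka": 1690639555,
                 "test_1-1/00000000000000000000.kafka": 3530842923},
}

result = mismatched_segments(reported)
for segment, per_broker in result.items():
    print(segment, per_broker)
```

With the reported values, only the test_1-1 segment is flagged, and broker-3 is the replica whose checksum differs from the other two.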

Errors:
The following error is found in the terminated leader:

[2012-09-14 11:07:05,217] WARN No previously checkpointed highwatermark value found for topic test_1 partition 1. Returning 0 as the highwatermark (kafka.server.HighwaterMarkCheckpoint)
[2012-09-14 11:07:05,220] ERROR Replica Manager on Broker 3: Error processing leaderAndISR request LeaderAndIsrRequest(1,,true,1000,Map((test_1,1) -> { "ISR": "1,2","leader": "1","leaderEpoch": "0" }, (test_1,0) -> { "ISR": "1,2","leader": "1","leaderEpoch": "1" })) (kafka.server.ReplicaManager)
kafka.common.KafkaException: End index must be segment list size - 1
        at kafka.log.SegmentList.truncLast(SegmentList.scala:82)
        at kafka.log.Log.truncateTo(Log.scala:471)
        at kafka.cluster.Partition.makeFollower(Partition.scala:171)
        at kafka.cluster.Partition.makeLeaderOrFollower(Partition.scala:126)
        at kafka.server.ReplicaManager.kafka$server$ReplicaManager$$makeFollower(ReplicaManager.scala:195)
        at kafka.server.ReplicaManager$$anonfun$becomeLeaderOrFollower$2.apply(ReplicaManager.scala:154)
        at kafka.server.ReplicaManager$$anonfun$becomeLeaderOrFollower$2.apply(ReplicaManager.scala:144)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:80)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:80)
        at scala.collection.Iterator$class.foreach(Iterator.scala:631)
        at scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:161)
        at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:194)
        at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
        at scala.collection.mutable.HashMap.foreach(HashMap.scala:80)
        at kafka.server.ReplicaManager.becomeLeaderOrFollower(ReplicaManager.scala:144)
        at kafka.server.KafkaApis.handleLeaderAndISRRequest(KafkaApis.scala:73)
        at kafka.server.KafkaApis.handle(KafkaApis.scala:60)
        at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:40)
        at java.lang.Thread.run(Thread.java:662)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
