Ryan Berdeen created KAFKA-1670:
-----------------------------------

             Summary: Corrupt log files for segment.bytes values close to 
Int.MaxInt
                 Key: KAFKA-1670
                 URL: https://issues.apache.org/jira/browse/KAFKA-1670
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 0.8.1.1
            Reporter: Ryan Berdeen
            Priority: Blocker


The maximum value for the topic-level config {{segment.bytes}} is 
{{Int.MaxValue}} (2147483647). *Using this value causes brokers to corrupt their 
log files, leaving them unreadable.*

We set {{segment.bytes}} to {{2122317824}}, which is below the maximum (by 
about 24 MiB). One by one, the ISR of every partition shrank to 1. Brokers 
crashed on restart while attempting to read from a negative offset in a log 
file. After discovering that many segment files had grown to 4GB or more, we 
were forced to shut down our *entire production Kafka cluster* for several 
hours while we split all segment files into 1GB chunks.

In the {{kafka.log}} code, the {{segment.bytes}} parameter is used 
inconsistently. It is treated as a *soft* maximum for the size of the segment 
file 
(https://github.com/apache/kafka/blob/0.8.1.1/core/src/main/scala/kafka/log/LogConfig.scala#L26)
 with logs rolled only after 
(https://github.com/apache/kafka/blob/0.8.1.1/core/src/main/scala/kafka/log/Log.scala#L246)
 they exceed this value. However, much of the code that deals with log files 
uses *ints* to store the size of the file and the position in the file. 
Overflow of these ints leads the broker to append to the segments indefinitely, 
and to fail to read these segments for consuming or recovery.
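
The failure mode described above can be sketched in a few lines. This is an illustration only, not code from the Kafka source tree: the class and method names are hypothetical, the roll check is a simplified stand-in for the one linked above, and the 32 MiB append size is an assumption chosen so that a single append crosses {{Int.MaxValue}}. Java is used because its {{int}} has the same 32-bit wrapping behaviour as Scala's {{Int}}.

```java
// Illustrative sketch only; not code from the Kafka source tree.
class SegmentOverflowSketch {
    // Hypothetical roll check with the segment size tracked as a 32-bit int.
    static boolean shouldRollInt(int trackedSize, int segmentBytes) {
        return trackedSize > segmentBytes;
    }

    // The same check with the size tracked as a 64-bit long.
    static boolean shouldRollLong(long trackedSize, int segmentBytes) {
        return trackedSize > segmentBytes;
    }

    // 32-bit addition wraps silently on the JVM once the sum exceeds Integer.MAX_VALUE.
    static int appendInt(int trackedSize, int appendBytes) {
        return trackedSize + appendBytes;
    }

    // 64-bit addition has ample room for any realistic file size.
    static long appendLong(long trackedSize, int appendBytes) {
        return trackedSize + appendBytes;
    }

    public static void main(String[] args) {
        int segmentBytes = 2122317824;       // the value from this report
        int appendBytes = 32 * 1024 * 1024;  // assumed size of one append (32 MiB)

        // Headroom between the configured limit and the largest int.
        System.out.println("headroom: " + (Integer.MAX_VALUE - segmentBytes));

        // Start from a segment exactly at the limit, which the check does not roll yet.
        int intSize = appendInt(segmentBytes, appendBytes);
        long longSize = appendLong(segmentBytes, appendBytes);

        System.out.println("int size:  " + intSize
                + ", roll? " + shouldRollInt(intSize, segmentBytes));
        System.out.println("long size: " + longSize
                + ", roll? " + shouldRollLong(longSize, segmentBytes));
    }
}
```

With these inputs the int-tracked size wraps to a negative number, so the "exceeds the limit" check is false and the segment keeps growing, while the long-tracked size is 2155872256 and triggers a roll. A negative size is also consistent with the negative-offset reads seen on restart.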

This is trivial to reproduce:

{code}
$ bin/kafka-topics.sh --topic segment-bytes-test --create \
    --replication-factor 2 --partitions 1 --zookeeper zkhost:2181
$ bin/kafka-topics.sh --topic segment-bytes-test --alter \
    --config segment.bytes=2147483647 --zookeeper zkhost:2181
$ yes "Int.MaxValue is a ridiculous bound on file size in 2014" \
    | bin/kafka-console-producer.sh --broker-list localhost:6667 zkhost:2181 \
    --topic segment-bytes-test
{code}

After running for a few minutes, the log file is corrupt:

{code}
$ ls -lh data/segment-bytes-test-0/
total 9.7G
-rw-r--r-- 1 root root  10M Oct  3 19:39 00000000000000000000.index
-rw-r--r-- 1 root root 9.7G Oct  3 19:39 00000000000000000000.log
{code}
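
To check whether other partitions are affected, a directory scan along the following lines may help. It is a rough sketch under stated assumptions: the class and method names are hypothetical, the default {{data}} directory simply mirrors the listing above, and any {{.log}} file larger than {{Int.MaxValue}} bytes is treated as suspect.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative sketch only; not part of Kafka or its tooling.
class OversizedSegmentScan {
    // Returns every *.log file under dataDir whose size is strictly greater than limitBytes.
    static List<Path> findLargerThan(Path dataDir, long limitBytes) throws IOException {
        try (Stream<Path> paths = Files.walk(dataDir)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".log"))
                    .filter(p -> p.toFile().length() > limitBytes) // length() is a 64-bit long
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed default: the "data" directory from the listing in this report.
        Path dataDir = Paths.get(args.length > 0 ? args[0] : "data");
        if (!Files.isDirectory(dataDir)) {
            System.out.println("not a directory: " + dataDir);
            return;
        }
        // Segments larger than the largest int cannot be addressed with int positions.
        for (Path p : findLargerThan(dataDir, Integer.MAX_VALUE)) {
            System.out.println(p + " " + p.toFile().length() + " bytes");
        }
    }
}
```

The size limit is a parameter so the same helper can be pointed at a lower threshold, for example the configured {{segment.bytes}} value.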

We recovered the data from the log files using a simple Python script: 
https://gist.github.com/also/9f823d9eb9dc0a410796



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
