[ https://issues.apache.org/jira/browse/KAFKA-3240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15154532#comment-15154532 ]
Johannes Huning commented on KAFKA-3240:
----------------------------------------
Let me add some more information to this as I have investigated the issue as
well:
For one of the log files (/var/db/kafka/test6-1/00000000000000000000.log) the
output of DumpLogSegments is as follows:
{code}
$ /usr/local/kafka_2.11-0.9.0.1/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files /var/db/kafka/test6-1/00000000000000000000.log --print-data-log=true | head
Dumping /var/db/kafka/test6-1/00000000000000000000.log
Starting offset: 0
offset: 0 position: 0 isvalid: true payloadsize: 100 magic: 0 compresscodec: NoCompressionCodec crc: 3514299894 payload:
...
offset: 13649 position: 1719774 isvalid: true payloadsize: 100 magic: 0 compresscodec: NoCompressionCodec crc: 3514299894 payload:
Found 518855022 invalid bytes at the end of 00000000000000000000.log
{code}
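For reference, with payloadsize 100, magic 0 and a null key, each on-disk log entry should occupy exactly 126 bytes. A quick sketch of that arithmetic (assuming the standard magic-0 entry layout of offset, size, crc, magic, attributes, key length, value length, value):
{code}
# Expected size of one magic-0 log entry carrying a null key and a 100-byte value.
log_overhead   = 8 + 4      # offset (int64) + message size (int32)
message_header = 4 + 1 + 1  # crc (uint32) + magic (int8) + attributes (int8)
key_field      = 4          # key length (int32) is -1, so no key bytes follow
value_field    = 4 + 100    # value length (int32) + 100-byte payload

print(log_overhead + message_header + key_field + value_field)  # -> 126
print(1719774 // 13649)  # spacing implied by the positions reported above -> 126
{code}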
Consecutive positions are indeed exactly 126 bytes apart in this case. Hexdumping
the area around the end of the valid section yields:
{code}
$ dd bs=1 skip=1719522 if=/var/db/kafka/test6-1/00000000000000000000.log | hexdump -C
00000000 00 00 00 00 00 00 35 4f 00 00 00 72 d1 77 f5 f6 |......5O...r.w..|
00000010 00 00 ff ff ff ff 00 00 00 64 01 01 01 01 01 01 |.........d......|
00000020 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 |................|
*
00000070 01 01 01 01 01 01 01 01 01 01 01 01 01 01 00 00 |................|
00000080 00 00 00 00 35 50 00 00 00 72 d1 77 f5 f6 00 00 |....5P...r.w....|
00000090 ff ff ff ff 00 00 00 64 01 01 01 01 01 01 01 01 |.......d........|
000000a0 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 |................|
*
000000f0 01 01 01 01 01 01 01 01 01 01 01 01 00 00 00 00 |................|
00000100 00 00 35 51 00 00 00 72 d1 77 f5 f6 00 00 ff ff |..5Q...r.w......|
00000110 ff ff 00 00 00 64 01 01 01 01 01 01 01 01 01 01 |.....d..........|
00000120 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 |................|
*
00000170 01 01 01 01 01 01 01 01 01 01 00 00 00 00 00 00 |................|
00000180 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000310 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 01 |................|
00000320 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 |................|
*
00000370 01 01 00 00 00 00 00 00 35 56 00 00 00 72 d1 77 |........5V...r.w|
00000380 f5 f6 00 00 ff ff ff ff 00 00 00 64 01 01 01 01 |...........d....|
00000390 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 |................|
...
{code}
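Decoding the first entry of this region by hand confirms that layout (a small sketch; the bytes are copied from the hexdump above, and the field names assume the magic-0 format):
{code}
import struct

# The 26 header bytes starting at file position 1719522 (hexdump offset 00000000).
header = bytes.fromhex(
    "00 00 00 00 00 00 35 4f"  # offset field (int64, big-endian)
    "00 00 00 72"              # message size (int32)
    "d1 77 f5 f6"              # crc (uint32)
    "00"                       # magic
    "00"                       # attributes
    "ff ff ff ff"              # key length (-1 = null key)
    "00 00 00 64"              # value length (100 payload bytes follow)
)

offset, size, crc = struct.unpack_from(">qiI", header, 0)
magic, attributes, key_len, value_len = struct.unpack_from(">bbii", header, 16)
print(offset, size, crc, magic, attributes, key_len, value_len)
# -> 13647 114 3514299894 0 0 -1 100
{code}
The crc matches the 3514299894 reported by DumpLogSegments, and offset 13647 sits exactly two entries (252 bytes) before the last valid offset 13649 at position 1719774.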
The last valid message ends just before 0000017a; judging from the pattern, the
next recognizable message starts at 00000372. The bytes in between (a long run of
zeros followed by stray 0x01 payload bytes) do not form valid messages and trip up
DumpLogSegments; a small stand-alone scan for this boundary is sketched after the
annotated dump below. The questionable area (between >>> and <<<) of the log file
hence is:
{code}
00000170 01 01 01 01 01 01 01 01 01 01 >>>00 00 00 00 00 00 |................|
00000180 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000310 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 01 |................|
00000320 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 |................|
*
00000370 01 01<<< 00 00 00 00 00 00 35 56 00 00 00 72 d1 77 |........5V...r.w|
{code}
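For completeness, the boundary can also be located programmatically. A minimal stand-alone scan (roughly what FileMessageSet.searchFor and DumpLogSegments end up doing, simplified here to stop at the first non-positive size field):
{code}
import struct

LOG_OVERHEAD = 12  # 8-byte offset field + 4-byte size field per log entry

# Path taken from this report; adjust for the segment under inspection.
path = "/var/db/kafka/test6-1/00000000000000000000.log"

with open(path, "rb") as f:
    pos = 0
    while True:
        entry_header = f.read(LOG_OVERHEAD)
        if len(entry_header) < LOG_OVERHEAD:
            break  # clean end of segment, nothing suspicious found
        offset, size = struct.unpack(">qi", entry_header)
        if size <= 0:
            # A zero size is exactly what raises "Invalid message size: 0" on the broker.
            print(f"first implausible entry at position {pos} "
                  f"(offset field {offset}, size {size})")
            break
        f.seek(size, 1)  # skip the message body
        pos += LOG_OVERHEAD + size
{code}
On this segment the scan should stop at position 1719900, i.e. right where the zeroed region after the last valid message begins.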
> Replication issues on FreeBSD
> -----------------------------
>
> Key: KAFKA-3240
> URL: https://issues.apache.org/jira/browse/KAFKA-3240
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 0.9.0.0, 0.8.2.2, 0.9.0.1
> Environment: FreeBSD 10.2-RELEASE-p9
> Reporter: Jan Omar
>
> Hi,
> We are trying to replace our 3-broker cluster running on 0.6 with a new
> cluster on 0.9.0.1 (we have tried 0.8.2.2 and 0.9.0.0 as well).
> - 3 kafka nodes with one zookeeper instance on each machine
> - FreeBSD 10.2 p9
> - delayed ACKs disabled (sysctl net.inet.tcp.delayed_ack=0)
> - all kafka machines write a ZFS ZIL to a dedicated SSD
> - 3 producers on 3 machines, writing to 1 topic with 3 partitions and a
> replication factor of 3
> - acks=all
> - 10 Gigabit Ethernet, all machines on one switch, ping 0.05 ms worst case.
> While running ProducerPerformance or rdkafka_performance we are seeing very
> strange replication errors. Any hint on what's going on would be highly
> appreciated, as would any suggestion on how to debug this properly.
> This is what our broker config looks like:
> {code}
> broker.id=5
> auto.create.topics.enable=false
> delete.topic.enable=true
> listeners=PLAINTEXT://:9092
> port=9092
> host.name=kafka-five.acc
> advertised.host.name=10.5.3.18
> zookeeper.connect=zookeeper-four.acc:2181,zookeeper-five.acc:2181,zookeeper-six.acc:2181
> zookeeper.connection.timeout.ms=6000
> num.replica.fetchers=1
> replica.fetch.max.bytes=100000000
> replica.fetch.wait.max.ms=500
> replica.high.watermark.checkpoint.interval.ms=5000
> replica.socket.timeout.ms=300000
> replica.socket.receive.buffer.bytes=65536
> replica.lag.time.max.ms=1000
> min.insync.replicas=2
> controller.socket.timeout.ms=30000
> controller.message.queue.size=100
> log.dirs=/var/db/kafka
> num.partitions=8
> message.max.bytes=100000000
> auto.create.topics.enable=false
> log.index.interval.bytes=4096
> log.index.size.max.bytes=10485760
> log.retention.hours=168
> log.flush.interval.ms=10000
> log.flush.interval.messages=20000
> log.flush.scheduler.interval.ms=2000
> log.roll.hours=168
> log.retention.check.interval.ms=300000
> log.segment.bytes=536870912
> zookeeper.connection.timeout.ms=1000000
> zookeeper.sync.time.ms=5000
> num.io.threads=8
> num.network.threads=4
> socket.request.max.bytes=104857600
> socket.receive.buffer.bytes=1048576
> socket.send.buffer.bytes=1048576
> queued.max.requests=100000
> fetch.purgatory.purge.interval.requests=100
> producer.purgatory.purge.interval.requests=100
> replica.lag.max.messages=10000000
> {code}
> These are the errors we're seeing:
> {code:borderStyle=solid}
> ERROR [Replica Manager on Broker 5]: Error processing fetch operation on partition [test,0] offset 50727 (kafka.server.ReplicaManager)
> java.lang.IllegalStateException: Invalid message size: 0
> at kafka.log.FileMessageSet.searchFor(FileMessageSet.scala:141)
> at kafka.log.LogSegment.translateOffset(LogSegment.scala:105)
> at kafka.log.LogSegment.read(LogSegment.scala:126)
> at kafka.log.Log.read(Log.scala:506)
> at kafka.server.ReplicaManager$$anonfun$readFromLocalLog$1.apply(ReplicaManager.scala:536)
> at kafka.server.ReplicaManager$$anonfun$readFromLocalLog$1.apply(ReplicaManager.scala:507)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
> at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
> at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at kafka.server.ReplicaManager.readFromLocalLog(ReplicaManager.scala:507)
> at kafka.server.ReplicaManager.fetchMessages(ReplicaManager.scala:462)
> at kafka.server.KafkaApis.handleFetchRequest(KafkaApis.scala:431)
> at kafka.server.KafkaApis.handle(KafkaApis.scala:69)
> at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> and
> {code}
> ERROR Found invalid messages during fetch for partition [test,0] offset 2732 error Message found with corrupt size (0) in shallow iterator (kafka.server.ReplicaFetcherThread)
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)