[ https://issues.apache.org/jira/browse/KAFKA-5431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049156#comment-16049156 ]

Carsten Rietz commented on KAFKA-5431:
--------------------------------------

I still have no idea exactly what triggers this behaviour. Maybe it's some 
strange behaviour of our producers. I have been able to find one instance 
on our dev system with a low roll interval, a one-broker cluster, and some 
restarts.


Here is the corrupt log file:
{code}
[user@host __consumer_offsets-12]# ll -hs | head
total 2.7M
   0 -rw-r--r-- 1 jboss jboss    0 Jun 14 13:31 00000000000000000000.index
4.0K -rw-r--r-- 1 jboss jboss 100M Jun 14 12:30 00000000000000000000.log
4.0K -rw-r--r-- 1 jboss jboss   12 Jun 14 13:31 00000000000000000000.timeindex
   0 -rw-r--r-- 1 jboss jboss    0 Jun 14 13:31 00000000000000000001.index
4.0K -rw-r--r-- 1 jboss jboss  122 Jun 14 12:30 00000000000000000001.log
4.0K -rw-r--r-- 1 jboss jboss   12 Jun 14 13:31 00000000000000000001.timeindex
   0 -rw-r--r-- 1 jboss jboss    0 Jun 14 13:31 00000000000000000002.index
4.0K -rw-r--r-- 1 jboss jboss  122 Jun 14 12:30 00000000000000000002.log
4.0K -rw-r--r-- 1 jboss jboss   12 Jun 14 13:31 00000000000000000002.timeindex
[user@host __consumer_offsets-10]# stat 00000000000000000000.log
  File: `00000000000000000000.log'
  Size: 104857600       Blocks: 8          IO Block: 4096   regular file
Device: fd00h/64768d    Inode: 4718956     Links: 1
Access: (0644/-rw-r--r--)  Uid: (  494/   jboss)   Gid: (  488/   jboss)
Access: 2017-06-14 12:31:03.585079439 +0100
Modify: 2017-06-14 12:30:42.000000000 +0100
Change: 2017-06-14 12:30:48.520886205 +0100
{code}
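The sparseness is easy to confirm directly; a quick check with GNU du on the same file:
{code}
# compare apparent size against blocks actually allocated; per the listing above
# this should show ~100M apparent vs ~4.0K on disk, i.e. a sparse file
du --apparent-size -h 00000000000000000000.log
du -h 00000000000000000000.log
{code}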
{code}
[user@host __consumer_offsets-10]# kafka-run-class.sh kafka.tools.DumpLogSegments --files 00000000000000000000.log
Dumping 00000000000000000000.log
Starting offset: 0
offset: 0 position: 0 CreateTime: 1497439842157 isvalid: true payloadsize: 257 magic: 1 compresscodec: NONE crc: 1102674092 keysize: 26
Exception in thread "main" org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14).
{code}
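The shallow dump already dies on the entry after the first record; I would expect a deep dump to fail at the same byte (untested, same tool with its --deep-iteration flag):
{code}
# walk individual records instead of shallow entries; presumably fails identically
kafka-run-class.sh kafka.tools.DumpLogSegments --files 00000000000000000000.log --deep-iteration
{code}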
Hexdumping the file also shows that it's still a sparse file with trailing zero 
content:
{code}
[user@host __consumer_offsets-10]# hexdump -C 00000000000000000000.log
00000000  00 00 00 00 00 00 00 00  00 00 01 31 41 b9 78 ac  |...........1A.x.|
00000010  01 00 00 00 01 5c a6 5e  af 6d 00 00 00 1a 00 02  |.....\.^.m......|
00000020  00 16 63 6f 6e 73 6f 6c  65 2d 63 6f 6e 73 75 6d  |..console-consum|
00000030  65 72 2d 35 39 37 37 33  00 00 01 01 00 01 00 08  |er-59773........|
00000040  63 6f 6e 73 75 6d 65 72  00 00 00 01 00 05 72 61  |consumer......ra|
00000050  6e 67 65 00 2f 63 6f 6e  73 75 6d 65 72 2d 31 2d  |nge./consumer-1-|
00000060  39 39 36 33 62 31 36 36  2d 39 32 39 39 2d 34 34  |9963b166-9299-44|
00000070  30 66 2d 38 64 39 37 2d  35 62 34 38 31 33 64 35  |0f-8d97-5b4813d5|
00000080  63 39 38 34 00 00 00 01  00 2f 63 6f 6e 73 75 6d  |c984...../consum|
00000090  65 72 2d 31 2d 39 39 36  33 62 31 36 36 2d 39 32  |er-1-9963b166-92|
000000a0  39 39 2d 34 34 30 66 2d  38 64 39 37 2d 35 62 34  |99-440f-8d97-5b4|
000000b0  38 31 33 64 35 63 39 38  34 00 0a 63 6f 6e 73 75  |813d5c984..consu|
000000c0  6d 65 72 2d 31 00 0e 2f  31 30 2e 32 34 38 2e 34  |mer-1../10.248.4|
000000d0  33 2e 32 33 33 00 04 93  e0 00 00 27 10 00 00 00  |3.233......'....|
000000e0  28 00 00 00 00 00 01 00  1c 74 63 2e 69 6e 2e 64  |(........tc.in.d|
000000f0  65 76 2e 73 74 61 74 69  73 74 69 63 73 2e 67 72  |ev.statistics.gr|
00000100  61 70 68 69 74 00 00 00  00 00 00 00 30 00 00 00  |aphit.......0...|
00000110  00 00 01 00 1c 74 63 2e  69 6e 2e 64 65 76 2e 73  |.....tc.in.dev.s|
00000120  74 61 74 69 73 74 69 63  73 2e 67 72 61 70 68 69  |tatistics.graphi|
00000130  74 00 00 00 01 00 00 00  00 00 00 00 00 00 00 00  |t...............|
00000140  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
06400000
{code}
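That lines up with the exception, if I read the message format right: an entry is [offset:8][size:4][record], the first (and only) record is 305 bytes, so the second entry's header starts at byte 317 inside the zeroed tail. Its size field decodes to 0, which is below the 14-byte v0 record overhead (crc 4 + magic 1 + attributes 1 + key length 4 + value length 4). Reading those bytes directly:
{code}
# 12-byte header of the second entry at byte 317: all zeros, so the
# size field is 0 < 14 -> CorruptRecordException
od -A d -t x1 -j 317 -N 12 00000000000000000000.log
{code}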

We are running kafka_2.12-0.10.2.1 on jdk1.8.0_121 on Red Hat Enterprise Linux 
Server release 6.9.
Relevant parameters:
{code}
log.segment.bytes=10485760
log.retention.check.interval.ms=1000
log.roll.ms=10000
log.preallocate=true
log.retention.bytes=10485760
{code}
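For anyone trying to reproduce: essentially one broker started with the parameters above plus consumer-group traffic and a few restarts. A rough sketch (paths and topic name hypothetical):
{code}
# single broker with aggressive rolling and preallocation
bin/kafka-server-start.sh config/server.properties \
  --override log.roll.ms=10000 \
  --override log.preallocate=true \
  --override log.retention.check.interval.ms=1000 \
  --override log.segment.bytes=10485760 \
  --override log.retention.bytes=10485760
# drive offset commits so __consumer_offsets-* takes writes, then restart the broker a few times
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
{code}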

And finally the broker logs. Note that I had previously removed all 
__consumer_offsets files and restarted the Kafka broker due to the full disk.
{code}
[2017-06-14 12:28:46,198] INFO Started Kafka Graphite metrics reporter with polling period 1 seconds (com.criteo.kafka.KafkaGraphiteMetricsReporter)
...
[2017-06-14 12:28:51,170] INFO [Group Metadata Manager on Broker 1]: Loading offsets and group metadata from __consumer_offsets-12 (kafka.coordinator.GroupMetadataManager)
[2017-06-14 12:28:51,189] INFO [Group Metadata Manager on Broker 1]: Finished loading offsets from __consumer_offsets-12 in 19 milliseconds. (kafka.coordinator.GroupMetadataManager)
[2017-06-14 12:30:47,164] INFO Rolled new log segment for '__consumer_offsets-12' in 1 ms. (kafka.log.Log)
[2017-06-14 12:30:48,431] INFO Cleaner 0: Beginning cleaning of log __consumer_offsets-12. (kafka.log.LogCleaner)
[2017-06-14 12:30:48,432] INFO Cleaner 0: Building offset map for __consumer_offsets-12... (kafka.log.LogCleaner)
[2017-06-14 12:30:48,470] INFO Cleaner 0: Building offset map for log __consumer_offsets-12 for 1 segments in offset range [0, 1). (kafka.log.LogCleaner)
[2017-06-14 12:30:48,499] INFO Cleaner 0: Offset map for log __consumer_offsets-12 complete. (kafka.log.LogCleaner)
[2017-06-14 12:30:48,503] INFO Cleaner 0: Cleaning log __consumer_offsets-12 (cleaning prior to Wed Jun 14 12:30:42 BST 2017, discarding tombstones prior to Thu Jan 01 01:00:00 GMT 1970)... (kafka.log.LogCleaner)
[2017-06-14 12:30:48,507] INFO Cleaner 0: Cleaning segment 0 in log __consumer_offsets-12 (largest timestamp Wed Jun 14 12:30:42 BST 2017) into 0, retaining deletes. (kafka.log.LogCleaner)
[2017-06-14 12:30:48,521] INFO Cleaner 0: Swapping in cleaned segment 0 for segment(s) 0 in log __consumer_offsets-12. (kafka.log.LogCleaner)
	Log cleaner thread 0 cleaned log __consumer_offsets-12 (dirty section = [0, 0])
[2017-06-14 12:30:52,161] INFO Rolled new log segment for '__consumer_offsets-12' in 1 ms. (kafka.log.Log)
{code}

Here the log file seems to have been rolled but not truncated back to its real 
size, which is what tips over the log cleaner a bit later:

{code}
[2017-06-14 12:31:03,564] INFO Cleaner 0: Beginning cleaning of log __consumer_offsets-12. (kafka.log.LogCleaner)
[2017-06-14 12:31:03,564] INFO Cleaner 0: Building offset map for __consumer_offsets-12... (kafka.log.LogCleaner)
[2017-06-14 12:31:03,583] INFO Cleaner 0: Building offset map for log __consumer_offsets-12 for 3 segments in offset range [1, 4). (kafka.log.LogCleaner)
[2017-06-14 12:31:03,584] INFO Cleaner 0: Offset map for log __consumer_offsets-12 complete. (kafka.log.LogCleaner)
[2017-06-14 12:31:03,584] INFO Cleaner 0: Cleaning log __consumer_offsets-12 (cleaning prior to Wed Jun 14 12:31:02 BST 2017, discarding tombstones prior to Tue Jun 13 12:30:42 BST 2017)... (kafka.log.LogCleaner)
[2017-06-14 12:31:03,585] INFO Cleaner 0: Cleaning segment 0 in log __consumer_offsets-12 (largest timestamp Wed Jun 14 12:30:42 BST 2017) into 0, retaining deletes. (kafka.log.LogCleaner)
[2017-06-14 12:31:03,589] ERROR [kafka-log-cleaner-thread-0], Error due to  (kafka.log.LogCleaner)
org.apache.kafka.common.errors.CorruptRecordException: Record size is less than the minimum record overhead (14)
{code}
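A possible manual cleanup, though I have not verified it: with the broker stopped, trim the preallocated segment back to its real length so the cleaner no longer walks into the zero tail. Per the hexdump above that is 317 bytes here (12-byte entry overhead + 305-byte record):
{code}
# untested sketch: cut the sparse segment down to its valid bytes (broker stopped!);
# verify the valid length with DumpLogSegments before truncating
truncate -s 317 00000000000000000000.log
{code}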

> LogCleaner stopped due to org.apache.kafka.common.errors.CorruptRecordException
> -------------------------------------------------------------------------------
>
>                 Key: KAFKA-5431
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5431
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.10.2.1
>            Reporter: Carsten Rietz
>              Labels: reliability
>
> Hey all,
> I have a strange problem with our UAT cluster of 3 Kafka brokers.
> The __consumer_offsets topic was replicated to two instances and our disks 
> ran full due to a wrong configuration of the log cleaner. We fixed the 
> configuration and updated from 0.10.1.1 to 0.10.2.1.
> Today I increased the replication of the __consumer_offsets topic to 3 and 
> triggered replication to the third broker via kafka-reassign-partitions.sh. 
> That went well but I get many errors like
> {code}
> [2017-06-12 09:59:50,342] ERROR Found invalid messages during fetch for partition [__consumer_offsets,18] offset 0 error Record size is less than the minimum record overhead (14) (kafka.server.ReplicaFetcherThread)
> [2017-06-12 09:59:50,342] ERROR Found invalid messages during fetch for partition [__consumer_offsets,24] offset 0 error Record size is less than the minimum record overhead (14) (kafka.server.ReplicaFetcherThread)
> {code}
> which I think are due to the full-disk event.
> The log cleaner threads died on these corrupt messages:
> {code}
> [2017-06-12 09:59:50,722] ERROR [kafka-log-cleaner-thread-0], Error due to  (kafka.log.LogCleaner)
> org.apache.kafka.common.errors.CorruptRecordException: Record size is less than the minimum record overhead (14)
> [2017-06-12 09:59:50,722] INFO [kafka-log-cleaner-thread-0], Stopped  (kafka.log.LogCleaner)
> {code}
> Looking at the files I see that some are truncated and some are just empty:
> {code}
> $ ls -lsh 00000000000000594653.log
> 0 -rw-r--r-- 1 user user 100M Jun 12 11:00 00000000000000594653.log
> {code}
> Sadly I no longer have the logs from the disk-full event itself.
> I have three questions:
> * What is the best way to clean this up? Deleting the old log files and 
> restarting the brokers?
> * Why did Kafka not handle the disk-full event well? Does this affect only 
> the cleanup, or may we also lose data?
> * Is this maybe caused by the combination of the upgrade and the full disk?
> And last but not least: keep up the good work. Kafka performs really well 
> while being easy to administer, and the documentation is good!


