[ 
https://issues.apache.org/jira/browse/KAFKA-5377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus B updated KAFKA-5377:
----------------------------
    Description: 
We are running Kafka in a 2 x broker cluster configuration on Windows, and 
overall it has been working well for us. We have been seeing occasional issues 
where the broker crashes first on one node, and then almost immediately on the 
second. When we go and try to re-start, the broker continues to crash during 
startup until we fix the issue that caused the crash.

I finally figured out that the root cause of the startup crashes was a bad set 
of files in __consumer_offsets-2 (in this latest case; which offsets partition 
is the cause varies). Once I deleted the bad files, the broker started up 
correctly again.

From what I can tell, looking at the code, crash dump files, and log files, 
it is all happening because of the log cleaner, and I can pin it down in 
most (if not all) cases to TimeIndex. The Java dump file indicates some kind 
of access violation, but I am not sure when or how that is happening. It seems 
like the initial crashes happen during the compacting/swapping action, and 
then the startups fail when they try to access the bad files 
(TimeIndex.parse()).
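
For anyone trying to find which files are bad before deleting them, here is a 
minimal sketch of an offline check. It is not Kafka code. It assumes that a 
.timeindex file is a sequence of 12-byte big-endian entries (8-byte timestamp 
followed by a 4-byte relative offset) and that an all-zero entry marks the 
unused, preallocated tail. The names check_time_index and check_partition_dir 
are made up for illustration.

```python
import struct
from pathlib import Path

# Assumed entry layout: 8-byte timestamp + 4-byte relative offset, big-endian.
ENTRY = struct.Struct(">qi")


def check_time_index(data: bytes) -> list:
    """Return a list of problems found in the raw bytes of a .timeindex file."""
    problems = []
    remainder = len(data) % ENTRY.size
    if remainder != 0:
        problems.append(
            f"length {len(data)} is not a multiple of {ENTRY.size}"
        )
    usable = len(data) - remainder
    prev = None
    for pos in range(0, usable, ENTRY.size):
        ts, rel_offset = ENTRY.unpack_from(data, pos)
        if ts == 0 and rel_offset == 0:
            # Assumed to be the preallocated, unused tail of the file.
            break
        if prev is not None and (ts < prev[0] or rel_offset < prev[1]):
            problems.append(
                f"entry {pos // ENTRY.size} goes backwards: "
                f"({ts}, {rel_offset}) after {prev}"
            )
        prev = (ts, rel_offset)
    return problems


def check_partition_dir(partition_dir: str) -> dict:
    """Check every .timeindex file in one partition directory."""
    results = {}
    for path in sorted(Path(partition_dir).glob("*.timeindex")):
        results[path.name] = check_time_index(path.read_bytes())
    return results
```

Run check_partition_dir against a copy of the partition directory while the 
broker is stopped; any file with a non-empty problem list is a candidate for 
the bad files described above.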

I am attaching dump files from two separate instances, covering both the 
initial crash and the failed restart attempts. I am also including the broker 
config settings that we are using.

I'm not sure what additional information to provide, but I can add more if 
needed.

Any help, suggestions, or input would be much appreciated.

  was:
We are running Kafka in a 2 x broker cluster configuration on Windows, and 
overall it has been working well for us. We have been seeing occasional issues 
where the broker crashes first on one node, and then almost immediately on the 
second. When we go an try to re-start, the broker continues to crash during 
startup.

I finally figured out that the cause of the startup not working was a bad set 
of files in __consumer_offsets-2 (in this latest case). Once I deleted the bad 
files, the broker started up correctly again. 

From what I can tell, looking at both code, crash dump files, it is all 
happening because of the log cleaner, and I can pinpoint it down in most (if 
not all) cases to TimeIndex. The java dump file indicates some kind of an 
access violation, but I am not sure when/how that is happening.

I am attaching dump files from two separate instances of when it initially 
crashed, and then when we try to restart. Also including the broker config 
settings that we are using.

I'm not sure what additional information to provide, but I can add more if 
needed.

Any help, suggestions or input would be very appreciated.


> Kafka server process crashing due to access violation (caused by log cleaner)
> -----------------------------------------------------------------------------
>
>                 Key: KAFKA-5377
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5377
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.10.2.0, 0.10.2.1
>         Environment: Windows 2008 R2, Intel Xeon CPU, 64 GB RAM
> 4 Disk Drives (C for software, D for log files, E/F for Kafka/Zookeeper data)
> 2 broker cluster
>            Reporter: Markus B
>              Labels: windows
>         Attachments: hs_err_pid15944.log, hs_err_pid6304.log, 
> hs_err_pid7356.log, hs_err_pid9056.log, hs_err_pid9276.log, 
> java_error7192.log, server.1.properties
>
>
> We are running Kafka in a 2 x broker cluster configuration on Windows, and 
> overall it has been working well for us. We have been seeing occasional 
> issues where the broker crashes first on one node, and then almost 
> immediately on the second. When we go and try to re-start, the broker 
> continues to crash during startup until we fix the issue that caused the 
> crash.
> I finally figured out that the root cause of the startup crashes was a bad 
> set of files in __consumer_offsets-2 (in this latest case; which offsets 
> partition is the cause varies). Once I deleted the bad files, the broker 
> started up correctly again.
> From what I can tell, looking at the code, crash dump files, and log files, 
> it is all happening because of the log cleaner, and I can pin it down in 
> most (if not all) cases to TimeIndex. The Java dump file indicates some kind 
> of access violation, but I am not sure when or how that is happening. It 
> seems like the initial crashes happen during the compacting/swapping action, 
> and then the startups fail when they try to access the bad files 
> (TimeIndex.parse()).
> I am attaching dump files from two separate instances, covering both the 
> initial crash and the failed restart attempts. I am also including the 
> broker config settings that we are using.
> I'm not sure what additional information to provide, but I can add more if 
> needed.
> Any help, suggestions, or input would be much appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
