[ 
https://issues.apache.org/jira/browse/KAFKA-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808761#comment-13808761
 ] 

Jay Kreps commented on KAFKA-1106:
----------------------------------

Yeah, a corrupted offset file would lead to this (though it could also be some 
other bug). We do shut down the broker on any I/O error, since that means we 
don't know the state of the data on disk and need to run recovery. Do you have 
the log from that previous shutdown?

If the offset checkpoint is corrupt, I think the desired behavior is for the 
node to crash. So in that case I think the problem is that we throw that 
NumberFormatException, which we probably don't handle right, instead of an 
IOException, which would cause the broker to shoot itself in the head.
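
To illustrate the distinction, here is a minimal sketch, not the actual broker 
code. The names IoErrorHandling, Task, and runOrShutdown are hypothetical, and 
the shutdown hook stands in for whatever the broker really does on an I/O 
error. It only shows that a handler written for IOException never sees a 
NumberFormatException, because the latter is an unchecked RuntimeException.

```java
import java.io.IOException;

// Hypothetical illustration; none of these names come from the Kafka code base.
final class IoErrorHandling {
    /** A unit of work that may fail with an I/O error. */
    interface Task {
        void run() throws IOException;
    }

    /**
     * Runs the task and invokes the shutdown hook only when an IOException
     * occurs. Returns true if the shutdown hook was invoked.
     */
    static boolean runOrShutdown(Task task, Runnable shutdown) {
        try {
            task.run();
            return false;
        } catch (IOException e) {
            // Only I/O errors reach this handler and trigger the shutdown.
            shutdown.run();
            return true;
        }
        // A NumberFormatException is a RuntimeException, so it is not caught
        // above and propagates to the caller without triggering the shutdown.
    }
}
```

With this shape, a task that fails inside Long.parseLong escapes the handler 
entirely, which would be consistent with the broker staying up in a bad state 
rather than failing fast.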

Let's do this: I'll fix the parsing logic on trunk so that any unparsable file 
throws IOException. This will let us gracefully handle corruption in the file. 
I'm still not convinced that this is file corruption and not just some bug in 
our code, but without the actual file it's a little hard to know. If you can 
reproduce it on another machine, that proves it is a bug. If so, grab the 
file; I suspect it will give a clue about what is going on.
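
A rough sketch of what "any unparsable file throws IOException" could look 
like, written in Java for illustration. This is not the trunk patch: the class 
name CheckpointParser and the file layout (an entry count on the first line, 
then one "topic partition offset" triple per line) are assumptions made for 
the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical parser; the file layout below is assumed for illustration only.
final class CheckpointParser {
    /**
     * Parses an assumed layout: first line is the entry count, each following
     * line is "topic partition offset". Any malformed content is reported as
     * an IOException so callers can treat it like any other I/O failure.
     */
    static Map<String, Long> parse(BufferedReader reader) throws IOException {
        try {
            String countLine = reader.readLine();
            if (countLine == null) {
                throw new IOException("Empty checkpoint file");
            }
            int expected = Integer.parseInt(countLine.trim());
            Map<String, Long> offsets = new HashMap<>();
            int seen = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // tolerate blank lines
                }
                String[] parts = line.trim().split("\\s+");
                if (parts.length != 3) {
                    throw new IOException("Malformed checkpoint line: " + line);
                }
                int partition = Integer.parseInt(parts[1]);
                long offset = Long.parseLong(parts[2]);
                offsets.put(parts[0] + "-" + partition, offset);
                seen++;
            }
            if (seen != expected) {
                throw new IOException(
                    "Expected " + expected + " entries but found " + seen);
            }
            return offsets;
        } catch (NumberFormatException e) {
            // Convert the unchecked parse failure into an I/O failure.
            throw new IOException("Unparsable checkpoint file", e);
        }
    }
}
```

The point of the wrapping catch is that a garbled number and a truncated file 
both surface the same way to the caller, so a single IOException path covers 
every form of corruption the parser can detect.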

> HighwaterMarkCheckpoint failure puting broker into a bad state
> --------------------------------------------------------------
>
>                 Key: KAFKA-1106
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1106
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.8
>            Reporter: David Lao
>         Attachments: KAFKA-1106-patch, kafka.log
>
>
> I'm encountering a case where the broker gets stuck because 
> HighwaterMarkCheckpoint fails to recover from reading what appear to be 
> corrupted ISR entries. Once in this state, leader election can never 
> succeed, which stalls the entire cluster. 
> Please see the detailed stack trace in the attached log. Perhaps failing 
> fast when HighwaterMarkCheckpoint fails to read would force the broker to 
> restart and recover.



