Ok, I spent some more time staring at our logs and figured out that it was our
fault. We were not waiting for the Kafka broker to fully initialize before
moving on to the next broker, and loading the data logs can take quite some
time (~7 minutes in one case), so we ended up with no replicas online at some
point, and the replica that came back first was a little short on data...

How do you automate waiting for the broker to come up? Just keep monitoring the 
process and keep trying to connect to the port?
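
Right now I'm thinking of something along these lines (just a rough sketch; the
host, port, and timeout values are placeholders, and a successful TCP connect
only tells you the broker is listening, not that all partitions have finished
recovery):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class WaitForBroker {

    // Poll the broker's port until a TCP connection succeeds or we hit the deadline.
    static boolean waitForPort(String host, int port, long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), 2000);
                return true; // broker is at least accepting connections
            } catch (IOException e) {
                Thread.sleep(5000); // not up yet, retry in a bit
            }
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // Log recovery alone took ~7 minutes for us, so allow a generous window.
        boolean up = waitForPort("localhost", 9092, 15 * 60 * 1000L);
        System.out.println(up ? "broker is accepting connections" : "gave up waiting for broker");
    }
}

Alternatively, I was wondering whether checking for the broker's registration
under /brokers/ids/<id> in ZooKeeper might be a stronger signal, since that
seems to happen later in startup than the socket server coming up.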

/Sam

On Aug 29, 2013, at 6:40 PM, Sam Meder <sam.me...@jivesoftware.com> wrote:

> 
> On Aug 29, 2013, at 5:50 PM, Sriram Subramanian <srsubraman...@linkedin.com> 
> wrote:
> 
>> Do you know why you timed out on a regular shutdown?
> 
> No, though I think it may just have been that the timeout we put in was too 
> short.
> 
>> If the replica had
>> fallen off of the ISR and shutdown was forced on the leader this could
>> happen.
> 
> Hmm, but it shouldn't really be made leader if it isn't even in the ISR, 
> should it?
> 
> /Sam
> 
>> With ack = -1, we guarantee that all the replicas in the in-sync set
>> have received the message before exposing the message to the consumer.
>> 
>> On 8/29/13 8:32 AM, "Sam Meder" <sam.me...@jivesoftware.com> wrote:
>> 
>>> We've recently come across a scenario where we see consumers resetting
>>> their offsets to earliest, which as far as I can tell may also lead to
>>> data loss (we're running with ack = -1 to avoid loss). This seems to
>>> happen when we time out on doing a regular shutdown and instead kill -9
>>> the Kafka broker, but it obviously applies to any scenario that involves
>>> an unclean exit. As far as I can tell, what happens is:
>>> 
>>> 1. On restart the broker truncates the data for the affected partitions,
>>> i.e. not all data was written to disk.
>>> 2. The new broker then becomes a leader for the affected partitions and
>>> consumers get confused because they've already consumed beyond the now
>>> available offset.
>>> 
>>> Does that seem like a possible failure scenario?
>>> 
>>> /Sam
>> 
> 
