[ 
https://issues.apache.org/jira/browse/KAFKA-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734928#comment-14734928
 ] 

Håkon Hitland commented on KAFKA-2477:
--------------------------------------

I was able to enable trace logging on a production server, and have captured 
logs from the leader when the error happens.

It looks like the attempted read happens right before the append to the log 
actually completes. I don't see any other abnormal behaviour.

Looking at the code in question, I think I have an idea of how it might happen:

kafka.log.Log uses a lock to synchronize writes, but not reads.

Assume a write W1 has gotten as far as FileMessageSet.append() and has just 
executed _size.getAndAdd(written).
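
For context, the write side looks roughly like this (my simplified paraphrase of 
the 0.8.2 code, not a verbatim copy):

    // FileMessageSet.append(), simplified: writing the bytes and bumping _size
    // immediately changes what sizeInBytes() returns to concurrent readers,
    // before any Log-level offset metadata is updated.
    def append(messages: ByteBufferMessageSet) {
      val written = messages.writeTo(channel, 0, messages.sizeInBytes)
      _size.getAndAdd(written)
    }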

Now a concurrent read R1 comes in. In FileMessageSet.read(), it can get a new 
message set with end = math.min(this.start + position + size, sizeInBytes()). 
This includes the message that was just written by W1.
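
The read side, again a simplified paraphrase:

    // FileMessageSet.read(), simplified: the end of the returned slice is
    // clamped to the *current* sizeInBytes(), so it can already cover the
    // bytes W1 just wrote even though W1's append has not completed yet.
    def read(position: Int, size: Int): FileMessageSet =
      new FileMessageSet(file,
                         channel,
                         start = this.start + position,
                         end = math.min(this.start + position + size, sizeInBytes()))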

The read finishes, and a new read R2 starts. R2 tries to continue from the 
offset just after the message written by W1, but in Log.read() it finds that 
startOffset is larger than nextOffsetMetadata.messageOffset and throws an 
exception.
(By the way, Log.read() can potentially read nextOffsetMetadata multiple times, 
with no guarantee that it hasn't changed. It's not obvious to me that this is 
correct.)
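
The check that R2 trips over is roughly this (simplified; the exact exception 
message may differ):

    // Inside Log.read(), simplified: nextOffsetMetadata is read outside the
    // write lock, so a fetch for the offset right after the message R1 handed
    // out can still look like a read past the end of the log here.
    val next = nextOffsetMetadata.messageOffset
    if(startOffset > next)
      throw new OffsetOutOfRangeException(
        "Request for offset %d but the log end offset is only %d."
          .format(startOffset, next))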

Finally, W1 updates nextOffsetMetadata in Log.updateLogEndOffset(), too late 
for R2, which has already triggered a log truncation on the replica.
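
As far as I can tell, the write path publishes the new offset only as its last 
step, roughly (simplified; variable names are approximate):

    // Log.append(), simplified: both steps run under the write lock, but a
    // reader that takes no lock can observe the state in between step 1
    // (bytes visible via sizeInBytes()) and step 2 (offset published).
    lock synchronized {
      segment.append(appendInfo.firstOffset, validMessages) // step 1: ends up in FileMessageSet.append()
      updateLogEndOffset(appendInfo.lastOffset + 1)          // step 2: nextOffsetMetadata catches up
    }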

Some possible solutions:
- Synchronize access to nextOffsetMetadata in Log.read().
- Clamp reads in Log.read() to never go beyond the current message offset (a 
rough sketch of this follows below).
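
To illustrate the second option, something along these lines could work (just a 
sketch of the idea; everything except nextOffsetMetadata is named by me):

    // Hypothetical clamp in Log.read(): take one consistent snapshot of the
    // offset metadata up front and never hand out data at or beyond it, so
    // readers only see messages whose offset has already been published.
    val currentNextOffsetMetadata = nextOffsetMetadata  // read exactly once
    val logEndOffset = currentNextOffsetMetadata.messageOffset
    if(startOffset > logEndOffset)
      throw new OffsetOutOfRangeException(
        "Request for offset %d but the log end offset is only %d."
          .format(startOffset, logEndOffset))
    // ...and when reading from the active segment, cap the read at the position
    // recorded in currentNextOffsetMetadata rather than the segment's current size.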

> Replicas spuriously deleting all segments in partition
> ------------------------------------------------------
>
>                 Key: KAFKA-2477
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2477
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>            Reporter: Håkon Hitland
>         Attachments: kafka_log.txt, kafka_log_trace.txt
>
>
> We're seeing some strange behaviour in brokers: a replica will sometimes 
> schedule all segments in a partition for deletion, and then immediately start 
> replicating them back, triggering our check for under-replicated topics.
> This happens on average a couple of times a week, for different brokers and 
> topics.
> We have per-topic retention.ms and retention.bytes configuration; the topics 
> where we've seen this happen are hitting the size limit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
