[ 
https://issues.apache.org/jira/browse/KAFKA-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-7866.
------------------------------------
       Resolution: Fixed
         Assignee: Jason Gustafson
    Fix Version/s: 2.2.1
                   2.1.2
                   2.0.2

> Duplicate offsets after transaction index append failure
> --------------------------------------------------------
>
>                 Key: KAFKA-7866
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7866
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jason Gustafson
>            Assignee: Jason Gustafson
>            Priority: Major
>             Fix For: 2.0.2, 2.1.2, 2.2.1
>
>
> We have encountered a situation in which an ABORT marker was written 
> successfully to the log but failed to be written to the transaction index. 
> This prevented the log end offset from being incremented, so the next 
> append was attempted at an already-used offset and produced duplicates. 
> The broker was using JBOD, and we would normally expect an IOException to 
> cause the log directory to be marked offline. That did not happen here, 
> and the duplicates continued for several hours.
>
> Unfortunately, we are not sure what caused the failure. Significantly, the 
> first duplicate was also the first ABORT marker in the log. Unlike the 
> offset and timestamp indexes, the transaction index is created on demand 
> after the first aborted transaction, so it is likely that the attempt to 
> create and open the transaction index failed. There is some suggestion 
> that the process may have hit the open file limit. Whatever the problem 
> was, it also prevented log collection, so we cannot confirm these guesses.
>
> Without knowing the underlying cause, we can still consider some potential 
> improvements:
> 1. We should probably catch non-IO exceptions in the append process. If 
> the append to one of the indexes fails, we could truncate the log back to 
> its pre-append state, or re-throw the exception as an IOException to 
> ensure that the log directory is no longer used (see the first sketch 
> after this list).
> 2. Even without the unexpected exception, there is a small window during 
> which even an IOException could lead to duplicate offsets: marking a log 
> directory offline is an asynchronous operation, and there is no guarantee 
> that another append cannot happen first. Given this, we probably need to 
> detect and truncate duplicates during the log recovery process (see the 
> second sketch below).
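>
> A minimal sketch of the first idea, in Scala since that is the broker's 
> implementation language. The names here (appendWithIndexSafety, 
> appendToLogAndIndexes, truncateTo) are hypothetical placeholders rather 
> than the actual Log internals; the point is only that a non-IO exception 
> rolls back the partial append and then surfaces as an IOException:
> {code:scala}
> import java.io.IOException
>
> // Wrap the append so that any unexpected (non-IO) exception first rolls
> // back the partial append, then surfaces as an IOException so that the
> // existing failure handling takes the log directory offline.
> // All helper names below are hypothetical, for illustration only.
> def appendWithIndexSafety(appendToLogAndIndexes: () => Unit,
>                           truncateTo: Long => Unit,
>                           logEndOffset: Long): Unit = {
>   try {
>     appendToLogAndIndexes()
>   } catch {
>     case e: IOException =>
>       throw e // already routed to the offline-directory handling
>     case e: Throwable =>
>       // Roll back to the pre-append log end offset so that offsets
>       // stay monotonic even if the broker keeps running.
>       truncateTo(logEndOffset)
>       throw new IOException("Unexpected error appending to index", e)
>   }
> }
> {code}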
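> And a sketch of the recovery-time duplicate detection from the second 
> idea, with a made-up Batch type standing in for the real record batch 
> iterator:
> {code:scala}
> // Hypothetical stand-in for a record batch read during recovery.
> case class Batch(baseOffset: Long, lastOffset: Long)
>
> // Scan batches in log order; the first batch whose base offset does not
> // advance past the last offset seen marks a duplicate append, and
> // recovery should truncate the log from that offset.
> def findTruncationPoint(batches: Iterator[Batch]): Option[Long] = {
>   var lastOffset = -1L
>   for (batch <- batches) {
>     if (batch.baseOffset <= lastOffset)
>       return Some(batch.baseOffset) // duplicate: truncate from here
>     lastOffset = batch.lastOffset
>   }
>   None // offsets strictly increasing; nothing to truncate
> }
> {code}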


