[ 
https://issues.apache.org/jira/browse/KAFKA-10778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-10778.
-------------------------------------
    Fix Version/s: 2.8.0
         Assignee: Tom Bentley
       Resolution: Fixed

> Stronger log fencing after write failure
> ----------------------------------------
>
>                 Key: KAFKA-10778
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10778
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jason Gustafson
>            Assignee: Tom Bentley
>            Priority: Major
>             Fix For: 2.8.0
>
>
> If a log append operation fails with an IO error, the broker attempts to fail 
> the log dir that it resides in. Currently this is done asynchronously, which 
> means there is no guarantee that additional appends won't be attempted before 
> the log is fenced. This can be a problem for EOS (exactly-once semantics) 
> because of the need to maintain consistent producer state. An append 
> currently goes through the following steps:
> 1. Iterate through batches to build producer state and collect completed 
> transactions
> 2. Append the batches to the log 
> 3. Update the offset/timestamp indexes
> 4. Update log end offset
> 5. Apply individual producer state to `ProducerStateManager`
> 6. Update the transaction index
> 7. Update completed transactions and advance LSO
> One example of how this process can go wrong is if the index updates in step 
> 3 fail. In this case, the log will contain batches whose producer state has 
> not been reflected in `ProducerStateManager`. If the append is retried before 
> the log is fenced, then we can have duplicates. There are probably other 
> failure modes as well.
> I'm sure we can come up with some way to fix this specific case, but the 
> general fencing approach is slippery enough that we'll have a hard time 
> convincing ourselves that it handles all potential cases. It would be simpler 
> to add synchronous fencing logic for the case when an append fails due to an 
> IO error. For example, we can set a flag indicating that the log is closed 
> to further read/write operations.
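
The synchronous fencing idea described above can be sketched roughly as
follows. This is only an illustration under stated assumptions, not the change
that was merged for this issue: `FencedLogSketch`, `LogFencedException` and
`IndexUpdater` are hypothetical names, the in-memory list stands in for the
log segment, and the `IndexUpdater` hook stands in for the index update in
step 3. The point it shows is that the flag is set inside the same critical
section as the failed append, so a retry cannot slip in before the log is
fenced.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a partition log; not Kafka's actual log class. */
class FencedLogSketch {
    /** Hypothetical exception raised for failed appends and for any use of a fenced log. */
    static class LogFencedException extends RuntimeException {
        LogFencedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical hook standing in for the offset/timestamp index update (step 3). */
    interface IndexUpdater {
        void update(long offset) throws IOException;
    }

    private final List<String> records = new ArrayList<>();
    private final IndexUpdater indexUpdater;
    // Set once, under the append lock, and never cleared.
    private volatile boolean fenced = false;

    FencedLogSketch(IndexUpdater indexUpdater) {
        this.indexUpdater = indexUpdater;
    }

    /** Appends a record and returns its offset, fencing the log on an IO error. */
    synchronized long append(String record) {
        ensureNotFenced();
        try {
            records.add(record);            // step 2: append to the log
            long offset = records.size() - 1;
            indexUpdater.update(offset);    // step 3: may fail with an IO error
            return offset;
        } catch (IOException e) {
            // Fence before the lock is released, so no retry can observe the
            // half-applied append.
            fenced = true;
            throw new LogFencedException("append failed, log is now fenced", e);
        }
    }

    /** Reads the record at the given offset unless the log has been fenced. */
    synchronized String read(long offset) {
        ensureNotFenced();
        return records.get((int) offset);
    }

    boolean isFenced() {
        return fenced;
    }

    private void ensureNotFenced() {
        if (fenced) {
            throw new LogFencedException("log was fenced after an earlier IO error", null);
        }
    }
}
```

With an `IndexUpdater` that throws `IOException`, the first `append` fails and
sets the flag, and every later `append` or `read` on the same instance is
rejected with `LogFencedException` instead of touching the inconsistent state.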



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
