[ https://issues.apache.org/jira/browse/KAFKA-10778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Gustafson resolved KAFKA-10778.
-------------------------------------
    Fix Version/s: 2.8.0
         Assignee: Tom Bentley
       Resolution: Fixed

> Stronger log fencing after write failure
> ----------------------------------------
>
>                 Key: KAFKA-10778
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10778
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jason Gustafson
>            Assignee: Tom Bentley
>            Priority: Major
>             Fix For: 2.8.0
>
>
> If a log append operation fails with an IO error, the broker attempts to fail
> the log dir that it resides in. Currently this is done asynchronously, which
> means there is no guarantee that additional appends won't be attempted before
> the log is fenced. This can be a problem for EOS because of the need to
> maintain consistent producer state.
>
> 1. Iterate through batches to build producer state and collect completed
>    transactions
> 2. Append the batches to the log
> 3. Update the offset/timestamp indexes
> 4. Update log end offset
> 5. Apply individual producer state to `ProducerStateManager`
> 6. Update the transaction index
> 7. Update completed transactions and advance LSO
>
> One example of how this process can go wrong is if the index updates in step
> 3 fail. In this case, the log will contain updated producer state which has
> not been reflected in `ProducerStateManager`. If the append is retried before
> the log is fenced, then we can have duplicates. There are probably other
> potential failures that are possible as well.
>
> I'm sure we can come up with some way to fix this specific case, but the
> general fencing approach is slippery enough that we'll have a hard time
> convincing ourselves that it handles all potential cases. It would be simpler
> to add synchronous fencing logic for the case when an append fails due to an
> IO error. For example, we can mark a flag to indicate that the log is closed
> for additional read/write operations.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
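
For illustration only, the synchronous fencing idea from the description could be sketched roughly as below. This is not the change that was merged for KAFKA-10778; `LogStorage`, `LogFencedException`, and `FencedLog` are hypothetical names, and the two storage calls merely stand in for steps 2 and 3 of the append flow. The point is that the fence is set inside the same lock that guards the append, so a retry cannot reach the disk between the IO error and the fence taking effect.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the on-disk parts of an append (steps 2 and 3). */
interface LogStorage {
    void appendBatch(String batch) throws IOException;
    void updateIndexes(String batch) throws IOException;
}

/** Thrown when an operation fails with an IO error or hits an already fenced log. */
class LogFencedException extends RuntimeException {
    LogFencedException(String message, Throwable cause) {
        super(message, cause);
    }
}

/** Illustrative wrapper only; this is not Kafka's actual log class. */
class FencedLog {
    private final Object lock = new Object();
    private final LogStorage storage;
    // In-memory state that must stay consistent with what is on disk
    // (a simplified stand-in for ProducerStateManager).
    private final List<String> producerState = new ArrayList<>();
    // Non-null once the log has been fenced; guarded by lock.
    private IOException fenceCause;

    FencedLog(LogStorage storage) {
        this.storage = storage;
    }

    void append(String batch) {
        synchronized (lock) {
            checkNotFenced();
            try {
                storage.appendBatch(batch);
                storage.updateIndexes(batch);
            } catch (IOException e) {
                // Fence synchronously while still holding the lock, so no
                // later append or read can observe the half-applied state.
                fenceCause = e;
                throw new LogFencedException("append failed, log is now fenced", e);
            }
            // Reached only if every on-disk step succeeded.
            producerState.add(batch);
        }
    }

    List<String> read() {
        synchronized (lock) {
            checkNotFenced();
            return new ArrayList<>(producerState);
        }
    }

    boolean isFenced() {
        synchronized (lock) {
            return fenceCause != null;
        }
    }

    private void checkNotFenced() {
        if (fenceCause != null) {
            throw new LogFencedException("log is closed for read/write operations", fenceCause);
        }
    }
}
```

With a `LogStorage` whose `updateIndexes` throws, the first `append` fences the log and the retry is rejected by `checkNotFenced()` before `appendBatch` runs again, which is the duplicate scenario the description is concerned about. Failing the log dir itself could still happen asynchronously afterwards.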