Jason Gustafson created KAFKA-10778:
---------------------------------------
Summary: Stronger log fencing after write failure
Key: KAFKA-10778
URL: https://issues.apache.org/jira/browse/KAFKA-10778
Project: Kafka
Issue Type: Bug
Reporter: Jason Gustafson
If a log operation fails with an IO error, the broker attempts to fail the log
directory in which the log resides. Currently this is done asynchronously, which
means there is no guarantee that additional appends won't be attempted before
the log is fenced. This is a problem for EOS because we need to maintain
consistent producer state. The append path goes roughly through the following
steps:
1. Iterate through batches to build producer state and collect completed
transactions
2. Append the batches to the log
3. Update the offset/timestamp indexes
4. Update log end offset
5. Apply individual producer state to `ProducerStateManager`
6. Update the transaction index
7. Update completed transactions and advance LSO
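The steps above can be sketched as follows. This is an illustrative model, not
Kafka's actual API: the class name `SketchLog` and the plain lists standing in
for segments, indexes, and `ProducerStateManager` are all assumptions made for
the sketch.

```java
import java.util.ArrayList;
import java.util.List;

class SketchLog {
    final List<Long> batches = new ArrayList<>();       // stand-in for segment data
    final List<Long> offsetIndex = new ArrayList<>();   // stand-in for index entries
    final List<Long> producerState = new ArrayList<>(); // stand-in for ProducerStateManager
    long logEndOffset = 0;
    long lastStableOffset = 0;

    void append(List<Long> newBatches) {
        // 1. Iterate through batches to build producer state / completed txns
        List<Long> pendingProducerState = new ArrayList<>(newBatches);
        // 2. Append the batches to the log
        batches.addAll(newBatches);
        // 3. Update the offset/timestamp indexes (this step may fail with an IO error)
        offsetIndex.addAll(newBatches);
        // 4. Update the log end offset
        logEndOffset += newBatches.size();
        // 5. Apply producer state to the ProducerStateManager equivalent
        producerState.addAll(pendingProducerState);
        // 6. Update the transaction index (omitted in this sketch)
        // 7. Update completed transactions and advance the LSO
        lastStableOffset = logEndOffset;
    }
}
```

The key point is that the log contents (step 2) and the producer state (step 5)
are updated at different times, with fallible operations in between.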
One example of how this process can go wrong is if the index updates in step 3
fail. In this case, the log will contain updated producer state which has not
been reflected in `ProducerStateManager`. If the append is retried before the
log is fenced, then the log can end up with duplicates. Other failure modes are
likely possible as well.
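A toy sketch of that failure mode, under the simplifying assumption that
duplicate detection consults only an in-memory producer state (the names
`DuplicateSketch` and the boolean failure switch are illustrative, not Kafka
code):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DuplicateSketch {
    final List<Integer> log = new ArrayList<>();        // appended batch sequence numbers
    final Set<Integer> producerState = new HashSet<>(); // sequences seen (simplified)

    void append(int sequence, boolean indexUpdateFails) {
        // Duplicate check consults producer state only (simplified)
        if (producerState.contains(sequence)) {
            return; // duplicate detected, batch dropped
        }
        log.add(sequence);            // step 2: batch is now in the log
        if (indexUpdateFails) {
            throw new RuntimeException("simulated IO error in step 3");
        }
        producerState.add(sequence);  // step 5 never runs when step 3 fails
    }
}
```

If the first append fails at the index update and is then retried before the
log is fenced, the retry passes the duplicate check and the same sequence lands
in the log twice.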
I'm sure we can come up with some way to fix this specific case, but the
general fencing approach is slippery enough that we'll have a hard time
convincing ourselves that it handles all potential cases. It would be simpler
to add synchronous fencing logic for the case when an append fails due to an IO
error. For example, we can set a flag indicating that the log is closed for
additional read/write operations.
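A minimal sketch of that flag-based approach, assuming a `FencedLog` wrapper
(the class and method names are hypothetical): the flag is set synchronously,
before the IO error propagates, so every subsequent read or write fails fast
instead of racing the asynchronous log dir failure handling.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

class FencedLog {
    private final AtomicBoolean closed = new AtomicBoolean(false);
    final List<Long> batches = new ArrayList<>();

    void append(long batch, boolean simulateIoError) {
        checkOpen();
        try {
            if (simulateIoError) {
                throw new IOException("simulated index write failure");
            }
            batches.add(batch);
        } catch (IOException e) {
            // Fence synchronously before propagating the error, so no
            // further append can be attempted against inconsistent state
            closed.set(true);
            throw new UncheckedIOException(e);
        }
    }

    long read(int index) {
        checkOpen();
        return batches.get(index);
    }

    private void checkOpen() {
        if (closed.get()) {
            throw new IllegalStateException("log is closed after IO error");
        }
    }
}
```

With this in place, a retried append after a failed one is rejected outright
rather than producing duplicates.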
--
This message was sent by Atlassian Jira
(v8.3.4#803005)