[ https://issues.apache.org/jira/browse/KAFKA-7408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jose Armando Garcia Sancio reassigned KAFKA-7408:
-------------------------------------------------

    Assignee: Jose Armando Garcia Sancio

> Truncate to LSO on unclean leader election
> ------------------------------------------
>
>                 Key: KAFKA-7408
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7408
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Jason Gustafson
>            Assignee: Jose Armando Garcia Sancio
>            Priority: Major
>
> If an unclean leader is elected, we may lose committed transaction data.
> That alone is expected, but what is worse is that a transaction which was
> previously completed (either committed or aborted) may lose its marker and
> become dangling. The transaction coordinator will not know about the
> unclean leader election, so it will not know to resend the transaction
> markers. Consumers with read_committed isolation will be stuck because the
> LSO cannot advance.
>
> To keep this scenario from occurring, it would be better to have the
> unclean leader truncate to the LSO so that there are no dangling
> transactions. Truncating to the LSO is not sufficient on its own because
> the markers which allowed the LSO to advance may be at higher offsets.
> What we can do is let the newly elected leader truncate to the LSO and
> then rewrite all the markers that followed it using its own leader epoch
> (to avoid divergence from followers).
>
> The interesting cases when an unclean leader election occurs are when a
> transaction is ongoing.
>
> 1. If a producer is in the middle of a transaction commit, then the
> coordinator may still attempt to write transaction markers. This will
> either succeed or fail depending on the producer epoch in the unclean
> leader. If the epoch matches, then the WriteTxnMarker call will succeed,
> and the consumer will simply ignore the resulting marker. If the epoch
> doesn't match, the WriteTxnMarker call will fail and the transaction
> coordinator can potentially remove the partition from the transaction.
>
> 2. If a producer is still writing the transaction, then what happens
> depends on the producer state in the unclean leader. If no producer state
> has been lost, then the transaction can continue without impact.
> Otherwise, the producer will likely fail with an OUT_OF_ORDER_SEQUENCE
> error, which will cause the transaction to be aborted by the coordinator.
> That takes us back to the first case.
>
> By truncating to the LSO, we ensure that transactions are either preserved
> in whole or removed from the log in whole. For an unclean leader election,
> that's probably as good as we can do. But we are assured that consumers
> will not be blocked by dangling transactions. The only remaining situation
> where a dangling transaction might be left is if one of the transaction
> state partitions has an unclean leader election.
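The "truncate to the LSO, then rewrite the markers that followed it" step in the description can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Kafka's implementation: the `Record` type and the `last_stable_offset` and `truncate_to_lso` functions are hypothetical names invented for this example, offsets are assumed contiguous from 0, each producer has at most one open transaction, and the LSO is taken to be the first offset of the earliest open transaction (the high watermark is ignored).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Record:
    # Hypothetical, simplified log record; not Kafka's on-disk format.
    offset: int
    producer_id: int
    kind: str          # "data", "commit", or "abort"
    leader_epoch: int


def last_stable_offset(log: List[Record]) -> int:
    """First offset of the earliest still-open transaction, else the log end offset."""
    open_txn_start = {}  # producer_id -> first offset of its open transaction
    for rec in log:
        if rec.kind == "data":
            open_txn_start.setdefault(rec.producer_id, rec.offset)
        else:
            # A commit/abort marker closes that producer's transaction.
            open_txn_start.pop(rec.producer_id, None)
    if open_txn_start:
        return min(open_txn_start.values())
    return log[-1].offset + 1 if log else 0


def truncate_to_lso(log: List[Record], new_leader_epoch: int) -> List[Record]:
    """Drop everything at or above the LSO, then re-append the markers that
    followed it, stamped with the new leader's epoch."""
    lso = last_stable_offset(log)
    kept = [r for r in log if r.offset < lso]
    markers = [r for r in log if r.offset >= lso and r.kind != "data"]
    next_offset = lso
    for m in markers:
        kept.append(Record(next_offset, m.producer_id, m.kind, new_leader_epoch))
        next_offset += 1
    return kept


# Producer 1 commits at offset 3, but producer 2's open transaction pins the LSO at 2.
log = [
    Record(0, 1, "data", 5),
    Record(1, 1, "data", 5),
    Record(2, 2, "data", 5),
    Record(3, 1, "commit", 5),
    Record(4, 2, "data", 5),
]
new_log = truncate_to_lso(log, new_leader_epoch=6)
print([(r.offset, r.kind, r.leader_epoch) for r in new_log])
# → [(0, 'data', 5), (1, 'data', 5), (2, 'commit', 6)]
```

In this example a plain truncation to offset 2 would have discarded producer 1's commit marker and left its transaction dangling. Rewriting the marker at offset 2 under the new epoch keeps producer 1's transaction whole, removes producer 2's open transaction entirely, and lets the LSO advance to the log end.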