[ 
https://issues.apache.org/jira/browse/KAFKA-7408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jose Armando Garcia Sancio reassigned KAFKA-7408:
-------------------------------------------------

    Assignee: Jose Armando Garcia Sancio

> Truncate to LSO on unclean leader election
> ------------------------------------------
>
>                 Key: KAFKA-7408
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7408
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Jason Gustafson
>            Assignee: Jose Armando Garcia Sancio
>            Priority: Major
>
> If an unclean leader is elected, we may lose committed transaction data. That 
> alone is expected, but what is worse is that a transaction which was 
> previously completed (either committed or aborted) may lose its marker and 
> become dangling. The transaction coordinator will not know about the unclean 
> leader election, so will not know to resend the transaction markers. 
> Consumers with read_committed isolation will be stuck because the LSO cannot 
> advance.
> To keep this scenario from occurring, it would be better to have the unclean 
> leader truncate to the LSO so that there are no dangling transactions. 
> Truncating to the LSO is not alone sufficient because the markers which 
> allowed the LSO advancement may be at higher offsets. What we can do is let 
> the newly elected leader truncate to the LSO and then rewrite all the markers 
> that followed it using its own leader epoch (to avoid divergence from 
> followers).
> The interesting cases when an unclean leader election occurs are are when a 
> transaction is ongoing. 
> 1. If a producer is in the middle of a transaction commit, then the 
> coordinator may still attempt to write transaction markers. This will either 
> succeed or fail depending on the producer epoch in the unclean leader. If the 
> epoch matches, then the WriteTxnMarker call will succeed, which will simply 
> be ignored by the consumer. If the epoch doesn't match, the WriteTxnMarker 
> call will fail and the transaction coordinator can potentially remove the 
> partition from the transaction.
> 2. If a producer is still writing the transaction, then what happens depends 
> on the producer state in the unclean leader. If no producer state has been 
> lost, then the transaction can continue without impact. Otherwise, the 
> producer will likely fail with an OUT_OF_ORDER_SEQUENCE error, which will 
> cause the transaction to be aborted by the coordinator. That takes us back to 
> the first case.
> By truncating the LSO, we ensure that transactions are either preserved in 
> whole or they are removed from the log in whole. For an unclean leader 
> election, that's probably as good as we can do. But we are ensured that 
> consumers will not be blocked by dangling transactions. The only remaining 
> situation where a dangling transaction might be left is if one of the 
> transaction state partitions has an unclean leader election.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to