Hi everyone,
I'd like to discuss a series of enhancements to the replication protocol.

A partition replica can experience local data loss in unclean shutdown
scenarios, such as an availability zone power outage or a server error,
where unflushed data in the OS page cache is lost. The Kafka replication
protocol is designed to handle these situations by removing such replicas
from the ISR and only re-adding them once they have caught up and therefore
recovered any lost data. This prevents replicas that lost an arbitrary log
suffix, which included committed data, from being elected leader.
However, there is a "last replica standing" state which, when combined with
a data loss unclean shutdown event, can turn a local data loss scenario into
a global data loss scenario, i.e., committed data can be removed from all
replicas. When the last replica in the ISR experiences an unclean shutdown
and loses committed data, it will be reelected leader after starting up
again, causing rejoining followers to truncate their logs and thereby
removing the last copies of the committed records which the leader lost
initially.
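
To make the sequence of events concrete, here is a small toy simulation of
the scenario. It is only a sketch and not Kafka code: logs are plain Python
lists, the replica names and record ids are made up, and follower truncation
is simplified to "cut the log to the leader's length" rather than the real
epoch-based truncation.

```python
def simulate_last_replica_standing():
    """Toy model of the 'last replica standing' data loss scenario.

    Illustrative assumptions: three replicas A, B, C; each log is a list
    of record ids; all four records were committed while the ISR was full.
    """
    committed = ["r0", "r1", "r2", "r3"]
    logs = {name: list(committed) for name in ("A", "B", "C")}
    isr = {"A", "B", "C"}

    # B and C drop out of the ISR (for example, they go offline).
    # Their on-disk logs still hold every committed record.
    isr -= {"B", "C"}

    # A, the last replica in the ISR, has an unclean shutdown and loses
    # the unflushed suffix of its log (here: r2 and r3).
    logs["A"] = logs["A"][:2]

    # A restarts. As the only ISR member it is reelected leader.
    leader = "A"
    assert isr == {leader}

    # B and C rejoin as followers and truncate to match the leader
    # (simplified: truncate to the leader's log length).
    for follower in ("B", "C"):
        logs[follower] = logs[follower][: len(logs[leader])]
        isr.add(follower)

    # Committed records that no replica holds any more.
    lost = [r for r in committed
            if all(r not in log for log in logs.values())]
    return logs, lost


if __name__ == "__main__":
    final_logs, lost_records = simulate_last_replica_standing()
    print("final logs:", final_logs)
    print("committed records lost everywhere:", lost_records)
```

In this toy run the local loss on A (r2 and r3) ends up as a global loss,
because the only remaining copies on B and C are truncated away once A
becomes leader again.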

The new KIP aims to maximize protection against this case and provide
MinISR-1 tolerance to data loss unclean shutdown events.

https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas
