[jira] [Updated] (KAFKA-14077) KRaft should support recovery from failed disk

A. Sophie Blee-Goldman (Jira) Wed, 14 Dec 2022 19:27:34 -0800


     [ 
https://issues.apache.org/jira/browse/KAFKA-14077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


A. Sophie Blee-Goldman updated KAFKA-14077:
-------------------------------------------
    Fix Version/s: 3.5.0
                       (was: 3.4.0)

> KRaft should support recovery from failed disk
> ----------------------------------------------
>
>                 Key: KAFKA-14077
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14077
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jason Gustafson
>            Assignee: Jose Armando Garcia Sancio
>            Priority: Blocker
>             Fix For: 3.5.0
>
>
> If one of the nodes in the metadata quorum has a disk failure, there is no 
> way currently to safely bring the node back into the quorum. When we lose 
> disk state, we are at risk of losing committed data even if the failure only 
> affects a minority of the cluster.
> Here is an example. Suppose that a metadata quorum has 3 members: v1, v2, and 
> v3. Initially, v1 is the leader and writes a record at offset 1. After v2 
> acknowledges replication of the record, it becomes committed. Suppose that v1 
> fails before v3 has a chance to replicate this record. As long as v1 remains 
> down, the raft protocol guarantees that only v2 can become leader, so the 
> record cannot be lost. The raft protocol expects that when v1 returns, it 
> will still have that record, but what if there is a disk failure, the state 
> cannot be recovered and v1 participates in leader election? Then we would 
> have committed data on a minority of the voters. The main problem here 
> concerns how we recover from this impaired state without risking the loss of 
> this data.
> Consider a naive solution which brings v1 back with an empty disk. Since the 
> node has lost is prior knowledge of the state of the quorum, it will vote for 
> any candidate that comes along. If v3 becomes a candidate, then it will vote 
> for itself and it just needs the vote from v1 to become leader. If that 
> happens, then the committed data on v2 will become lost.
> This is just one scenario. In general, the invariants that the raft protocol 
> is designed to preserve go out the window when disk state is lost. For 
> example, it is also possible to contrive a scenario where the loss of disk 
> state leads to multiple leaders. There is a good reason why raft requires 
> that any vote cast by a voter is written to disk since otherwise the voter 
> may vote for different candidates in the same epoch.
> Many systems solve this problem with a unique identifier which is generated 
> automatically and stored on disk. This identifier is then committed to the 
> raft log. If a disk changes, we would see a new identifier and we can prevent 
> the node from breaking raft invariants. Then recovery from a failed disk 
> requires a quorum reconfiguration. We need something like this in KRaft to 
> make disk recovery possible.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (KAFKA-14077) KRaft should support recovery from failed disk

Reply via email to