[jira] [Commented] (RATIS-1879) Handle RaftLog corruption when unsafe flush is enabled.

Xinyu Tan (Jira) Tue, 29 Aug 2023 19:53:36 -0700


    [ 
https://issues.apache.org/jira/browse/RATIS-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760209#comment-17760209
 ]


Xinyu Tan commented on RATIS-1879:
----------------------------------

Hi [~szetszwo], thank you for your detailed explanation!

Regarding the statement "log corruption may only lead to the loss of 
uncommitted data which is okay," I have some thoughts of my own. In theory, I 
believe Raft requires that every log entry be persisted once the {{appendLog}} 
function is called, even if the entry has not been committed yet. This avoids 
problems like the one described in the [Figure 
8|https://github.com/maemual/raft-zh_cn/blob/master/images/raft-%E5%9B%BE8.png] 
corner case, where entries are committed and applied but then lost on restart 
because they had not yet been stored durably. Of course, in practical 
engineering it is not feasible to take such measures for every log entry, for 
performance reasons (for example, calling {{fsync}} on every append and 
ensuring atomic writes to disk with a double-write buffer). As a result, there 
will inevitably be some corner cases that result in data loss. In such 
situations, I believe the engineering expectation is that nodes can still 
provide a highly available service after restarting or coordinating with each 
other, and that a small amount of data loss is acceptable.

However, in the current case, this log corruption has led to the cluster being 
unable to start, which could have significant implications for the business.

I'm considering whether some action should be taken during the RaftLog 
recovery phase to enable the cluster to start. Alternatively, it might be worth 
adding a "RaftLog repair" tool to fix damaged log files. Both approaches share 
the ultimate goal of getting the cluster up and running again so that it can 
continue providing services.
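
To make the "repair" idea concrete, here is a minimal sketch of tail truncation. It assumes a hypothetical record layout of {{[int length][payload][int CRC32]}}, which is NOT the actual Ratis segment format, and every name in it ({{LogTailRepair}}, {{encode}}, {{findValidLength}}, {{truncateCorruptTail}}) is made up for illustration. It scans a file from the start, stops at the first record that is torn or fails its checksum, and truncates the file there so that the remaining prefix can be loaded.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

// Illustrative only: hypothetical record layout, not the Ratis segment format.
// Each record is [int length][payload bytes][int CRC32 of payload].
final class LogTailRepair {

    /** Encodes one payload in the hypothetical record layout. */
    static byte[] encode(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        ByteBuffer buf = ByteBuffer.allocate(4 + payload.length + 4);
        buf.putInt(payload.length).put(payload).putInt((int) crc.getValue());
        return buf.array();
    }

    /** Returns the byte offset just past the last record that verifies. */
    static long findValidLength(Path file) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(Files.readAllBytes(file));
        long valid = 0;
        // A record needs at least a length field and a checksum field.
        while (buf.remaining() >= 8) {
            int len = buf.getInt();
            // Reject zero or negative lengths (e.g. zero padding) and
            // lengths that run past the end of the file (torn write).
            if (len <= 0 || len > buf.remaining() - 4) {
                break;
            }
            byte[] payload = new byte[len];
            buf.get(payload);
            int stored = buf.getInt();
            CRC32 crc = new CRC32();
            crc.update(payload, 0, len);
            if ((int) crc.getValue() != stored) {
                break; // checksum mismatch: treat this and the rest as corrupt
            }
            valid = buf.position();
        }
        return valid;
    }

    /** Truncates the file after the last valid record; returns bytes dropped. */
    static long truncateCorruptTail(Path file) throws IOException {
        long valid = findValidLength(file);
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            long dropped = raf.length() - valid;
            raf.setLength(valid);
            raf.getFD().sync(); // make the truncation itself durable
            return dropped;
        }
    }
}
```

Calling {{LogTailRepair.truncateCorruptTail(path)}} on a file whose last record was only partially written would drop just that partial record. Whether dropping the tail is safe depends on whether any dropped entry was already committed, so a real tool would probably have to check that against the rest of the cluster before truncating.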

What's your opinion?

> Handle RaftLog corruption when unsafe flush is enabled.
> -------------------------------------------------------
>
>                 Key: RATIS-1879
>                 URL: https://issues.apache.org/jira/browse/RATIS-1879
>             Project: Ratis
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.0.0, 2.5.1
>            Reporter: Song Ziyang
>            Assignee: Tsz-wo Sze
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> During normal operations of the RaftServer, its containing virtual machine 
> (VM) was unexpectedly shut down and subsequently restarted. Following the VM 
> reboot, *our attempts to restart the RaftServer failed with an exception 
> indicating corruption in the RaftLog.*
> *For the details of this exception, please refer to 
> [https://apache-iotdb.feishu.cn/docx/Zmyudq0FYoDVcsxDwHpcINyznfg]* 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
