[
https://issues.apache.org/jira/browse/RATIS-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760209#comment-17760209
]
Xinyu Tan commented on RATIS-1879:
----------------------------------
Hi [~szetszwo], thank you for your detailed explanation!
Regarding the statement "log corruption may only lead to the loss of
uncommitted data which is okay," I have some thoughts of my own. In theory, I
believe Raft requires every log entry to be persisted once {{appendLog}} is
called, even if the entry has not been committed yet. This avoids problems like
the [Figure
8|https://github.com/maemual/raft-zh_cn/blob/master/images/raft-%E5%9B%BE8.png]
corner case, where entries are committed and applied but then lost on restart
because they had not yet been stored durably. Of course, in practical
engineering it is not feasible to do this for every log entry because of the
performance cost (for example, calling {{fsync}} on every append and using a
double-write buffer to make disk writes atomic). As a result, there will
inevitably be some corner cases that lead to data loss. In such situations, I
believe the engineering expectation is that nodes can still provide a highly
available service after restarting or coordinating with each other, and that a
small amount of data loss is acceptable.
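To make the trade-off concrete, here is a minimal Java sketch of the two
durability modes. The class {{SketchLogAppender}} and its {{syncEveryAppend}}
flag are hypothetical names used only for illustration; they are not Ratis
APIs, and the real RaftLog writer is more involved.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical appender for illustration only; not a Ratis class. */
final class SketchLogAppender implements AutoCloseable {
  private final FileChannel channel;
  private final boolean syncEveryAppend;
  private long unsyncedBytes;

  SketchLogAppender(Path file, boolean syncEveryAppend) throws IOException {
    this.channel = FileChannel.open(file, StandardOpenOption.CREATE,
        StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    this.syncEveryAppend = syncEveryAppend;
  }

  /** Writes an entry; it is forced to disk before returning only in safe mode. */
  void append(byte[] entry) throws IOException {
    ByteBuffer buf = ByteBuffer.wrap(entry);
    while (buf.hasRemaining()) {
      unsyncedBytes += channel.write(buf);
    }
    if (syncEveryAppend) {
      sync(); // safe mode: pay the fsync cost on every append
    }
  }

  /** Forces the file content written so far to the storage device. */
  void sync() throws IOException {
    channel.force(false); // false: file content only, not all metadata
    unsyncedBytes = 0;
  }

  /** Bytes written since the last sync; these may vanish on a power loss. */
  long unsyncedBytes() {
    return unsyncedBytes;
  }

  @Override
  public void close() throws IOException {
    sync();
    channel.close();
  }
}
```

With {{syncEveryAppend}} set to false, anything counted by
{{unsyncedBytes()}} can be lost or partially written if the machine dies,
which is the window in which a torn or corrupt tail can appear.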
However, in the current case the log corruption has left the cluster unable to
start, which could have significant implications for the business. I wonder
whether the RaftLog recovery phase should take some action that allows the
cluster to start. Alternatively, it might be worth adding a "raftlog repair"
tool to fix damaged log files. Both approaches share the ultimate goal of
getting the cluster up and running again so that it can continue providing
services.
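As a rough illustration of what such a recovery or repair step could look
like, here is a Java sketch that scans a log file, stops at the first record
that is truncated or fails its checksum, and cuts the file back to the last
valid record. Everything in it is an assumption made for the example: the
class name {{LogTailRepair}}, the record layout (length, payload, CRC32), and
the size limit. It is not the Ratis segment format or an existing Ratis tool.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.zip.CRC32;

/** Hypothetical sketch; the record layout is NOT the Ratis segment format. */
final class LogTailRepair {
  // Assumed layout per record: [int length][payload][int crc32(payload)]
  private static final int MAX_RECORD = 64 * 1024 * 1024;

  /** Appends one record in the assumed layout (used here to build test data). */
  static void append(Path file, byte[] payload) throws IOException {
    CRC32 crc = new CRC32();
    crc.update(payload);
    ByteBuffer buf = ByteBuffer.allocate(8 + payload.length);
    buf.putInt(payload.length).put(payload).putInt((int) crc.getValue()).flip();
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
        StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
      while (buf.hasRemaining()) {
        ch.write(buf);
      }
    }
  }

  /** Returns the offset just past the last record that passes its checksum. */
  static long lastValidOffset(Path file) throws IOException {
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      long size = ch.size();
      long pos = 0;
      ByteBuffer header = ByteBuffer.allocate(4);
      while (pos + 8 <= size) {
        header.clear();
        if (!readFully(ch, header, pos)) break;
        int len = header.getInt(0);
        // A non-positive length also covers zero-filled (preallocated) tails.
        if (len <= 0 || len > MAX_RECORD || pos + 8L + len > size) break;
        ByteBuffer body = ByteBuffer.allocate(len + 4);
        if (!readFully(ch, body, pos + 4)) break;
        CRC32 crc = new CRC32();
        crc.update(body.array(), 0, len);
        if ((int) crc.getValue() != body.getInt(len)) break; // corrupt record
        pos += 8L + len;
      }
      return pos;
    }
  }

  private static boolean readFully(FileChannel ch, ByteBuffer buf, long pos)
      throws IOException {
    while (buf.hasRemaining()) {
      if (ch.read(buf, pos + buf.position()) < 0) return false; // hit EOF
    }
    return true;
  }

  /** Truncates the file after the last valid record; returns bytes dropped. */
  static long truncateCorruptTail(Path file) throws IOException {
    long valid = lastValidOffset(file);
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
      long dropped = ch.size() - valid;
      if (dropped > 0) {
        ch.truncate(valid);
        ch.force(true);
      }
      return dropped;
    }
  }
}
```

A real implementation would also need to decide what to do when the dropped
tail overlaps entries this node has already acknowledged or applied; the
sketch only shows the mechanical part of getting back to a readable log so
that the node can start and catch up from its peers.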
What's your opinion?
> Handle RaftLog corruption when unsafe flush is enabled.
> -------------------------------------------------------
>
> Key: RATIS-1879
> URL: https://issues.apache.org/jira/browse/RATIS-1879
> Project: Ratis
> Issue Type: Bug
> Components: server
> Affects Versions: 3.0.0, 2.5.1
> Reporter: Song Ziyang
> Assignee: Tsz-wo Sze
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> During normal operations of the RaftServer, its containing virtual machine
> (VM) was unexpectedly shut down and subsequently restarted. Following the VM
> reboot, *our attempts to restart the RaftServer failed with the following
> exception, which indicates corruption in the Raft Log.*
> *For details of this exception, please refer to
> [https://apache-iotdb.feishu.cn/docx/Zmyudq0FYoDVcsxDwHpcINyznfg]*
--
This message was sent by Atlassian Jira
(v8.20.10#820010)