[ https://issues.apache.org/jira/browse/ZOOKEEPER-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893168#comment-17893168 ]
Kezhu Wang commented on ZOOKEEPER-4882:
---------------------------------------

{quote}
But once it is restarted, it has no much difference to the follower case.
{quote}

This is not true. The snapshot has no data loss, but once the node is elected, it could propagate the data loss in its txn log to other nodes through DIFF sync. We need a test to verify this.

> Data loss after restarting an node experienced temporary disk error and rejoin
> ------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4882
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4882
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.8.4, 3.9.3
>            Reporter: Kezhu Wang
>            Priority: Major
>
> The cause is multifold:
> 1. The leader commits a proposal once a quorum has acked it.
> 2. A proposal can be committed in a node's memory even if it has not been written to that node's disk.
> 3. In case of a disk error, the txn log could lag behind the in-memory database.
>
> The above applies to both leader and follower. I have not verified the leader branch, so let's consider only the follower for now.
>
> f4. A follower that experienced a temporary disk error will have a hole in its txn log after it rejoins.
> f5. The restarted follower will lose the data. Worse, it is able to win the election and propagate the data loss to the whole cluster.
>
> I authored commits in my repo to expose this.
> https://github.com/kezhuw/zookeeper/commits/data-loss-temporary-sync-disk-error/

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
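The failure sequence above (a hole in the txn log, data loss on restart, then propagation through DIFF sync) can be illustrated with a toy model. This is a hypothetical sketch, not ZooKeeper code: `Node`, `commit`, `restart`, and `diffSyncTo` are invented names; restart replays only the txn log (the model has no snapshots); DIFF sync is reduced to "send every logged txn newer than the learner's last zxid"; and B is simply assumed to win the election.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model of ZOOKEEPER-4882; names and behavior are simplified assumptions.
class TxnLogHoleDemo {

    static final class Node {
        final String name;
        final TreeMap<Long, String> txnLog = new TreeMap<>(); // "on-disk" txn log
        final TreeMap<Long, String> memDb = new TreeMap<>();  // in-memory database
        boolean diskFailing = false;

        Node(String name) { this.name = name; }

        // Apply a committed txn: memory is always updated, but the log append is
        // silently skipped while the disk is failing, leaving a hole in the log.
        void commit(long zxid, String op) {
            if (!diskFailing) {
                txnLog.put(zxid, op);
            }
            memDb.put(zxid, op);
        }

        // Restart: rebuild memory by replaying the txn log only (no snapshot here).
        void restart() {
            memDb.clear();
            memDb.putAll(txnLog);
        }

        long lastLoggedZxid() {
            return txnLog.isEmpty() ? 0L : txnLog.lastKey();
        }

        // Simplified DIFF sync with this node as leader: send every logged txn
        // newer than the learner's last zxid. A hole in the leader's log is never filled.
        void diffSyncTo(Node learner) {
            long from = learner.lastLoggedZxid();
            for (Map.Entry<Long, String> e : txnLog.tailMap(from, false).entrySet()) {
                learner.commit(e.getKey(), e.getValue());
            }
        }
    }

    static Node[] runScenario() {
        Node a = new Node("A"), b = new Node("B"), c = new Node("C");

        // zxid 1 is committed on every node.
        for (Node n : new Node[] {a, b, c}) {
            n.commit(1, "create /x");
        }

        // C falls behind and B's disk fails temporarily. A and B form a quorum,
        // so zxid 2 is committed, but it lands only in B's memory, not B's log.
        b.diskFailing = true;
        a.commit(2, "set /x v2");
        b.commit(2, "set /x v2");
        b.diskFailing = false;

        // The disk recovers and zxid 3 is committed by the same quorum.
        a.commit(3, "set /x v3");
        b.commit(3, "set /x v3");

        // f5: after a restart B silently loses zxid 2, yet its last logged zxid (3)
        // equals A's, so B remains a valid election candidate.
        b.restart();

        // Assume B wins and C rejoins: DIFF sync sends only zxid 3, never zxid 2.
        b.diffSyncTo(c);
        return new Node[] {a, b, c};
    }

    public static void main(String[] args) {
        for (Node n : runScenario()) {
            System.out.println(n.name + " memDb=" + n.memDb.keySet()
                    + " lastLoggedZxid=" + n.lastLoggedZxid());
        }
    }
}
```

Running `main` prints `A memDb=[1, 2, 3]`, `B memDb=[1, 3]`, and `C memDb=[1, 3]`, each with `lastLoggedZxid=3`: every node reports the same last zxid, so nothing in the zxid comparison reveals that B and C are missing txn 2. In the real FastLeaderElection, a tie on epoch and zxid is broken by server id, which is why a node with a holed log can still be chosen.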