[
https://issues.apache.org/jira/browse/RATIS-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766699#comment-17766699
]
Sammi Chen commented on RATIS-1891:
-----------------------------------
This is a different issue from RATIS-1887. No raft log truncation or raft log
purge is involved. On the two healthy SCMs the raft logs are consistent, while
the SCM that fails to start has only one raft log file, and that file has a
hole in its raft log index.
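
To make the failure condition concrete, here is a minimal sketch of the kind of
contiguity check the exception message describes. It is not the Ratis
implementation: the class SegmentGapCheck and the method firstMissingIndex are
hypothetical names, and the assumption is only that a segment's first entry
must carry the start index from its file name and that each later entry
follows the previous one by exactly 1.

```java
import java.util.OptionalLong;

/** Hypothetical helper, not part of Ratis: finds the first missing index in a segment. */
final class SegmentGapCheck {
  private SegmentGapCheck() {}

  /**
   * @param startIndex index encoded in the segment file name
   *                   (for example 375 for log_inprogress_375)
   * @param entryIndices indices of the entries read from the file, in file order
   * @return the first expected index that is missing, or empty if the entries are contiguous
   */
  static OptionalLong firstMissingIndex(long startIndex, long[] entryIndices) {
    long expected = startIndex;
    for (long index : entryIndices) {
      if (index != expected) {
        // Assumed failure condition: an entry skips ahead of (or behind) the expected index.
        return OptionalLong.of(expected);
      }
      expected++;
    }
    return OptionalLong.empty();
  }

  public static void main(String[] args) {
    // Values taken from the reported exception: start index 375, first entry 377.
    System.out.println(firstMissingIndex(375, new long[] {377, 378}));
    // A contiguous segment, such as log_375-376 on the two healthy SCMs.
    System.out.println(firstMissingIndex(375, new long[] {375, 376}));
  }
}
```

With the values from the stack trace, the check reports 375 as the first
missing index, which is consistent with the hole sitting inside
log_inprogress_375 rather than between two segment files.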
> Gap between logs cause service startup failure
> ----------------------------------------------
>
> Key: RATIS-1891
> URL: https://issues.apache.org/jira/browse/RATIS-1891
> Project: Ratis
> Issue Type: Bug
> Reporter: Sammi Chen
> Priority: Critical
>
> This is the second raft gap problem reported by Guo Hao.
> {code:java}
> 2023-09-08 18:53:47,590 [Listener at test17/9860] ERROR org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SCM start failed with exception
> java.util.concurrent.CompletionException: java.lang.IllegalStateException: gap between start index 375 and first entry to append 377
>     at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>     at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>     at java.util.concurrent.CompletableFuture.biRelay(CompletableFuture.java:1284)
>     at java.util.concurrent.CompletableFuture$BiRelay.tryFire(CompletableFuture.java:1270)
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>     at org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:191)
>     at org.apache.ratis.util.ConcurrentUtils.lambda$null$4(ConcurrentUtils.java:180)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalStateException: gap between start index 375 and first entry to append 377
>     at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:60)
>     at org.apache.ratis.server.raftlog.segmented.LogSegment.append(LogSegment.java:313)
>     at org.apache.ratis.server.raftlog.segmented.LogSegment.lambda$loadSegment$2(LogSegment.java:165)
>     at org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:138)
>     at org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:164)
>     at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:381)
>     at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:241)
>     at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:214)
>     at org.apache.ratis.server.raftlog.RaftLogBase.open(RaftLogBase.java:251)
>     at org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:239)
>     at org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:220)
>     at org.apache.ratis.server.impl.ServerState.lambda$new$5(ServerState.java:161)
>     at org.apache.ratis.util.MemoizedSupplier.get(MemoizedSupplier.java:62)
>     at org.apache.ratis.server.impl.ServerState.initialize(ServerState.java:177)
>     at org.apache.ratis.server.impl.RaftServerImpl.start(RaftServerImpl.java:338)
>     at org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:188)
>     ... 4 more
> {code}
> # Raft log directory on the SCM with the gap
> {code:java}
> $ ll /home/work/ozone/scm.ha.ratis-storage-test4/1d823d9f-3e87-4790-85fc-f1a93f7845e5/current/
> total 4120
> -rw-rw-r-- 1 work work 14567 Sep 8 18:30 log_291-374
> -rw-rw-r-- 1 work work 4194304 Sep 8 18:30 log_inprogress_375
> -rw-rw-r-- 1 work work 50 Sep 8 18:30 raft-meta
> -rw-rw-r-- 1 work work 242 Sep 8 17:29 raft-meta.conf
> {code}
>
> Raft log directories on the other two SCMs
> {code:java}
> $ ll
> total 4168
> -rw-rw-r-- 1 work work 95 Sep 8 12:13 log_0-0
> -rw-rw-r-- 1 work work 39285 Sep 8 17:30 log_1-290
> -rw-rw-r-- 1 work work 14567 Sep 8 17:35 log_291-374
> -rw-rw-r-- 1 work work 271 Sep 8 17:50 log_375-376
> -rw-rw-r-- 1 work work 4194304 Sep 8 19:01 log_inprogress_377
> -rw-rw-r-- 1 work work 86 Sep 8 18:29 raft-meta
> -rw-rw-r-- 1 work work 242 Sep 8 18:29 raft-meta.conf
> {code}
> {code:java}
> $ ll
> total 4168
> -rw-rw-r-- 1 work work 95 Sep 8 13:15 log_0-0
> -rw-rw-r-- 1 work work 39285 Sep 8 17:29 log_1-290
> -rw-rw-r-- 1 work work 14567 Sep 8 17:35 log_291-374
> -rw-rw-r-- 1 work work 271 Sep 8 18:29 log_375-376
> -rw-rw-r-- 1 work work 4194304 Sep 8 19:01 log_inprogress_377
> -rw-rw-r-- 1 work work 86 Sep 8 18:29 raft-meta
> -rw-rw-r-- 1 work work 242 Sep 8 18:29 raft-meta.conf
> {code}
>
> Related Configurations:
> {code:java}
> <property>
> <name>hdds.ratis.raft.server.log.unsafe-flush.enabled</name>
> <value>false</value>
> </property>
> <property>
> <name>hdds.ratis.raft.server.log.async-flush.enabled</name>
> <value>false</value>
> </property>
> {code}
>
> The scenario in which the gap occurred this time is as follows:
> 1. Shut down SCM; the shutdown exceeded the 60s timeout, so the process was
> killed with kill -9.
> 2. Restart SCM; this error occurs.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)