[ 
https://issues.apache.org/jira/browse/RATIS-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766699#comment-17766699
 ] 

Sammi Chen commented on RATIS-1891:
-----------------------------------

This is a different as RATIS-1887. There is no raft log truncated involved or 
raft log purge involved. You can find that on the two SCM, the raft log are 
consistent, while on the one that SCM fail to start, it only has one raft log 
file, which has a hole of  raft log index. 

> Gap between logs cause service startup failure
> ----------------------------------------------
>
>                 Key: RATIS-1891
>                 URL: https://issues.apache.org/jira/browse/RATIS-1891
>             Project: Ratis
>          Issue Type: Bug
>            Reporter: Sammi Chen
>            Priority: Critical
>
> This is the second raft gap problem reported by Guo Hao.
> {code:java}
> 2023-09-08 18:53:47,590 [Listener at test17/9860] ERROR 
> org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SCM start 
> failed with exception
> java.util.concurrent.CompletionException: java.lang.IllegalStateException: 
> gap between start index 375 and first entry to append 377
>         at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>         at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>         at 
> java.util.concurrent.CompletableFuture.biRelay(CompletableFuture.java:1284)
>         at 
> java.util.concurrent.CompletableFuture$BiRelay.tryFire(CompletableFuture.java:1270)
>         at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>         at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>         at 
> org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:191)
>         at 
> org.apache.ratis.util.ConcurrentUtils.lambda$null$4(ConcurrentUtils.java:180)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalStateException: gap between start index 375 and 
> first entry to append 377
>         at 
> org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:60)
>         at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.append(LogSegment.java:313)
>         at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.lambda$loadSegment$2(LogSegment.java:165)
>         at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:138)
>         at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:164)
>         at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:381)
>         at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:241)
>         at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:214)
>         at 
> org.apache.ratis.server.raftlog.RaftLogBase.open(RaftLogBase.java:251)
>         at 
> org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:239)
>         at 
> org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:220)
>         at 
> org.apache.ratis.server.impl.ServerState.lambda$new$5(ServerState.java:161)
>         at 
> org.apache.ratis.util.MemoizedSupplier.get(MemoizedSupplier.java:62)
>         at 
> org.apache.ratis.server.impl.ServerState.initialize(ServerState.java:177)
>         at 
> org.apache.ratis.server.impl.RaftServerImpl.start(RaftServerImpl.java:338)
>         at 
> org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:188)
>         ... 4 more 
> {code}
> # The gap server directory 
> {code:java}
> $ ll 
> /home/work/ozone/scm.ha.ratis-storage-test4/1d823d9f-3e87-4790-85fc-f1a93f7845e5/current/
> total 4120
> -rw-rw-r-- 1 work work   14567 Sep  8 18:30 log_291-374
> -rw-rw-r-- 1 work work 4194304 Sep  8 18:30 log_inprogress_375
> -rw-rw-r-- 1 work work      50 Sep  8 18:30 raft-meta
> -rw-rw-r-- 1 work work     242 Sep  8 17:29 raft-meta.conf 
> {code}
>  
> The other two
> {code:java}
> $ ll
> total 4168
> -rw-rw-r-- 1 work work      95 Sep  8 12:13 log_0-0
> -rw-rw-r-- 1 work work   39285 Sep  8 17:30 log_1-290
> -rw-rw-r-- 1 work work   14567 Sep  8 17:35 log_291-374
> -rw-rw-r-- 1 work work     271 Sep  8 17:50 log_375-376
> -rw-rw-r-- 1 work work 4194304 Sep  8 19:01 log_inprogress_377
> -rw-rw-r-- 1 work work      86 Sep  8 18:29 raft-meta
> -rw-rw-r-- 1 work work     242 Sep  8 18:29 raft-meta.conf 
> {code}
> {code:java}
> $ ll
> total 4168
> -rw-rw-r-- 1 work work      95 Sep  8 13:15 log_0-0
> -rw-rw-r-- 1 work work   39285 Sep  8 17:29 log_1-290
> -rw-rw-r-- 1 work work   14567 Sep  8 17:35 log_291-374
> -rw-rw-r-- 1 work work     271 Sep  8 18:29 log_375-376
> -rw-rw-r-- 1 work work 4194304 Sep  8 19:01 log_inprogress_377
> -rw-rw-r-- 1 work work      86 Sep  8 18:29 raft-meta
> -rw-rw-r-- 1 work work     242 Sep  8 18:29 raft-meta.conf 
> {code}
>  
> Related Configurations:
> {code:java}
> <property>
> <name>hdds.ratis.raft.server.log.unsafe-flush.enabled</name>
> <value>false</value>
> </property>
> <property>
> <name>hdds.ratis.raft.server.log.async-flush.enabled</name>
> <value>false</value>
> </property> 
> {code}
>  
> The scene in which the GAP occurs this time is as follows:
> 1. shutdown scm, shutdown more than 60s timeout kill -9
> 2. restart scm, this error occurs



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to