[ 
https://issues.apache.org/jira/browse/IOTDB-5248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Ziyang reassigned IOTDB-5248:
----------------------------------

    Assignee: Song Ziyang  (was: Xinyu Tan)

> [ratis] Restarting datanode takes a long time (reading raft log) and fails
> --------------------------------------------------------------------------
>
>                 Key: IOTDB-5248
>                 URL: https://issues.apache.org/jira/browse/IOTDB-5248
>             Project: Apache IoTDB
>          Issue Type: Bug
>          Components: mpp-cluster
>    Affects Versions: master branch, 1.0.0
>            Reporter: 刘珍
>            Assignee: Song Ziyang
>            Priority: Major
>         Attachments: iotdb_5248.conf, screenshot-1.png
>
>
> rel/1.0 1220 00a4080
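> 1. Start a 3-replica cluster with 3 ConfigNodes and 5 DataNodes (3C5D); the config, schema, and data regions all use the Ratis protocol.
> 2. Run the benchmark (BM) write workload until it completes.
> 3. On ip73, run flush from the CLI, then run stop-datanode.sh; the snapshot is taken successfully.
> Running start-datanode.sh then takes too long. The log keeps printing messages about reading the raft log (does the snapshot taken when the datanode was stopped have no effect?). After 1 hour 20 minutes, an error is reported and startup fails: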
>   !screenshot-1.png!
> 2022-12-20 15:36:48,858 [main] ERROR o.a.i.c.ServerCommandLine:63 - Failed to execute system command
> java.util.concurrent.CompletionException: java.lang.IllegalStateException: Found a gap between logs: the last log segment log-76356_76501 ended at 76501 but the next log segment log-76618_76772 started at 76618
>         at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
>         at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346)
>         at java.base/java.util.concurrent.CompletableFuture$BiRelay.tryFire(CompletableFuture.java:1423)
>         at java.base/java.util.concurrent.CompletableFuture$CoCompletion.tryFire(CompletableFuture.java:1144)
>         at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>         at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073)
>         at org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:174)
>         at org.apache.ratis.util.ConcurrentUtils.lambda$null$3(ConcurrentUtils.java:165)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.base/java.lang.Thread.run(Thread.java:834)
> {color:#DE350B}*Caused by: java.lang.IllegalStateException: Found a gap between logs: the last log segment log-76356_76501 ended at 76501 but the next log segment log-76618_76772 started at 76618*{color}
>         at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:72)
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.validateAdding(SegmentedRaftLogCache.java:421)
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.addSegment(SegmentedRaftLogCache.java:428)
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:381)
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:241)
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:214)
>         at org.apache.ratis.server.raftlog.RaftLogBase.open(RaftLogBase.java:251)
>         at org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:236)
>         at org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:217)
>         at org.apache.ratis.server.impl.ServerState.lambda$new$5(ServerState.java:160)
>         at org.apache.ratis.util.MemoizedSupplier.get(MemoizedSupplier.java:62)
>         at org.apache.ratis.server.impl.ServerState.initialize(ServerState.java:174)
>         at org.apache.ratis.server.impl.RaftServerImpl.start(RaftServerImpl.java:330)
>         at org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:173)
>         ... 4 common frames omitted
>  
> Test environment
> 1. 192.168.10.62/66/68: 3 ConfigNodes, 72 CPUs, 256 GB
> 192.168.10.62/66/68/64/73: 5 DataNodes
> Machine 73: 48 CPUs, 384 GB
> 2. Database configuration parameters
> COMMON configuration
> schema_replication_factor=3
> data_replication_factor=3
> data_region_consensus_protocol_class=org.apache.iotdb.consensus.ratis.RatisConsensus
> query_timeout_threshold=3600000
> ConfigNode configuration
> cn_connection_timeout_ms=120000
> MAX_HEAP_SIZE="8G"
> DataNode configuration
> MAX_HEAP_SIZE="192G"
> MAX_DIRECT_MEMORY_SIZE="32G"
> dn_max_connection_for_internal_service=300
> 3. For the BM (benchmark) configuration, see the attachment.
> Writing completed.
> 4. On ip73:
> stop-datanode.sh
> Clear the cache, then start the datanode.
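
The "Found a gap between logs" error in the trace above means that the index ranges of two consecutive closed log segments are not contiguous: one segment ends at 76501 and the next begins at 76618, so entries 76502 to 76617 are not covered by any segment file. The following is a minimal sketch of how one might scan a list of segment file names for such gaps while diagnosing the problem. It is not the Ratis implementation; the class name `LogSegmentGapCheck` and the method `findGaps` are hypothetical, and the only naming assumption is the `log-<start>_<end>` form seen in the error message (names in any other form are ignored).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical diagnostic helper, not part of Ratis or IoTDB.
public class LogSegmentGapCheck {
    // Closed segment names as they appear in the error message: log-<start>_<end>
    private static final Pattern CLOSED = Pattern.compile("log-(\\d+)_(\\d+)");

    /** Returns one message per gap between consecutive closed segments. */
    public static List<String> findGaps(List<String> names) {
        List<long[]> ranges = new ArrayList<>();
        for (String name : names) {
            Matcher m = CLOSED.matcher(name);
            if (m.matches()) {
                ranges.add(new long[] {Long.parseLong(m.group(1)), Long.parseLong(m.group(2))});
            }
        }
        // Order segments by start index before comparing neighbours.
        ranges.sort(Comparator.comparingLong(r -> r[0]));

        List<String> gaps = new ArrayList<>();
        for (int i = 1; i < ranges.size(); i++) {
            long prevEnd = ranges.get(i - 1)[1];
            long nextStart = ranges.get(i)[0];
            // Contiguous segments satisfy nextStart == prevEnd + 1.
            if (nextStart > prevEnd + 1) {
                gaps.add("missing indices " + (prevEnd + 1) + ".." + (nextStart - 1)
                        + " between segment ending at " + prevEnd
                        + " and segment starting at " + nextStart);
            }
        }
        return gaps;
    }

    public static void main(String[] args) {
        // The two segment names quoted in the error message.
        List<String> names = List.of("log-76356_76501", "log-76618_76772");
        findGaps(names).forEach(System.out::println);
    }
}
```

In practice the list of names would come from a directory listing of the raft log storage directory on the affected DataNode; the path is deployment-specific and is not given in this report.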



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
