[
https://issues.apache.org/jira/browse/HDDS-14078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Saketa Chalamchala updated HDDS-14078:
--------------------------------------
Description:
Triggered OM bootstrap on a lagging follower OM in the docker environment.
{code:java}
1. Start docker env with configs
OZONE-SITE.XML_ozone.om.ratis.log.purge.gap=50
OZONE-SITE.XML_ozone.om.ratis.segment.size=16KB
OZONE-SITE.XML_ozone.om.ratis.segment.preallocated.size=16KB
OZONE-SITE.XML_ozone.om.ratis.snapshot.auto.trigger.threshold=500
2. Stop a follower OM
3. Created 500 key metadata
ozone freon omkg -t 10 -n 500 -v vol1 -b bucket1
ozone sh snapshot create vol1/bucket1 snap-1
ozone freon omkg -t 10 -n 100 -v vol1 -b bucket1
ozone sh snapshot create vol1/bucket1 snap-2
4. Start the OM from Step2{code}
Install checkpoint never completes. Debug logging on the leader needs to be
improved.
Leader logs:
{code:java}
2025-12-04 16:46:48.440 | 2025-12-05 00:46:48,440
[om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send
om1->om3#0-t1,notify:(t:1, i:981)
2025-12-04 16:46:48.444 | 2025-12-05 00:46:48,443 [timer6] WARN
server.GrpcLogAppender: om1@group-D66704EFC61C->om3-AppendLogResponseHandler:
received INCONSISTENCY reply with nextIndex 1, errorCount=1,
request=AppendEntriesRequest:cid=120166906,entriesCount=0 (Repeated 639 times
in the last 5.001s)
2025-12-04 16:46:48.445 | 2025-12-05 00:46:48,445
[om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send
om1->om3#0-t1,notify:(t:1, i:981)
2025-12-04 16:46:48.446 | 2025-12-05 00:46:48,446 [timer7] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender:
notifyInstallSnapshot with firstAvailable=(t:1, i:981), followerNextIndex=1
(Repeated 1040 times in the last 5.001s)
2025-12-04 16:46:48.449 | 2025-12-05 00:46:48,448
[om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender:
notifyInstallSnapshot with firstAvailable=(t:1, i:981), followerNextIndex=1
2025-12-04 16:46:48.449 | 2025-12-05 00:46:48,448
[om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send
om1->om3#0-t1,notify:(t:1, i:981)
2025-12-04 16:46:48.449 | 2025-12-05 00:46:48,449 [grpc-default-executor-3]
WARN server.GrpcLogAppender:
om1@group-D66704EFC61C->om3-AppendLogResponseHandler: received INCONSISTENCY
reply with nextIndex 1, errorCount=1,
request=AppendEntriesRequest:cid=120166908,entriesCount=0
2025-12-04 16:46:48.452 | 2025-12-05 00:46:48,452
[om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send
om1->om3#0-t1,notify:(t:1, i:981)
2025-12-04 16:46:48.455 | 2025-12-05 00:46:48,455
[om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send
om1->om
...
2025-12-04 16:46:48.556 | 2025-12-05 00:46:48,555 [timer1] INFO
server.GrpcLogAppender:
om1@group-D66704EFC61C->om3-InstallSnapshotNotification: received a reply
om1<-om3#0:FAIL-t1,IN_PROGRESS,snapshotIndex=0 (Repeated 1070 times in the
last 5.001s)
2025-12-04 16:46:48.556 | 2025-12-05 00:46:48,556 [timer2] INFO
server.GrpcLogAppender:
om1@group-D66704EFC61C->om3-InstallSnapshotNotification: in progress,
(Repeated 1070 times in the last 5.002s)
2025-12-04 16:46:48.558 | 2025-12-05 00:46:48,558 [grpc-default-executor-3]
INFO server.GrpcLogAppender:
om1@group-D66704EFC61C->om3-InstallSnapshotNotification: received a reply
om1<-om3#0:FAIL-t1,IN_PROGRESS,snapshotIndex=0
2025-12-04 16:46:48.558 | 2025-12-05 00:46:48,558 [grpc-default-executor-3]
INFO server.GrpcLogAppender:
om1@group-D66704EFC61C->om3-InstallSnapshotNotification: in progress,
2025-12-04 16:46:48.558 | 2025-12-05 00:46:48,558
[om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send
om1->om3#0-t1,notify:(t:1, i:981)
...
2025-12-04 16:46:49.251 | 2025-12-05 00:46:49,251
[om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send
om1->om3#0-t1,notify:(t:1, i:981)
2025-12-04 16:46:49.285 | 2025-12-05 00:46:49,285 [grpc-default-executor-3]
WARN server.GrpcLogAppender:
om1@group-D66704EFC61C->om3-InstallSnapshotNotification: Failed InstallSnapshot
2025-12-04 16:46:49.285 | java.lang.IllegalStateException:
inProgressInstallSnapshotRequest: -1 is not eligible, firstAvailableLogIndex:
981
2025-12-04 16:46:49.285 | at
org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:65)
2025-12-04 16:46:49.285 | at
org.apache.ratis.server.impl.SnapshotInstallationHandler.notifyStateMachineToInstallSnapshot(SnapshotInstallationHandler.java:341)
2025-12-04 16:46:49.285 | at
org.apache.ratis.server.impl.SnapshotInstallationHandler.installSnapshotImpl(SnapshotInstallationHandler.java:137)
2025-12-04 16:46:49.285 | at
org.apache.ratis.server.impl.SnapshotInstallationHandler.installSnapshot(SnapshotInstallationHandler.java:107)
2025-12-04 16:46:49.285 | at
org.apache.ratis.server.impl.RaftServerImpl.installSnapshot(RaftServerImpl.java:1696)
2025-12-04 16:46:49.285 | at
org.apache.ratis.server.impl.RaftServerProxy.installSnapshot(RaftServerProxy.java:672)
2025-12-04 16:46:49.285 | at
org.apache.ratis.grpc.server.GrpcServerProtocolService$2.process(GrpcServerProtocolService.java:285)
2025-12-04 16:46:49.285 | at
org.apache.ratis.grpc.server.GrpcServerProtocolService$2.process(GrpcServerProtocolService.java:282)
2025-12-04 16:46:49.285 | at
org.apache.ratis.grpc.server.GrpcServerProtocolService$ServerRequestStreamObserver.onNext(GrpcServerProtocolService.java:152)
2025-12-04 16:46:49.285 | at
org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:262)
2025-12-04 16:46:49.285 | at
org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:334)
2025-12-04 16:46:49.285 | at
org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:319)
2025-12-04 16:46:49.285 | at
org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:834)
2025-12-04 16:46:49.285 | at
org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
2025-12-04 16:46:49.285 | at
org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
2025-12-04 16:46:49.285 | at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
2025-12-04 16:46:49.285 | at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
2025-12-04 16:46:49.285 | at
java.base/java.lang.Thread.run(Thread.java:1583)
2025-12-04 16:46:49.285 | 2025-12-05 00:46:49,285 [grpc-default-executor-3]
WARN server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender:
Follower failed (request=null, errorCount=2); keep nextIndex (1) unchanged and
retry. {code}
Follower logs:
{code:java}
Follower logs keep repeating: 2025-12-04 16:46:43.427 | 2025-12-05 00:46:43,426
[grpc-default-executor-0] INFO server.RaftServer$Division:
om3@group-D66704EFC61C: Failed appendEntries as previous log entry ((t:1,
i:1010)) is not found 2025-12-04 16:46:43.434 | 2025-12-05 00:46:43,433
[grpc-default-executor-0] INFO server.RaftServer$Division:
om3@group-D66704EFC61C: appendEntries* reply
om1<-om3#120165868:FAIL-t1,INCONSISTENCY,nextIndex=1,followerCommit=0,matchIndex=-1
2025-12-04 16:46:43.467 | 2025-12-05 00:46:43,466 [grpc-default-executor-0]
INFO impl.SnapshotInstallationHandler: om3@group-D66704EFC61C: receive
installSnapshot: om1->om3#0-t1,notify:(t:1, i:981) 2025-12-04 16:46:43.468 |
2025-12-05 00:46:43,468 [grpc-default-executor-0] INFO
impl.SnapshotInstallationHandler: om3@group-D66704EFC61C: Received notification
to install snapshot at index 981 2025-12-04 16:46:43.474 | 2025-12-05
00:46:43,473 [grpc-default-executor-0] INFO impl.SnapshotInstallationHandler:
om3@group-D66704EFC61C: notifyInstallSnapshot: nextIndex is 1 but the leader's
first available index is 981. 2025-12-04 16:46:43.474 | 2025-12-05 00:46:43,473
[grpc-default-executor-0] INFO ratis.OzoneManagerStateMachine: Received install
snapshot notification from OM leader: om1 with term index: (t:1, i:981)
2025-12-04 16:46:43.475 | 2025-12-05 00:46:43,474 [om3-InstallSnapshotThread]
INFO utils.RDBSnapshotProvider: Prepare to download the snapshot from leader OM
om1 and reloading state from the snapshot. 2025-12-04 16:46:43.481 | 2025-12-05
00:46:43,480 [om3-InstallSnapshotThread] INFO
ratis_snapshot.OmRatisSnapshotProvider: Downloading latest checkpoint from
Leader OM om1. Checkpoint: om.db-om1-1764895603475.tar URL:
http://om1:9874/v2/dbCheckpoint?includeSnapshotData=true&flushBeforeCheckpoint=true
2025-12-04 16:46:43.514 | 2025-12-05 00:46:43,514 [om3-InstallSnapshotThread]
ERROR impl.SnapshotInstallationHandler: om3@group-D66704EFC61C: Failed to
notify StateMachine to InstallSnapshot. Exception:
java.lang.reflect.UndeclaredThrowableException 2025-12-04 16:46:43.545 |
2025-12-05 00:46:43,545 [grpc-default-executor-0] INFO
impl.SnapshotInstallationHandler: om3@group-D66704EFC61C: set new configuration
configurationEntry { peers { id: "om1" address: "om1:9872" startupRole:
FOLLOWER } peers { id: "om3" address: "om3:9872" startupRole: FOLLOWER } peers
{ id: "om2" address: "om2:9872" startupRole: FOLLOWER } } from snapshot
2025-12-04 16:46:43.545 | 2025-12-05 00:46:43,545 [grpc-default-executor-0]
INFO server.RaftServer$Division: om3@group-D66704EFC61C: set configuration
conf: {index: 0, cur=peers:[om1|om1:9872, om3|om3:9872,
om2|om2:9872]|listeners:[], old=null} 2025-12-04 16:46:43.549 | 2025-12-05
00:46:43,549 [grpc-default-executor-0] INFO ratis.OzoneManagerStateMachine:
notifyConfigurationChanged from Ratis: term=0, index=0, New Peer list:
om1(om1:9872), om3(om3:9872), om2(om2:9872), New Listener list 2025-12-04
16:46:43.549 | 2025-12-05 00:46:43,549 [grpc-default-executor-0] INFO
impl.SnapshotInstallationHandler: om3@group-D66704EFC61C: reply
installSnapshot: om1<-om3#0:FAIL-t1,IN_PROGRESS,snapshotIndex=0 2025-12-04
16:46:43.551 | 2025-12-05 00:46:43,551 [grpc-default-executor-0] INFO
server.GrpcServerProtocolService: om3: Completed INSTALL_SNAPSHOT, lastRequest:
om1->om3#0-t1,notify:(t:1, i:981) 2025-12-04 16:46:43.552 | 2025-12-05
00:46:43,552 [grpc-default-executor-0] INFO server.GrpcServerProtocolService:
om3: Completed INSTALL_SNAPSHOT, lastReply: null {code}
was:
Triggered OM bootstrap on a lagging follower OM in the docker environment.
Install checkpoint never completes. Debug logging on the leader needs to be
improved.
```
Leader logs:
2025-12-04 16:46:48.440 | 2025-12-05 00:46:48,440
[om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send
om1->om3#0-t1,notify:(t:1, i:981)
2025-12-04 16:46:48.444 | 2025-12-05 00:46:48,443 [timer6] WARN
server.GrpcLogAppender: om1@group-D66704EFC61C->om3-AppendLogResponseHandler:
received INCONSISTENCY reply with nextIndex 1, errorCount=1,
request=AppendEntriesRequest:cid=120166906,entriesCount=0 (Repeated 639 times
in the last 5.001s)
2025-12-04 16:46:48.445 | 2025-12-05 00:46:48,445
[om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send
om1->om3#0-t1,notify:(t:1, i:981)
2025-12-04 16:46:48.446 | 2025-12-05 00:46:48,446 [timer7] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender:
notifyInstallSnapshot with firstAvailable=(t:1, i:981), followerNextIndex=1
(Repeated 1040 times in the last 5.001s)
2025-12-04 16:46:48.449 | 2025-12-05 00:46:48,448
[om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender:
notifyInstallSnapshot with firstAvailable=(t:1, i:981), followerNextIndex=1
2025-12-04 16:46:48.449 | 2025-12-05 00:46:48,448
[om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send
om1->om3#0-t1,notify:(t:1, i:981)
2025-12-04 16:46:48.449 | 2025-12-05 00:46:48,449 [grpc-default-executor-3]
WARN server.GrpcLogAppender:
om1@group-D66704EFC61C->om3-AppendLogResponseHandler: received INCONSISTENCY
reply with nextIndex 1, errorCount=1,
request=AppendEntriesRequest:cid=120166908,entriesCount=0
2025-12-04 16:46:48.452 | 2025-12-05 00:46:48,452
[om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send
om1->om3#0-t1,notify:(t:1, i:981)
2025-12-04 16:46:48.455 | 2025-12-05 00:46:48,455
[om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send
om1->om
...
2025-12-04 16:46:48.556 | 2025-12-05 00:46:48,555 [timer1] INFO
server.GrpcLogAppender:
om1@group-D66704EFC61C->om3-InstallSnapshotNotification: received a reply
om1<-om3#0:FAIL-t1,IN_PROGRESS,snapshotIndex=0 (Repeated 1070 times in the
last 5.001s)
2025-12-04 16:46:48.556 | 2025-12-05 00:46:48,556 [timer2] INFO
server.GrpcLogAppender:
om1@group-D66704EFC61C->om3-InstallSnapshotNotification: in progress,
(Repeated 1070 times in the last 5.002s)
2025-12-04 16:46:48.558 | 2025-12-05 00:46:48,558 [grpc-default-executor-3]
INFO server.GrpcLogAppender:
om1@group-D66704EFC61C->om3-InstallSnapshotNotification: received a reply
om1<-om3#0:FAIL-t1,IN_PROGRESS,snapshotIndex=0
2025-12-04 16:46:48.558 | 2025-12-05 00:46:48,558 [grpc-default-executor-3]
INFO server.GrpcLogAppender:
om1@group-D66704EFC61C->om3-InstallSnapshotNotification: in progress,
2025-12-04 16:46:48.558 | 2025-12-05 00:46:48,558
[om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send
om1->om3#0-t1,notify:(t:1, i:981)
...
2025-12-04 16:46:49.251 | 2025-12-05 00:46:49,251
[om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send
om1->om3#0-t1,notify:(t:1, i:981)
2025-12-04 16:46:49.285 | 2025-12-05 00:46:49,285 [grpc-default-executor-3]
WARN server.GrpcLogAppender:
om1@group-D66704EFC61C->om3-InstallSnapshotNotification: Failed InstallSnapshot
2025-12-04 16:46:49.285 | java.lang.IllegalStateException:
inProgressInstallSnapshotRequest: -1 is not eligible, firstAvailableLogIndex:
981
2025-12-04 16:46:49.285 | at
org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:65)
2025-12-04 16:46:49.285 | at
org.apache.ratis.server.impl.SnapshotInstallationHandler.notifyStateMachineToInstallSnapshot(SnapshotInstallationHandler.java:341)
2025-12-04 16:46:49.285 | at
org.apache.ratis.server.impl.SnapshotInstallationHandler.installSnapshotImpl(SnapshotInstallationHandler.java:137)
2025-12-04 16:46:49.285 | at
org.apache.ratis.server.impl.SnapshotInstallationHandler.installSnapshot(SnapshotInstallationHandler.java:107)
2025-12-04 16:46:49.285 | at
org.apache.ratis.server.impl.RaftServerImpl.installSnapshot(RaftServerImpl.java:1696)
2025-12-04 16:46:49.285 | at
org.apache.ratis.server.impl.RaftServerProxy.installSnapshot(RaftServerProxy.java:672)
2025-12-04 16:46:49.285 | at
org.apache.ratis.grpc.server.GrpcServerProtocolService$2.process(GrpcServerProtocolService.java:285)
2025-12-04 16:46:49.285 | at
org.apache.ratis.grpc.server.GrpcServerProtocolService$2.process(GrpcServerProtocolService.java:282)
2025-12-04 16:46:49.285 | at
org.apache.ratis.grpc.server.GrpcServerProtocolService$ServerRequestStreamObserver.onNext(GrpcServerProtocolService.java:152)
2025-12-04 16:46:49.285 | at
org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:262)
2025-12-04 16:46:49.285 | at
org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:334)
2025-12-04 16:46:49.285 | at
org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:319)
2025-12-04 16:46:49.285 | at
org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:834)
2025-12-04 16:46:49.285 | at
org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
2025-12-04 16:46:49.285 | at
org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
2025-12-04 16:46:49.285 | at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
2025-12-04 16:46:49.285 | at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
2025-12-04 16:46:49.285 | at
java.base/java.lang.Thread.run(Thread.java:1583)
2025-12-04 16:46:49.285 | 2025-12-05 00:46:49,285 [grpc-default-executor-3]
WARN server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender:
Follower failed (request=null, errorCount=2); keep nextIndex (1) unchanged and
retry.
```
```
Follower logs keep repeating:
2025-12-04 16:46:43.427 | 2025-12-05 00:46:43,426 [grpc-default-executor-0]
INFO server.RaftServer$Division: om3@group-D66704EFC61C: Failed appendEntries
as previous log entry ((t:1, i:1010)) is not found
2025-12-04 16:46:43.434 | 2025-12-05 00:46:43,433 [grpc-default-executor-0]
INFO server.RaftServer$Division: om3@group-D66704EFC61C: appendEntries* reply
om1<-om3#120165868:FAIL-t1,INCONSISTENCY,nextIndex=1,followerCommit=0,matchIndex=-1
2025-12-04 16:46:43.467 | 2025-12-05 00:46:43,466 [grpc-default-executor-0]
INFO impl.SnapshotInstallationHandler: om3@group-D66704EFC61C: receive
installSnapshot: om1->om3#0-t1,notify:(t:1, i:981)
2025-12-04 16:46:43.468 | 2025-12-05 00:46:43,468 [grpc-default-executor-0]
INFO impl.SnapshotInstallationHandler: om3@group-D66704EFC61C: Received
notification to install snapshot at index 981
2025-12-04 16:46:43.474 | 2025-12-05 00:46:43,473 [grpc-default-executor-0]
INFO impl.SnapshotInstallationHandler: om3@group-D66704EFC61C:
notifyInstallSnapshot: nextIndex is 1 but the leader's first available index is
981.
2025-12-04 16:46:43.474 | 2025-12-05 00:46:43,473 [grpc-default-executor-0]
INFO ratis.OzoneManagerStateMachine: Received install snapshot notification
from OM leader: om1 with term index: (t:1, i:981)
2025-12-04 16:46:43.475 | 2025-12-05 00:46:43,474 [om3-InstallSnapshotThread]
INFO utils.RDBSnapshotProvider: Prepare to download the snapshot from leader OM
om1 and reloading state from the snapshot.
2025-12-04 16:46:43.481 | 2025-12-05 00:46:43,480 [om3-InstallSnapshotThread]
INFO ratis_snapshot.OmRatisSnapshotProvider: Downloading latest checkpoint from
Leader OM om1. Checkpoint: om.db-om1-1764895603475.tar URL:
http://om1:9874/v2/dbCheckpoint?includeSnapshotData=true&flushBeforeCheckpoint=true
2025-12-04 16:46:43.514 | 2025-12-05 00:46:43,514 [om3-InstallSnapshotThread]
ERROR impl.SnapshotInstallationHandler: om3@group-D66704EFC61C: Failed to
notify StateMachine to InstallSnapshot. Exception:
java.lang.reflect.UndeclaredThrowableException
2025-12-04 16:46:43.545 | 2025-12-05 00:46:43,545 [grpc-default-executor-0]
INFO impl.SnapshotInstallationHandler: om3@group-D66704EFC61C: set new
configuration configurationEntry \{ peers { id: "om1" address: "om1:9872"
startupRole: FOLLOWER } peers \{ id: "om3" address: "om3:9872" startupRole:
FOLLOWER } peers \{ id: "om2" address: "om2:9872" startupRole: FOLLOWER } }
from snapshot
2025-12-04 16:46:43.545 | 2025-12-05 00:46:43,545 [grpc-default-executor-0]
INFO server.RaftServer$Division: om3@group-D66704EFC61C: set configuration
conf: \{index: 0, cur=peers:[om1|om1:9872, om3|om3:9872,
om2|om2:9872]|listeners:[], old=null}
2025-12-04 16:46:43.549 | 2025-12-05 00:46:43,549 [grpc-default-executor-0]
INFO ratis.OzoneManagerStateMachine: notifyConfigurationChanged from Ratis:
term=0, index=0, New Peer list: om1(om1:9872), om3(om3:9872), om2(om2:9872),
New Listener list
2025-12-04 16:46:43.549 | 2025-12-05 00:46:43,549 [grpc-default-executor-0]
INFO impl.SnapshotInstallationHandler: om3@group-D66704EFC61C: reply
installSnapshot: om1<-om3#0:FAIL-t1,IN_PROGRESS,snapshotIndex=0
2025-12-04 16:46:43.551 | 2025-12-05 00:46:43,551 [grpc-default-executor-0]
INFO server.GrpcServerProtocolService: om3: Completed INSTALL_SNAPSHOT,
lastRequest: om1->om3#0-t1,notify:(t:1, i:981)
2025-12-04 16:46:43.552 | 2025-12-05 00:46:43,552 [grpc-default-executor-0]
INFO server.GrpcServerProtocolService: om3: Completed INSTALL_SNAPSHOT,
lastReply: null
```
> Install checkpoint on lagging OM fails during bootstrap
> -------------------------------------------------------
>
> Key: HDDS-14078
> URL: https://issues.apache.org/jira/browse/HDDS-14078
> Project: Apache Ozone
> Issue Type: Sub-task
> Components: Ozone Manager
> Reporter: Saketa Chalamchala
> Priority: Major
>
> Triggered OM bootstrap on a lagging follower OM in the docker environment.
> {code:java}
> 1. Start docker env with configs
> OZONE-SITE.XML_ozone.om.ratis.log.purge.gap=50
> OZONE-SITE.XML_ozone.om.ratis.segment.size=16KB
> OZONE-SITE.XML_ozone.om.ratis.segment.preallocated.size=16KB
> OZONE-SITE.XML_ozone.om.ratis.snapshot.auto.trigger.threshold=500
> 2. Stop a follower OM
> 3. Created 500 key metadata
> ozone freon omkg -t 10 -n 500 -v vol1 -b bucket1
> ozone sh snapshot create vol1/bucket1 snap-1
> ozone freon omkg -t 10 -n 100 -v vol1 -b bucket1
> ozone sh snapshot create vol1/bucket1 snap-2
> 4. Start the OM from Step2{code}
> Install checkpoint never completes. Debug logging on the leader needs to be
> improved.
> Leader logs:
> {code:java}
> 2025-12-04 16:46:48.440 | 2025-12-05 00:46:48,440
> [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
> server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send
> om1->om3#0-t1,notify:(t:1, i:981)
> 2025-12-04 16:46:48.444 | 2025-12-05 00:46:48,443 [timer6] WARN
> server.GrpcLogAppender: om1@group-D66704EFC61C->om3-AppendLogResponseHandler:
> received INCONSISTENCY reply with nextIndex 1, errorCount=1,
> request=AppendEntriesRequest:cid=120166906,entriesCount=0 (Repeated 639
> times in the last 5.001s)
> 2025-12-04 16:46:48.445 | 2025-12-05 00:46:48,445
> [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
> server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send
> om1->om3#0-t1,notify:(t:1, i:981)
> 2025-12-04 16:46:48.446 | 2025-12-05 00:46:48,446 [timer7] INFO
> server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender:
> notifyInstallSnapshot with firstAvailable=(t:1, i:981), followerNextIndex=1
> (Repeated 1040 times in the last 5.001s)
> 2025-12-04 16:46:48.449 | 2025-12-05 00:46:48,448
> [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
> server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender:
> notifyInstallSnapshot with firstAvailable=(t:1, i:981), followerNextIndex=1
> 2025-12-04 16:46:48.449 | 2025-12-05 00:46:48,448
> [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
> server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send
> om1->om3#0-t1,notify:(t:1, i:981)
> 2025-12-04 16:46:48.449 | 2025-12-05 00:46:48,449 [grpc-default-executor-3]
> WARN server.GrpcLogAppender:
> om1@group-D66704EFC61C->om3-AppendLogResponseHandler: received INCONSISTENCY
> reply with nextIndex 1, errorCount=1,
> request=AppendEntriesRequest:cid=120166908,entriesCount=0
> 2025-12-04 16:46:48.452 | 2025-12-05 00:46:48,452
> [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
> server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send
> om1->om3#0-t1,notify:(t:1, i:981)
> 2025-12-04 16:46:48.455 | 2025-12-05 00:46:48,455
> [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
> server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send
> om1->om
> ...
> 2025-12-04 16:46:48.556 | 2025-12-05 00:46:48,555 [timer1] INFO
> server.GrpcLogAppender:
> om1@group-D66704EFC61C->om3-InstallSnapshotNotification: received a reply
> om1<-om3#0:FAIL-t1,IN_PROGRESS,snapshotIndex=0 (Repeated 1070 times in the
> last 5.001s)
> 2025-12-04 16:46:48.556 | 2025-12-05 00:46:48,556 [timer2] INFO
> server.GrpcLogAppender:
> om1@group-D66704EFC61C->om3-InstallSnapshotNotification: in progress,
> (Repeated 1070 times in the last 5.002s)
> 2025-12-04 16:46:48.558 | 2025-12-05 00:46:48,558 [grpc-default-executor-3]
> INFO server.GrpcLogAppender:
> om1@group-D66704EFC61C->om3-InstallSnapshotNotification: received a reply
> om1<-om3#0:FAIL-t1,IN_PROGRESS,snapshotIndex=0
> 2025-12-04 16:46:48.558 | 2025-12-05 00:46:48,558 [grpc-default-executor-3]
> INFO server.GrpcLogAppender:
> om1@group-D66704EFC61C->om3-InstallSnapshotNotification: in progress,
> 2025-12-04 16:46:48.558 | 2025-12-05 00:46:48,558
> [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
> server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send
> om1->om3#0-t1,notify:(t:1, i:981)
> ...
> 2025-12-04 16:46:49.251 | 2025-12-05 00:46:49,251
> [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO
> server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send
> om1->om3#0-t1,notify:(t:1, i:981)
> 2025-12-04 16:46:49.285 | 2025-12-05 00:46:49,285 [grpc-default-executor-3]
> WARN server.GrpcLogAppender:
> om1@group-D66704EFC61C->om3-InstallSnapshotNotification: Failed
> InstallSnapshot
> 2025-12-04 16:46:49.285 | java.lang.IllegalStateException:
> inProgressInstallSnapshotRequest: -1 is not eligible, firstAvailableLogIndex:
> 981
> 2025-12-04 16:46:49.285 | at
> org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:65)
> 2025-12-04 16:46:49.285 | at
> org.apache.ratis.server.impl.SnapshotInstallationHandler.notifyStateMachineToInstallSnapshot(SnapshotInstallationHandler.java:341)
> 2025-12-04 16:46:49.285 | at
> org.apache.ratis.server.impl.SnapshotInstallationHandler.installSnapshotImpl(SnapshotInstallationHandler.java:137)
> 2025-12-04 16:46:49.285 | at
> org.apache.ratis.server.impl.SnapshotInstallationHandler.installSnapshot(SnapshotInstallationHandler.java:107)
> 2025-12-04 16:46:49.285 | at
> org.apache.ratis.server.impl.RaftServerImpl.installSnapshot(RaftServerImpl.java:1696)
> 2025-12-04 16:46:49.285 | at
> org.apache.ratis.server.impl.RaftServerProxy.installSnapshot(RaftServerProxy.java:672)
> 2025-12-04 16:46:49.285 | at
> org.apache.ratis.grpc.server.GrpcServerProtocolService$2.process(GrpcServerProtocolService.java:285)
> 2025-12-04 16:46:49.285 | at
> org.apache.ratis.grpc.server.GrpcServerProtocolService$2.process(GrpcServerProtocolService.java:282)
> 2025-12-04 16:46:49.285 | at
> org.apache.ratis.grpc.server.GrpcServerProtocolService$ServerRequestStreamObserver.onNext(GrpcServerProtocolService.java:152)
> 2025-12-04 16:46:49.285 | at
> org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:262)
> 2025-12-04 16:46:49.285 | at
> org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:334)
> 2025-12-04 16:46:49.285 | at
> org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:319)
> 2025-12-04 16:46:49.285 | at
> org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:834)
> 2025-12-04 16:46:49.285 | at
> org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
> 2025-12-04 16:46:49.285 | at
> org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
> 2025-12-04 16:46:49.285 | at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
> 2025-12-04 16:46:49.285 | at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
> 2025-12-04 16:46:49.285 | at
> java.base/java.lang.Thread.run(Thread.java:1583)
> 2025-12-04 16:46:49.285 | 2025-12-05 00:46:49,285 [grpc-default-executor-3]
> WARN server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender:
> Follower failed (request=null, errorCount=2); keep nextIndex (1) unchanged
> and retry. {code}
> Follower logs:
> {code:java}
> Follower logs keep repeating: 2025-12-04 16:46:43.427 | 2025-12-05
> 00:46:43,426 [grpc-default-executor-0] INFO server.RaftServer$Division:
> om3@group-D66704EFC61C: Failed appendEntries as previous log entry ((t:1,
> i:1010)) is not found 2025-12-04 16:46:43.434 | 2025-12-05 00:46:43,433
> [grpc-default-executor-0] INFO server.RaftServer$Division:
> om3@group-D66704EFC61C: appendEntries* reply
> om1<-om3#120165868:FAIL-t1,INCONSISTENCY,nextIndex=1,followerCommit=0,matchIndex=-1
> 2025-12-04 16:46:43.467 | 2025-12-05 00:46:43,466 [grpc-default-executor-0]
> INFO impl.SnapshotInstallationHandler: om3@group-D66704EFC61C: receive
> installSnapshot: om1->om3#0-t1,notify:(t:1, i:981) 2025-12-04 16:46:43.468 |
> 2025-12-05 00:46:43,468 [grpc-default-executor-0] INFO
> impl.SnapshotInstallationHandler: om3@group-D66704EFC61C: Received
> notification to install snapshot at index 981 2025-12-04 16:46:43.474 |
> 2025-12-05 00:46:43,473 [grpc-default-executor-0] INFO
> impl.SnapshotInstallationHandler: om3@group-D66704EFC61C:
> notifyInstallSnapshot: nextIndex is 1 but the leader's first available index
> is 981. 2025-12-04 16:46:43.474 | 2025-12-05 00:46:43,473
> [grpc-default-executor-0] INFO ratis.OzoneManagerStateMachine: Received
> install snapshot notification from OM leader: om1 with term index: (t:1,
> i:981) 2025-12-04 16:46:43.475 | 2025-12-05 00:46:43,474
> [om3-InstallSnapshotThread] INFO utils.RDBSnapshotProvider: Prepare to
> download the snapshot from leader OM om1 and reloading state from the
> snapshot. 2025-12-04 16:46:43.481 | 2025-12-05 00:46:43,480
> [om3-InstallSnapshotThread] INFO ratis_snapshot.OmRatisSnapshotProvider:
> Downloading latest checkpoint from Leader OM om1. Checkpoint:
> om.db-om1-1764895603475.tar URL:
> http://om1:9874/v2/dbCheckpoint?includeSnapshotData=true&flushBeforeCheckpoint=true
> 2025-12-04 16:46:43.514 | 2025-12-05 00:46:43,514
> [om3-InstallSnapshotThread] ERROR impl.SnapshotInstallationHandler:
> om3@group-D66704EFC61C: Failed to notify StateMachine to InstallSnapshot.
> Exception: java.lang.reflect.UndeclaredThrowableException 2025-12-04
> 16:46:43.545 | 2025-12-05 00:46:43,545 [grpc-default-executor-0] INFO
> impl.SnapshotInstallationHandler: om3@group-D66704EFC61C: set new
> configuration configurationEntry { peers { id: "om1" address: "om1:9872"
> startupRole: FOLLOWER } peers { id: "om3" address: "om3:9872" startupRole:
> FOLLOWER } peers { id: "om2" address: "om2:9872" startupRole: FOLLOWER } }
> from snapshot 2025-12-04 16:46:43.545 | 2025-12-05 00:46:43,545
> [grpc-default-executor-0] INFO server.RaftServer$Division:
> om3@group-D66704EFC61C: set configuration conf: {index: 0,
> cur=peers:[om1|om1:9872, om3|om3:9872, om2|om2:9872]|listeners:[], old=null}
> 2025-12-04 16:46:43.549 | 2025-12-05 00:46:43,549 [grpc-default-executor-0]
> INFO ratis.OzoneManagerStateMachine: notifyConfigurationChanged from Ratis:
> term=0, index=0, New Peer list: om1(om1:9872), om3(om3:9872), om2(om2:9872),
> New Listener list 2025-12-04 16:46:43.549 | 2025-12-05 00:46:43,549
> [grpc-default-executor-0] INFO impl.SnapshotInstallationHandler:
> om3@group-D66704EFC61C: reply installSnapshot:
> om1<-om3#0:FAIL-t1,IN_PROGRESS,snapshotIndex=0 2025-12-04 16:46:43.551 |
> 2025-12-05 00:46:43,551 [grpc-default-executor-0] INFO
> server.GrpcServerProtocolService: om3: Completed INSTALL_SNAPSHOT,
> lastRequest: om1->om3#0-t1,notify:(t:1, i:981) 2025-12-04 16:46:43.552 |
> 2025-12-05 00:46:43,552 [grpc-default-executor-0] INFO
> server.GrpcServerProtocolService: om3: Completed INSTALL_SNAPSHOT, lastReply:
> null {code}
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]