[ 
https://issues.apache.org/jira/browse/HDDS-14078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18043176#comment-18043176
 ] 

Saketa Chalamchala commented on HDDS-14078:
-------------------------------------------

Cannot reproduce this issue on an unsecure docker environment. This is only 
observed on the secure-ha docker environment. 

The log `LOG.info("Received GET request to obtain DB checkpoint snapshot");` is 
also not observed on the leader in the secure environment so, this is most 
likely an issue with docker configs and authentication setup for the bootstrap 
API behind the secure OM UI endpoint.

> Install checkpoint on lagging OM fails during bootstrap
> -------------------------------------------------------
>
>                 Key: HDDS-14078
>                 URL: https://issues.apache.org/jira/browse/HDDS-14078
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: Ozone Manager
>            Reporter: Saketa Chalamchala
>            Priority: Major
>         Attachments: om-follower.log, om_leader.log
>
>
> Triggered OM bootstrap on a lagging follower OM in the docker environment.
> {code:java}
> 1. Start docker env with configs
>   OZONE-SITE.XML_ozone.om.ratis.log.purge.gap=50
>   OZONE-SITE.XML_ozone.om.ratis.segment.size=16KB
>   OZONE-SITE.XML_ozone.om.ratis.segment.preallocated.size=16KB
>   OZONE-SITE.XML_ozone.om.ratis.snapshot.auto.trigger.threshold=500
> 2. Stop a follower OM
> 3. Created 500 key metadata
>    ozone freon omkg -t 10 -n 500 -v vol1 -b bucket1
>    ozone sh snapshot create vol1/bucket1 snap-1
>    ozone freon omkg -t 10 -n 100 -v vol1 -b bucket1
>    ozone sh snapshot create vol1/bucket1 snap-2
> 4. Start the OM from Step2{code}
> Install checkpoint never completes. Debug logging on the leader needs to be 
> improved.
> Leader logs:
> {code:java}
> 2025-12-04 16:46:48.440 | 2025-12-05 00:46:48,440 
> [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO 
> server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send 
> om1->om3#0-t1,notify:(t:1, i:981)
> 2025-12-04 16:46:48.444 | 2025-12-05 00:46:48,443 [timer6] WARN 
> server.GrpcLogAppender: om1@group-D66704EFC61C->om3-AppendLogResponseHandler: 
> received INCONSISTENCY reply with nextIndex 1, errorCount=1, 
> request=AppendEntriesRequest:cid=120166906,entriesCount=0  (Repeated 639 
> times in the last 5.001s)
> 2025-12-04 16:46:48.445 | 2025-12-05 00:46:48,445 
> [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO 
> server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send 
> om1->om3#0-t1,notify:(t:1, i:981)
> 2025-12-04 16:46:48.446 | 2025-12-05 00:46:48,446 [timer7] INFO 
> server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: 
> notifyInstallSnapshot with firstAvailable=(t:1, i:981), followerNextIndex=1  
> (Repeated 1040 times in the last 5.001s)
> 2025-12-04 16:46:48.449 | 2025-12-05 00:46:48,448 
> [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO 
> server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: 
> notifyInstallSnapshot with firstAvailable=(t:1, i:981), followerNextIndex=1 
> 2025-12-04 16:46:48.449 | 2025-12-05 00:46:48,448 
> [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO 
> server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send 
> om1->om3#0-t1,notify:(t:1, i:981)
> 2025-12-04 16:46:48.449 | 2025-12-05 00:46:48,449 [grpc-default-executor-3] 
> WARN server.GrpcLogAppender: 
> om1@group-D66704EFC61C->om3-AppendLogResponseHandler: received INCONSISTENCY 
> reply with nextIndex 1, errorCount=1, 
> request=AppendEntriesRequest:cid=120166908,entriesCount=0 
> 2025-12-04 16:46:48.452 | 2025-12-05 00:46:48,452 
> [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO 
> server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send 
> om1->om3#0-t1,notify:(t:1, i:981)
> 2025-12-04 16:46:48.455 | 2025-12-05 00:46:48,455 
> [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO 
> server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send 
> om1->om
> ...
> 2025-12-04 16:46:48.556 | 2025-12-05 00:46:48,555 [timer1] INFO 
> server.GrpcLogAppender: 
> om1@group-D66704EFC61C->om3-InstallSnapshotNotification: received a reply 
> om1<-om3#0:FAIL-t1,IN_PROGRESS,snapshotIndex=0  (Repeated 1070 times in the 
> last 5.001s)
> 2025-12-04 16:46:48.556 | 2025-12-05 00:46:48,556 [timer2] INFO 
> server.GrpcLogAppender: 
> om1@group-D66704EFC61C->om3-InstallSnapshotNotification: in progress,  
> (Repeated 1070 times in the last 5.002s)
> 2025-12-04 16:46:48.558 | 2025-12-05 00:46:48,558 [grpc-default-executor-3] 
> INFO server.GrpcLogAppender: 
> om1@group-D66704EFC61C->om3-InstallSnapshotNotification: received a reply 
> om1<-om3#0:FAIL-t1,IN_PROGRESS,snapshotIndex=0 
> 2025-12-04 16:46:48.558 | 2025-12-05 00:46:48,558 [grpc-default-executor-3] 
> INFO server.GrpcLogAppender: 
> om1@group-D66704EFC61C->om3-InstallSnapshotNotification: in progress, 
> 2025-12-04 16:46:48.558 | 2025-12-05 00:46:48,558 
> [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO 
> server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send 
> om1->om3#0-t1,notify:(t:1, i:981)
> ...
> 2025-12-04 16:46:49.251 | 2025-12-05 00:46:49,251 
> [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO 
> server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send 
> om1->om3#0-t1,notify:(t:1, i:981)
> 2025-12-04 16:46:49.285 | 2025-12-05 00:46:49,285 [grpc-default-executor-3] 
> WARN server.GrpcLogAppender: 
> om1@group-D66704EFC61C->om3-InstallSnapshotNotification: Failed 
> InstallSnapshot
> 2025-12-04 16:46:49.285 | java.lang.IllegalStateException: 
> inProgressInstallSnapshotRequest: -1 is not eligible, firstAvailableLogIndex: 
> 981
> 2025-12-04 16:46:49.285 |     at 
> org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:65)
> 2025-12-04 16:46:49.285 |     at 
> org.apache.ratis.server.impl.SnapshotInstallationHandler.notifyStateMachineToInstallSnapshot(SnapshotInstallationHandler.java:341)
> 2025-12-04 16:46:49.285 |     at 
> org.apache.ratis.server.impl.SnapshotInstallationHandler.installSnapshotImpl(SnapshotInstallationHandler.java:137)
> 2025-12-04 16:46:49.285 |     at 
> org.apache.ratis.server.impl.SnapshotInstallationHandler.installSnapshot(SnapshotInstallationHandler.java:107)
> 2025-12-04 16:46:49.285 |     at 
> org.apache.ratis.server.impl.RaftServerImpl.installSnapshot(RaftServerImpl.java:1696)
> 2025-12-04 16:46:49.285 |     at 
> org.apache.ratis.server.impl.RaftServerProxy.installSnapshot(RaftServerProxy.java:672)
> 2025-12-04 16:46:49.285 |     at 
> org.apache.ratis.grpc.server.GrpcServerProtocolService$2.process(GrpcServerProtocolService.java:285)
> 2025-12-04 16:46:49.285 |     at 
> org.apache.ratis.grpc.server.GrpcServerProtocolService$2.process(GrpcServerProtocolService.java:282)
> 2025-12-04 16:46:49.285 |     at 
> org.apache.ratis.grpc.server.GrpcServerProtocolService$ServerRequestStreamObserver.onNext(GrpcServerProtocolService.java:152)
> 2025-12-04 16:46:49.285 |     at 
> org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:262)
> 2025-12-04 16:46:49.285 |     at 
> org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:334)
> 2025-12-04 16:46:49.285 |     at 
> org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:319)
> 2025-12-04 16:46:49.285 |     at 
> org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:834)
> 2025-12-04 16:46:49.285 |     at 
> org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
> 2025-12-04 16:46:49.285 |     at 
> org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
> 2025-12-04 16:46:49.285 |     at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
> 2025-12-04 16:46:49.285 |     at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
> 2025-12-04 16:46:49.285 |     at 
> java.base/java.lang.Thread.run(Thread.java:1583)
> 2025-12-04 16:46:49.285 | 2025-12-05 00:46:49,285 [grpc-default-executor-3] 
> WARN server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: 
> Follower failed (request=null, errorCount=2); keep nextIndex (1) unchanged 
> and retry. {code}
> Follower logs:
> {code:java}
> Follower logs keep repeating: 2025-12-04 16:46:43.427 | 2025-12-05 
> 00:46:43,426 [grpc-default-executor-0] INFO server.RaftServer$Division: 
> om3@group-D66704EFC61C: Failed appendEntries as previous log entry ((t:1, 
> i:1010)) is not found 2025-12-04 16:46:43.434 | 2025-12-05 00:46:43,433 
> [grpc-default-executor-0] INFO server.RaftServer$Division: 
> om3@group-D66704EFC61C: appendEntries* reply 
> om1<-om3#120165868:FAIL-t1,INCONSISTENCY,nextIndex=1,followerCommit=0,matchIndex=-1
>  2025-12-04 16:46:43.467 | 2025-12-05 00:46:43,466 [grpc-default-executor-0] 
> INFO impl.SnapshotInstallationHandler: om3@group-D66704EFC61C: receive 
> installSnapshot: om1->om3#0-t1,notify:(t:1, i:981)  2025-12-04 16:46:43.468 | 
> 2025-12-05 00:46:43,468 [grpc-default-executor-0] INFO 
> impl.SnapshotInstallationHandler: om3@group-D66704EFC61C: Received 
> notification to install snapshot at index 981 2025-12-04 16:46:43.474 | 
> 2025-12-05 00:46:43,473 [grpc-default-executor-0] INFO 
> impl.SnapshotInstallationHandler: om3@group-D66704EFC61C: 
> notifyInstallSnapshot: nextIndex is 1 but the leader's first available index 
> is 981. 2025-12-04 16:46:43.474 | 2025-12-05 00:46:43,473 
> [grpc-default-executor-0] INFO ratis.OzoneManagerStateMachine: Received 
> install snapshot notification from OM leader: om1 with term index: (t:1, 
> i:981) 2025-12-04 16:46:43.475 | 2025-12-05 00:46:43,474 
> [om3-InstallSnapshotThread] INFO utils.RDBSnapshotProvider: Prepare to 
> download the snapshot from leader OM om1 and reloading state from the 
> snapshot. 2025-12-04 16:46:43.481 | 2025-12-05 00:46:43,480 
> [om3-InstallSnapshotThread] INFO ratis_snapshot.OmRatisSnapshotProvider: 
> Downloading latest checkpoint from Leader OM om1. Checkpoint: 
> om.db-om1-1764895603475.tar URL: 
> http://om1:9874/v2/dbCheckpoint?includeSnapshotData=true&flushBeforeCheckpoint=true
>  2025-12-04 16:46:43.514 | 2025-12-05 00:46:43,514 
> [om3-InstallSnapshotThread] ERROR impl.SnapshotInstallationHandler: 
> om3@group-D66704EFC61C: Failed to notify StateMachine to InstallSnapshot. 
> Exception: java.lang.reflect.UndeclaredThrowableException 2025-12-04 
> 16:46:43.545 | 2025-12-05 00:46:43,545 [grpc-default-executor-0] INFO 
> impl.SnapshotInstallationHandler: om3@group-D66704EFC61C: set new 
> configuration configurationEntry { peers { id: "om1" address: "om1:9872" 
> startupRole: FOLLOWER } peers { id: "om3" address: "om3:9872" startupRole: 
> FOLLOWER } peers { id: "om2" address: "om2:9872" startupRole: FOLLOWER } } 
> from snapshot 2025-12-04 16:46:43.545 | 2025-12-05 00:46:43,545 
> [grpc-default-executor-0] INFO server.RaftServer$Division: 
> om3@group-D66704EFC61C: set configuration conf: {index: 0, 
> cur=peers:[om1|om1:9872, om3|om3:9872, om2|om2:9872]|listeners:[], old=null} 
> 2025-12-04 16:46:43.549 | 2025-12-05 00:46:43,549 [grpc-default-executor-0] 
> INFO ratis.OzoneManagerStateMachine: notifyConfigurationChanged from Ratis: 
> term=0, index=0, New Peer list: om1(om1:9872), om3(om3:9872), om2(om2:9872), 
> New Listener list 2025-12-04 16:46:43.549 | 2025-12-05 00:46:43,549 
> [grpc-default-executor-0] INFO impl.SnapshotInstallationHandler: 
> om3@group-D66704EFC61C: reply installSnapshot: 
> om1<-om3#0:FAIL-t1,IN_PROGRESS,snapshotIndex=0  2025-12-04 16:46:43.551 | 
> 2025-12-05 00:46:43,551 [grpc-default-executor-0] INFO 
> server.GrpcServerProtocolService: om3: Completed INSTALL_SNAPSHOT, 
> lastRequest: om1->om3#0-t1,notify:(t:1, i:981)  2025-12-04 16:46:43.552 | 
> 2025-12-05 00:46:43,552 [grpc-default-executor-0] INFO 
> server.GrpcServerProtocolService: om3: Completed INSTALL_SNAPSHOT, lastReply: 
> null   {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to