[ 
https://issues.apache.org/jira/browse/HDDS-9156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760475#comment-17760475
 ] 

Sammi Chen edited comment on HDDS-9156 at 8/30/23 2:49 PM:
-----------------------------------------------------------

Here is how the scm4.org check failed.   On scm4.org side, it started at 
11:11:50, 


{code:java}
2023-08-14 11:11:50,970 [main] INFO server.StorageContainerManagerStarter: 
STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting StorageContainerManager
STARTUP_MSG:   host = scm4.org/172.25.0.120
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 1.4.0-SNAPSHOT
{code}

It joined the SCM HA raft ring at round 11:12:08


{code:java}
2023-08-14 11:12:08,527 [grpc-default-executor-0] INFO 
impl.SnapshotInstallationHandler: 
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: receive 
installSnapshot: 
61b328f4-df7e-4623-a2fe-9ceb8ce81b7c->39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f#0-t3,notify:(t:1,
 i:0)
2023-08-14 11:12:08,661 [grpc-default-executor-0] INFO ha.SCMStateMachine: 
leader changed, yet current SCM is still follower.
2023-08-14 11:12:08,662 [grpc-default-executor-0] INFO 
server.RaftServer$Division: 
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: change Leader from 
null to 61b328f4-df7e-4623-a2fe-9ceb8ce81b7c at term 3 for installSnapshot, 
leader elected after 11538ms
2023-08-14 11:12:08,745 [grpc-default-executor-0] INFO 
impl.SnapshotInstallationHandler: 
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: Received notification 
to install snapshot at index 0
2023-08-14 11:12:08,755 [grpc-default-executor-0] INFO 
impl.SnapshotInstallationHandler: 
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: InstallSnapshot 
notification result: ALREADY_INSTALLED, current snapshot index: -1
{code}

It received the new set configuration ratis command at 11:12:11, this new 
configuration includes scm4.org


{code:java}
2023-08-14 11:12:11,541 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f-server-thread1] 
INFO server.RaftServer$Division: 
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: set configuration 407: 
peers:[61b328f4-df7e-4623-a2fe-9ceb8ce81b7c|rpc:scm2.org:9894|admin:|client:|dataStream:|priority:2|startupRole:FOLLOWER,
 
85b682d7-8fe7-4853-84d4-764f5b803819|rpc:scm3.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
 
24a88f8e-8ac3-49fd-946b-8313e286b62e|rpc:scm1.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
 
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f|rpc:scm4.org:9894|admin:|client:|dataStream:|priority:0|startupRole:FOLLOWER]|listeners:[],
 old=null
{code}

Then at 11:12:16, it received the rotation preparation command for the root 
certificate 4,


{code:java}
2023-08-14 11:12:16,034 
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] 
INFO security.RootCARotationHandlerImpl: Received rotation prepare command of 
root certificate 4
....
2023-08-14 11:12:16,052 [RootCARotationManager] INFO 
security.RootCARotationManager: SubCARotationPrepareTask[rootCertId = 4] - 
started.
{code}

And later, it received these four requests,


{code:java}
2023-08-14 11:12:16,063 
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] 
INFO security.RootCARotationHandlerImpl: Received rotation prepare ack of root 
certificate 4 from scm 61b328f4-df7e-4623-a2fe-9ceb8ce81b7c
2023-08-14 11:12:16,067 
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] 
INFO security.RootCARotationHandlerImpl: Received rotation prepare ack of root 
certificate 4 from scm 85b682d7-8fe7-4853-84d4-764f5b803819
2023-08-14 11:12:16,094 
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] 
INFO security.RootCARotationHandlerImpl: Received rotation prepare ack of root 
certificate 4 from scm 24a88f8e-8ac3-49fd-946b-8313e286b62e


2023-08-14 11:12:16,098 
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] 
INFO security.RootCARotationHandlerImpl: Received rotation commit command of 
root certificate 4
{code}

And right after that, it failed during rotation commit command handling and 
stopped itself. 

{code:java}
2023-08-14 11:12:19,651 [RootCARotationManager] INFO 
security.RootCARotationManager: SubCARotationPrepareTask[rootCertId = 4] - scm 
key generated.
2023-08-14 11:12:19,653 [RootCARotationManager] INFO 
security.RootCARotationManager: In-progress root CA directory 
/data/metadata/scm/ca-next-progress is deleted for 'RootCARotationManager is 
interrupted'
2023-08-14 11:12:19,654 [RootCARotationManager] INFO 
client.SCMCertificateClient: Creating csr for 
SCM->hostName:scm4.org,scmId:39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f,clusterId:CID-516c56b1-3e02-48ae-a9af-e8c128b9df30,subject:[email protected]
2023-08-14 11:12:19,694 [RootCARotationManager] INFO ozone.OzoneSecurityUtil: 
Adding ip:172.25.0.120,host:scm4.org
2023-08-14 11:12:19,697 [RootCARotationManager] INFO ozone.OzoneSecurityUtil: 
ip:127.0.0.1 not returned.
2023-08-14 11:12:19,868 [RootCARotationManager] ERROR 
client.SCMCertificateClient: Error while fetching/storing SCM signed 
certificate.
java.io.InterruptedIOException: Call interrupted
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1561)
        at org.apache.hadoop.ipc.Client.call(Client.java:1513)
        at org.apache.hadoop.ipc.Client.call(Client.java:1410)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:250)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:132)
        at com.sun.proxy.$Proxy14.submitRequest(Unknown Source)
        at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
        at com.sun.proxy.$Proxy14.submitRequest(Unknown Source)
        at 
org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.submitRequest(SCMSecurityProtocolClientSideTranslatorPB.java:102)
        at 
org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.getSCMCertChain(SCMSecurityProtocolClientSideTranslatorPB.java:233)
        at 
org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient.signAndStoreCertificate(SCMCertificateClient.java:208)
        at 
org.apache.hadoop.hdds.scm.security.RootCARotationManager$SubCARotationPrepareTask.run(RootCARotationManager.java:587)
        at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at 
java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
2023-08-14 11:12:16,102 
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] 
ERROR security.RootCARotationHandlerImpl: Failed to move 
/data/metadata/scm/sub-ca-next to /data/metadata/scm/sub-ca
java.nio.file.NoSuchFileException: /data/metadata/scm/sub-ca-next -> 
/data/metadata/scm/sub-ca
        at 
java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
        at 
java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
        at java.base/sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:417)
        at 
java.base/sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:267)
        at java.base/java.nio.file.Files.move(Files.java:1422)
        at 
org.apache.hadoop.hdds.scm.security.RootCARotationHandlerImpl.rotationCommit(RootCARotationHandlerImpl.java:140)
        at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at 
org.apache.hadoop.hdds.scm.ha.SCMStateMachine.process(SCMStateMachine.java:188)
        at 
org.apache.hadoop.hdds.scm.ha.SCMStateMachine.applyTransaction(SCMStateMachine.java:148)
        at 
org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1777)
        at 
org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:242)
        at 
org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:184)
        at java.base/java.lang.Thread.run(Thread.java:829)
2023-08-14 11:12:16,103 
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] 
INFO server.StorageContainerManager: Container Balancer is not running.
2023-08-14 11:12:16,109 
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] 
INFO server.StorageContainerManager: Stopping Replication Manager Service.
2023-08-14 11:12:16,109 
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] 
INFO replication.ReplicationManager: Stopping Replication Monitor Thread.
2023-08-14 11:12:16,109 [Under Replicated Processor] WARN 
replication.UnhealthyReplicationProcessor: Under Replicated Processor 
interrupted. Exiting...
2023-08-14 11:12:16,110 [Over Replicated Processor] WARN 
replication.UnhealthyReplicationProcessor: Over Replicated Processor 
interrupted. Exiting...
2023-08-14 11:12:16,112 
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] 
INFO server.StorageContainerManager: Stopping the Datanode Admin Monitor.
2023-08-14 11:12:16,114 [ReplicationMonitor] INFO 
replication.ReplicationManager: Replication Monitor Thread is stopped
2023-08-14 11:12:16,115 
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] 
INFO server.StorageContainerManager: Stopping datanode service RPC server
2023-08-14 11:12:16,115 
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] 
INFO server.SCMDatanodeProtocolServer: Stopping the RPC server for DataNodes

{code}

scm2.org was the leader when scm4.org was starting. Here is some logs from 
scm2, it accepted scm4.org, and sent out the set configuration request on 
11:12:05. Since scm4.org was added, there were 4 peers in the SCM HA raft ring 
now. So in next root ca rotation, scm2 should expect 4 acks of rotation 
preparation from all SCMs(including itself).

{code:java}
2023-08-14 11:12:05,833 [IPC Server handler 76 on default port 9863] INFO 
ha.SCMRatisServerImpl: 61b328f4-df7e-4623-a2fe-9ceb8ce81b7c: Submitting 
SetConfiguration request to Ratis server with new SCM peers list: 
[61b328f4-df7e-4623-a2fe-9ceb8ce81b7c|rpc:scm2.org:9894|admin:|client:|dataStream:|priority:2|startupRole:FOLLOWER,
 
85b682d7-8fe7-4853-84d4-764f5b803819|rpc:scm3.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
 
24a88f8e-8ac3-49fd-946b-8313e286b62e|rpc:scm1.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
 
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f|rpc:scm4.org:9894|priority:0|startupRole:FOLLOWER]
2023-08-14 11:12:05,834 [IPC Server handler 76 on default port 9863] INFO 
server.RaftServer$Division: 
61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30: receive 
setConfiguration 
SetConfigurationRequest:client-E2E7B5FA32F0->61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30,
 cid=21, seq=0, RW, null, SET_UNCONDITIONALLY, 
servers:[61b328f4-df7e-4623-a2fe-9ceb8ce81b7c|rpc:scm2.org:9894|admin:|client:|dataStream:|priority:2|startupRole:FOLLOWER,
 
85b682d7-8fe7-4853-84d4-764f5b803819|rpc:scm3.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
 
24a88f8e-8ac3-49fd-946b-8313e286b62e|rpc:scm1.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
 
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f|rpc:scm4.org:9894|priority:0|startupRole:FOLLOWER],
 listeners:[]
2023-08-14 11:12:05,834 [IPC Server handler 76 on default port 9863] INFO 
server.RaftServer$Division: 
61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30-LeaderStateImpl: 
startSetConfiguration 
SetConfigurationRequest:client-E2E7B5FA32F0->61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30,
 cid=21, seq=0, RW, null, SET_UNCONDITIONALLY, 
servers:[61b328f4-df7e-4623-a2fe-9ceb8ce81b7c|rpc:scm2.org:9894|admin:|client:|dataStream:|priority:2|startupRole:FOLLOWER,
 
85b682d7-8fe7-4853-84d4-764f5b803819|rpc:scm3.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
 
24a88f8e-8ac3-49fd-946b-8313e286b62e|rpc:scm1.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
 
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f|rpc:scm4.org:9894|priority:0|startupRole:FOLLOWER],
 listeners:[]
{code}

But on 11:12:10, it sent out the rotation commit request after received only 3 
SCM's acks for rotation preparation.  

{code:java}
2023-08-14 11:12:10,373 [RootCARotationManager-Active] INFO 
security.RootCARotationManager: numFromHADetails 3, numFromRatisServer 3
2023-08-14 11:12:10,382 
[61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30-StateMachineUpdater] 
INFO security.RootCARotationHandlerImpl: Received rotation commit command of 
root certificate 4
{code}

In the current implementation, 
RaftServer.Division.getRaftConf().getCurrentPeers().size() is called to get the 
current peers count in the ratis group. It looks like ratis doesn't count 
scm4.org in this failed acceptance test run. 



 




was (Author: sammi):
Here is how the scm4.org check failed.   On scm4.org side, it started at 
11:11:50, 


{code:java}
2023-08-14 11:11:50,970 [main] INFO server.StorageContainerManagerStarter: 
STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting StorageContainerManager
STARTUP_MSG:   host = scm4.org/172.25.0.120
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 1.4.0-SNAPSHOT
{code}

It joined the SCM HA raft ring at round 11:12:08


{code:java}
2023-08-14 11:12:08,527 [grpc-default-executor-0] INFO 
impl.SnapshotInstallationHandler: 
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: receive 
installSnapshot: 
61b328f4-df7e-4623-a2fe-9ceb8ce81b7c->39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f#0-t3,notify:(t:1,
 i:0)
2023-08-14 11:12:08,661 [grpc-default-executor-0] INFO ha.SCMStateMachine: 
leader changed, yet current SCM is still follower.
2023-08-14 11:12:08,662 [grpc-default-executor-0] INFO 
server.RaftServer$Division: 
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: change Leader from 
null to 61b328f4-df7e-4623-a2fe-9ceb8ce81b7c at term 3 for installSnapshot, 
leader elected after 11538ms
2023-08-14 11:12:08,745 [grpc-default-executor-0] INFO 
impl.SnapshotInstallationHandler: 
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: Received notification 
to install snapshot at index 0
2023-08-14 11:12:08,755 [grpc-default-executor-0] INFO 
impl.SnapshotInstallationHandler: 
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: InstallSnapshot 
notification result: ALREADY_INSTALLED, current snapshot index: -1
{code}

It received the new set configuration ratis command at 11:12:11, this new 
configuration includes scm4.org


{code:java}
2023-08-14 11:12:11,541 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f-server-thread1] 
INFO server.RaftServer$Division: 
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: set configuration 407: 
peers:[61b328f4-df7e-4623-a2fe-9ceb8ce81b7c|rpc:scm2.org:9894|admin:|client:|dataStream:|priority:2|startupRole:FOLLOWER,
 
85b682d7-8fe7-4853-84d4-764f5b803819|rpc:scm3.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
 
24a88f8e-8ac3-49fd-946b-8313e286b62e|rpc:scm1.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
 
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f|rpc:scm4.org:9894|admin:|client:|dataStream:|priority:0|startupRole:FOLLOWER]|listeners:[],
 old=null
{code}

Then at 11:12:16, it received the rotation preparation command for the root 
certificate 4,


{code:java}
2023-08-14 11:12:16,034 
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] 
INFO security.RootCARotationHandlerImpl: Received rotation prepare command of 
root certificate 4
....
2023-08-14 11:12:16,052 [RootCARotationManager] INFO 
security.RootCARotationManager: SubCARotationPrepareTask[rootCertId = 4] - 
started.
{code}

And later, it received these four requests,


{code:java}
2023-08-14 11:12:16,063 
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] 
INFO security.RootCARotationHandlerImpl: Received rotation prepare ack of root 
certificate 4 from scm 61b328f4-df7e-4623-a2fe-9ceb8ce81b7c
2023-08-14 11:12:16,067 
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] 
INFO security.RootCARotationHandlerImpl: Received rotation prepare ack of root 
certificate 4 from scm 85b682d7-8fe7-4853-84d4-764f5b803819
2023-08-14 11:12:16,094 
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] 
INFO security.RootCARotationHandlerImpl: Received rotation prepare ack of root 
certificate 4 from scm 24a88f8e-8ac3-49fd-946b-8313e286b62e


2023-08-14 11:12:16,098 
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] 
INFO security.RootCARotationHandlerImpl: Received rotation commit command of 
root certificate 4
{code}

And right after that, it failed during rotation commit command handling and 
stopped itself. 

{code:java}
2023-08-14 11:12:19,651 [RootCARotationManager] INFO 
security.RootCARotationManager: SubCARotationPrepareTask[rootCertId = 4] - scm 
key generated.
2023-08-14 11:12:19,653 [RootCARotationManager] INFO 
security.RootCARotationManager: In-progress root CA directory 
/data/metadata/scm/ca-next-progress is deleted for 'RootCARotationManager is 
interrupted'
2023-08-14 11:12:19,654 [RootCARotationManager] INFO 
client.SCMCertificateClient: Creating csr for 
SCM->hostName:scm4.org,scmId:39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f,clusterId:CID-516c56b1-3e02-48ae-a9af-e8c128b9df30,subject:[email protected]
2023-08-14 11:12:19,694 [RootCARotationManager] INFO ozone.OzoneSecurityUtil: 
Adding ip:172.25.0.120,host:scm4.org
2023-08-14 11:12:19,697 [RootCARotationManager] INFO ozone.OzoneSecurityUtil: 
ip:127.0.0.1 not returned.
2023-08-14 11:12:19,868 [RootCARotationManager] ERROR 
client.SCMCertificateClient: Error while fetching/storing SCM signed 
certificate.
java.io.InterruptedIOException: Call interrupted
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1561)
        at org.apache.hadoop.ipc.Client.call(Client.java:1513)
        at org.apache.hadoop.ipc.Client.call(Client.java:1410)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:250)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:132)
        at com.sun.proxy.$Proxy14.submitRequest(Unknown Source)
        at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
        at com.sun.proxy.$Proxy14.submitRequest(Unknown Source)
        at 
org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.submitRequest(SCMSecurityProtocolClientSideTranslatorPB.java:102)
        at 
org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.getSCMCertChain(SCMSecurityProtocolClientSideTranslatorPB.java:233)
        at 
org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient.signAndStoreCertificate(SCMCertificateClient.java:208)
        at 
org.apache.hadoop.hdds.scm.security.RootCARotationManager$SubCARotationPrepareTask.run(RootCARotationManager.java:587)
        at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at 
java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
2023-08-14 11:12:16,102 
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] 
ERROR security.RootCARotationHandlerImpl: Failed to move 
/data/metadata/scm/sub-ca-next to /data/metadata/scm/sub-ca
java.nio.file.NoSuchFileException: /data/metadata/scm/sub-ca-next -> 
/data/metadata/scm/sub-ca
        at 
java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
        at 
java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
        at java.base/sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:417)
        at 
java.base/sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:267)
        at java.base/java.nio.file.Files.move(Files.java:1422)
        at 
org.apache.hadoop.hdds.scm.security.RootCARotationHandlerImpl.rotationCommit(RootCARotationHandlerImpl.java:140)
        at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at 
org.apache.hadoop.hdds.scm.ha.SCMStateMachine.process(SCMStateMachine.java:188)
        at 
org.apache.hadoop.hdds.scm.ha.SCMStateMachine.applyTransaction(SCMStateMachine.java:148)
        at 
org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1777)
        at 
org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:242)
        at 
org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:184)
        at java.base/java.lang.Thread.run(Thread.java:829)
2023-08-14 11:12:16,103 
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] 
INFO server.StorageContainerManager: Container Balancer is not running.
2023-08-14 11:12:16,109 
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] 
INFO server.StorageContainerManager: Stopping Replication Manager Service.
2023-08-14 11:12:16,109 
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] 
INFO replication.ReplicationManager: Stopping Replication Monitor Thread.
2023-08-14 11:12:16,109 [Under Replicated Processor] WARN 
replication.UnhealthyReplicationProcessor: Under Replicated Processor 
interrupted. Exiting...
2023-08-14 11:12:16,110 [Over Replicated Processor] WARN 
replication.UnhealthyReplicationProcessor: Over Replicated Processor 
interrupted. Exiting...
2023-08-14 11:12:16,112 
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] 
INFO server.StorageContainerManager: Stopping the Datanode Admin Monitor.
2023-08-14 11:12:16,114 [ReplicationMonitor] INFO 
replication.ReplicationManager: Replication Monitor Thread is stopped
2023-08-14 11:12:16,115 
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] 
INFO server.StorageContainerManager: Stopping datanode service RPC server
2023-08-14 11:12:16,115 
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater] 
INFO server.SCMDatanodeProtocolServer: Stopping the RPC server for DataNodes

{code}

scm2.org was the leader when scm4.org was starting. Here is some logs from 
scm2, it accepted scm4.org, and sent out the set configuration request on 
11:12:05. Since scm4.org was added, there were 4 peers in the SCM HA raft ring 
now. So in next root ca rotation, scm2 should expect 4 acks of rotation 
preparation from all SCMs(including itself).

{code:java}
2023-08-14 11:12:05,833 [IPC Server handler 76 on default port 9863] INFO 
ha.SCMRatisServerImpl: 61b328f4-df7e-4623-a2fe-9ceb8ce81b7c: Submitting 
SetConfiguration request to Ratis server with new SCM peers list: 
[61b328f4-df7e-4623-a2fe-9ceb8ce81b7c|rpc:scm2.org:9894|admin:|client:|dataStream:|priority:2|startupRole:FOLLOWER,
 
85b682d7-8fe7-4853-84d4-764f5b803819|rpc:scm3.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
 
24a88f8e-8ac3-49fd-946b-8313e286b62e|rpc:scm1.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
 
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f|rpc:scm4.org:9894|priority:0|startupRole:FOLLOWER]
2023-08-14 11:12:05,834 [IPC Server handler 76 on default port 9863] INFO 
server.RaftServer$Division: 
61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30: receive 
setConfiguration 
SetConfigurationRequest:client-E2E7B5FA32F0->61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30,
 cid=21, seq=0, RW, null, SET_UNCONDITIONALLY, 
servers:[61b328f4-df7e-4623-a2fe-9ceb8ce81b7c|rpc:scm2.org:9894|admin:|client:|dataStream:|priority:2|startupRole:FOLLOWER,
 
85b682d7-8fe7-4853-84d4-764f5b803819|rpc:scm3.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
 
24a88f8e-8ac3-49fd-946b-8313e286b62e|rpc:scm1.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
 
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f|rpc:scm4.org:9894|priority:0|startupRole:FOLLOWER],
 listeners:[]
2023-08-14 11:12:05,834 [IPC Server handler 76 on default port 9863] INFO 
server.RaftServer$Division: 
61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30-LeaderStateImpl: 
startSetConfiguration 
SetConfigurationRequest:client-E2E7B5FA32F0->61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30,
 cid=21, seq=0, RW, null, SET_UNCONDITIONALLY, 
servers:[61b328f4-df7e-4623-a2fe-9ceb8ce81b7c|rpc:scm2.org:9894|admin:|client:|dataStream:|priority:2|startupRole:FOLLOWER,
 
85b682d7-8fe7-4853-84d4-764f5b803819|rpc:scm3.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
 
24a88f8e-8ac3-49fd-946b-8313e286b62e|rpc:scm1.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
 
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f|rpc:scm4.org:9894|priority:0|startupRole:FOLLOWER],
 listeners:[]
{code}

But on 11:12:10, it sent out the rotation commit request after received only 3 
SCM's acks for rotation preparation.  

{code:java}
2023-08-14 11:12:10,373 [RootCARotationManager-Active] INFO 
security.RootCARotationManager: numFromHADetails 3, numFromRatisServer 3
2023-08-14 11:12:10,382 
[61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30-StateMachineUpdater] 
INFO security.RootCARotationHandlerImpl: Received rotation commit command of 
root certificate 4
{code}

In the current implementation, 
RaftServer.Division.getRaftConf().getCurrentPeers().size() is called to get the 
current peers count in the ratis group. It looks like ratis doesn't count 
scm4.org in.  



 



> cert-rotation acceptance test is failing randomly
> -------------------------------------------------
>
>                 Key: HDDS-9156
>                 URL: https://issues.apache.org/jira/browse/HDDS-9156
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Certificates, test
>            Reporter: Siyao Meng
>            Assignee: Sammi Chen
>            Priority: Major
>              Labels: pull-request-available
>
> -Despite there being no "FAIL"ed test items:-
> https://github.com/apache/ozone/actions/runs/5825414635/job/15797755458?pr=5172#logs
> ah. {{ozone admin scm roles}} command timed out:
> {code}
> ozone admin scm roles | grep scm4.org hasn't succeed yet
> Timed out waiting on ozone admin scm roles | grep scm4.org to be successful
> {code}
> cc [~pifta]
> Another failure:
> https://github.com/apache/ozone/actions/runs/5827776199/job/15804891746
> {code}
> Creating ozonesecure-ha_scm4.org_1 ... done
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is available on scm4.org
> ==============================================================================
> Kinit :: Kinit test user                                                      
> ==============================================================================
> Kinit                                                                 | PASS |
> ------------------------------------------------------------------------------
> Kinit :: Kinit test user                                              | PASS |
> 1 test, 1 passed, 0 failed
> ==============================================================================
> Output:  /tmp/smoketest/ozonesecure-ha/result/robot-5.xml
> ozone admin scm roles | grep scm4.org hasn't succeed yet
> ozone admin scm roles | grep scm4.org hasn't succeed yet
> ozone admin scm roles | grep scm4.org hasn't succeed yet
> ozone admin scm roles | grep scm4.org hasn't succeed yet
> ozone admin scm roles | grep scm4.org hasn't succeed yet
> Timed out waiting on ozone admin scm roles | grep scm4.org to be successful
> Stopping ozonesecure-ha_scm4.org_1  ... 
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to