[
https://issues.apache.org/jira/browse/HDDS-9156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760475#comment-17760475
]
Sammi Chen edited comment on HDDS-9156 at 8/30/23 2:49 PM:
-----------------------------------------------------------
Here is how the scm4.org check failed. On scm4.org side, it started at
11:11:50,
{code:java}
2023-08-14 11:11:50,970 [main] INFO server.StorageContainerManagerStarter:
STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting StorageContainerManager
STARTUP_MSG: host = scm4.org/172.25.0.120
STARTUP_MSG: args = []
STARTUP_MSG: version = 1.4.0-SNAPSHOT
{code}
It joined the SCM HA raft ring at round 11:12:08
{code:java}
2023-08-14 11:12:08,527 [grpc-default-executor-0] INFO
impl.SnapshotInstallationHandler:
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: receive
installSnapshot:
61b328f4-df7e-4623-a2fe-9ceb8ce81b7c->39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f#0-t3,notify:(t:1,
i:0)
2023-08-14 11:12:08,661 [grpc-default-executor-0] INFO ha.SCMStateMachine:
leader changed, yet current SCM is still follower.
2023-08-14 11:12:08,662 [grpc-default-executor-0] INFO
server.RaftServer$Division:
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: change Leader from
null to 61b328f4-df7e-4623-a2fe-9ceb8ce81b7c at term 3 for installSnapshot,
leader elected after 11538ms
2023-08-14 11:12:08,745 [grpc-default-executor-0] INFO
impl.SnapshotInstallationHandler:
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: Received notification
to install snapshot at index 0
2023-08-14 11:12:08,755 [grpc-default-executor-0] INFO
impl.SnapshotInstallationHandler:
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: InstallSnapshot
notification result: ALREADY_INSTALLED, current snapshot index: -1
{code}
It received the new set configuration ratis command at 11:12:11, this new
configuration includes scm4.org
{code:java}
2023-08-14 11:12:11,541 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f-server-thread1]
INFO server.RaftServer$Division:
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: set configuration 407:
peers:[61b328f4-df7e-4623-a2fe-9ceb8ce81b7c|rpc:scm2.org:9894|admin:|client:|dataStream:|priority:2|startupRole:FOLLOWER,
85b682d7-8fe7-4853-84d4-764f5b803819|rpc:scm3.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
24a88f8e-8ac3-49fd-946b-8313e286b62e|rpc:scm1.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f|rpc:scm4.org:9894|admin:|client:|dataStream:|priority:0|startupRole:FOLLOWER]|listeners:[],
old=null
{code}
Then at 11:12:16, it received the rotation preparation command for the root
certificate 4,
{code:java}
2023-08-14 11:12:16,034
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater]
INFO security.RootCARotationHandlerImpl: Received rotation prepare command of
root certificate 4
....
2023-08-14 11:12:16,052 [RootCARotationManager] INFO
security.RootCARotationManager: SubCARotationPrepareTask[rootCertId = 4] -
started.
{code}
And later, it received these four requests,
{code:java}
2023-08-14 11:12:16,063
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater]
INFO security.RootCARotationHandlerImpl: Received rotation prepare ack of root
certificate 4 from scm 61b328f4-df7e-4623-a2fe-9ceb8ce81b7c
2023-08-14 11:12:16,067
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater]
INFO security.RootCARotationHandlerImpl: Received rotation prepare ack of root
certificate 4 from scm 85b682d7-8fe7-4853-84d4-764f5b803819
2023-08-14 11:12:16,094
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater]
INFO security.RootCARotationHandlerImpl: Received rotation prepare ack of root
certificate 4 from scm 24a88f8e-8ac3-49fd-946b-8313e286b62e
2023-08-14 11:12:16,098
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater]
INFO security.RootCARotationHandlerImpl: Received rotation commit command of
root certificate 4
{code}
And right after that, it failed during rotation commit command handling and
stopped itself.
{code:java}
2023-08-14 11:12:19,651 [RootCARotationManager] INFO
security.RootCARotationManager: SubCARotationPrepareTask[rootCertId = 4] - scm
key generated.
2023-08-14 11:12:19,653 [RootCARotationManager] INFO
security.RootCARotationManager: In-progress root CA directory
/data/metadata/scm/ca-next-progress is deleted for 'RootCARotationManager is
interrupted'
2023-08-14 11:12:19,654 [RootCARotationManager] INFO
client.SCMCertificateClient: Creating csr for
SCM->hostName:scm4.org,scmId:39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f,clusterId:CID-516c56b1-3e02-48ae-a9af-e8c128b9df30,subject:[email protected]
2023-08-14 11:12:19,694 [RootCARotationManager] INFO ozone.OzoneSecurityUtil:
Adding ip:172.25.0.120,host:scm4.org
2023-08-14 11:12:19,697 [RootCARotationManager] INFO ozone.OzoneSecurityUtil:
ip:127.0.0.1 not returned.
2023-08-14 11:12:19,868 [RootCARotationManager] ERROR
client.SCMCertificateClient: Error while fetching/storing SCM signed
certificate.
java.io.InterruptedIOException: Call interrupted
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1561)
at org.apache.hadoop.ipc.Client.call(Client.java:1513)
at org.apache.hadoop.ipc.Client.call(Client.java:1410)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:250)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:132)
at com.sun.proxy.$Proxy14.submitRequest(Unknown Source)
at
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)
at
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
at
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
at
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
at com.sun.proxy.$Proxy14.submitRequest(Unknown Source)
at
org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.submitRequest(SCMSecurityProtocolClientSideTranslatorPB.java:102)
at
org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.getSCMCertChain(SCMSecurityProtocolClientSideTranslatorPB.java:233)
at
org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient.signAndStoreCertificate(SCMCertificateClient.java:208)
at
org.apache.hadoop.hdds.scm.security.RootCARotationManager$SubCARotationPrepareTask.run(RootCARotationManager.java:587)
at
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at
java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
2023-08-14 11:12:16,102
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater]
ERROR security.RootCARotationHandlerImpl: Failed to move
/data/metadata/scm/sub-ca-next to /data/metadata/scm/sub-ca
java.nio.file.NoSuchFileException: /data/metadata/scm/sub-ca-next ->
/data/metadata/scm/sub-ca
at
java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
at
java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
at java.base/sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:417)
at
java.base/sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:267)
at java.base/java.nio.file.Files.move(Files.java:1422)
at
org.apache.hadoop.hdds.scm.security.RootCARotationHandlerImpl.rotationCommit(RootCARotationHandlerImpl.java:140)
at
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at
org.apache.hadoop.hdds.scm.ha.SCMStateMachine.process(SCMStateMachine.java:188)
at
org.apache.hadoop.hdds.scm.ha.SCMStateMachine.applyTransaction(SCMStateMachine.java:148)
at
org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1777)
at
org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:242)
at
org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:184)
at java.base/java.lang.Thread.run(Thread.java:829)
2023-08-14 11:12:16,103
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater]
INFO server.StorageContainerManager: Container Balancer is not running.
2023-08-14 11:12:16,109
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater]
INFO server.StorageContainerManager: Stopping Replication Manager Service.
2023-08-14 11:12:16,109
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater]
INFO replication.ReplicationManager: Stopping Replication Monitor Thread.
2023-08-14 11:12:16,109 [Under Replicated Processor] WARN
replication.UnhealthyReplicationProcessor: Under Replicated Processor
interrupted. Exiting...
2023-08-14 11:12:16,110 [Over Replicated Processor] WARN
replication.UnhealthyReplicationProcessor: Over Replicated Processor
interrupted. Exiting...
2023-08-14 11:12:16,112
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater]
INFO server.StorageContainerManager: Stopping the Datanode Admin Monitor.
2023-08-14 11:12:16,114 [ReplicationMonitor] INFO
replication.ReplicationManager: Replication Monitor Thread is stopped
2023-08-14 11:12:16,115
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater]
INFO server.StorageContainerManager: Stopping datanode service RPC server
2023-08-14 11:12:16,115
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater]
INFO server.SCMDatanodeProtocolServer: Stopping the RPC server for DataNodes
{code}
scm2.org was the leader when scm4.org was starting. Here is some logs from
scm2, it accepted scm4.org, and sent out the set configuration request on
11:12:05. Since scm4.org was added, there were 4 peers in the SCM HA raft ring
now. So in next root ca rotation, scm2 should expect 4 acks of rotation
preparation from all SCMs(including itself).
{code:java}
2023-08-14 11:12:05,833 [IPC Server handler 76 on default port 9863] INFO
ha.SCMRatisServerImpl: 61b328f4-df7e-4623-a2fe-9ceb8ce81b7c: Submitting
SetConfiguration request to Ratis server with new SCM peers list:
[61b328f4-df7e-4623-a2fe-9ceb8ce81b7c|rpc:scm2.org:9894|admin:|client:|dataStream:|priority:2|startupRole:FOLLOWER,
85b682d7-8fe7-4853-84d4-764f5b803819|rpc:scm3.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
24a88f8e-8ac3-49fd-946b-8313e286b62e|rpc:scm1.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f|rpc:scm4.org:9894|priority:0|startupRole:FOLLOWER]
2023-08-14 11:12:05,834 [IPC Server handler 76 on default port 9863] INFO
server.RaftServer$Division:
61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30: receive
setConfiguration
SetConfigurationRequest:client-E2E7B5FA32F0->61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30,
cid=21, seq=0, RW, null, SET_UNCONDITIONALLY,
servers:[61b328f4-df7e-4623-a2fe-9ceb8ce81b7c|rpc:scm2.org:9894|admin:|client:|dataStream:|priority:2|startupRole:FOLLOWER,
85b682d7-8fe7-4853-84d4-764f5b803819|rpc:scm3.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
24a88f8e-8ac3-49fd-946b-8313e286b62e|rpc:scm1.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f|rpc:scm4.org:9894|priority:0|startupRole:FOLLOWER],
listeners:[]
2023-08-14 11:12:05,834 [IPC Server handler 76 on default port 9863] INFO
server.RaftServer$Division:
61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30-LeaderStateImpl:
startSetConfiguration
SetConfigurationRequest:client-E2E7B5FA32F0->61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30,
cid=21, seq=0, RW, null, SET_UNCONDITIONALLY,
servers:[61b328f4-df7e-4623-a2fe-9ceb8ce81b7c|rpc:scm2.org:9894|admin:|client:|dataStream:|priority:2|startupRole:FOLLOWER,
85b682d7-8fe7-4853-84d4-764f5b803819|rpc:scm3.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
24a88f8e-8ac3-49fd-946b-8313e286b62e|rpc:scm1.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f|rpc:scm4.org:9894|priority:0|startupRole:FOLLOWER],
listeners:[]
{code}
But on 11:12:10, it sent out the rotation commit request after received only 3
SCM's acks for rotation preparation.
{code:java}
2023-08-14 11:12:10,373 [RootCARotationManager-Active] INFO
security.RootCARotationManager: numFromHADetails 3, numFromRatisServer 3
2023-08-14 11:12:10,382
[61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30-StateMachineUpdater]
INFO security.RootCARotationHandlerImpl: Received rotation commit command of
root certificate 4
{code}
In the current implementation,
RaftServer.Division.getRaftConf().getCurrentPeers().size() is called to get the
current peers count in the ratis group. It looks like ratis doesn't count
scm4.org in this failed acceptance test run.
was (Author: sammi):
Here is how the scm4.org check failed. On scm4.org side, it started at
11:11:50,
{code:java}
2023-08-14 11:11:50,970 [main] INFO server.StorageContainerManagerStarter:
STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting StorageContainerManager
STARTUP_MSG: host = scm4.org/172.25.0.120
STARTUP_MSG: args = []
STARTUP_MSG: version = 1.4.0-SNAPSHOT
{code}
It joined the SCM HA raft ring at round 11:12:08
{code:java}
2023-08-14 11:12:08,527 [grpc-default-executor-0] INFO
impl.SnapshotInstallationHandler:
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: receive
installSnapshot:
61b328f4-df7e-4623-a2fe-9ceb8ce81b7c->39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f#0-t3,notify:(t:1,
i:0)
2023-08-14 11:12:08,661 [grpc-default-executor-0] INFO ha.SCMStateMachine:
leader changed, yet current SCM is still follower.
2023-08-14 11:12:08,662 [grpc-default-executor-0] INFO
server.RaftServer$Division:
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: change Leader from
null to 61b328f4-df7e-4623-a2fe-9ceb8ce81b7c at term 3 for installSnapshot,
leader elected after 11538ms
2023-08-14 11:12:08,745 [grpc-default-executor-0] INFO
impl.SnapshotInstallationHandler:
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: Received notification
to install snapshot at index 0
2023-08-14 11:12:08,755 [grpc-default-executor-0] INFO
impl.SnapshotInstallationHandler:
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: InstallSnapshot
notification result: ALREADY_INSTALLED, current snapshot index: -1
{code}
It received the new set configuration ratis command at 11:12:11, this new
configuration includes scm4.org
{code:java}
2023-08-14 11:12:11,541 [39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f-server-thread1]
INFO server.RaftServer$Division:
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30: set configuration 407:
peers:[61b328f4-df7e-4623-a2fe-9ceb8ce81b7c|rpc:scm2.org:9894|admin:|client:|dataStream:|priority:2|startupRole:FOLLOWER,
85b682d7-8fe7-4853-84d4-764f5b803819|rpc:scm3.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
24a88f8e-8ac3-49fd-946b-8313e286b62e|rpc:scm1.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f|rpc:scm4.org:9894|admin:|client:|dataStream:|priority:0|startupRole:FOLLOWER]|listeners:[],
old=null
{code}
Then at 11:12:16, it received the rotation preparation command for the root
certificate 4,
{code:java}
2023-08-14 11:12:16,034
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater]
INFO security.RootCARotationHandlerImpl: Received rotation prepare command of
root certificate 4
....
2023-08-14 11:12:16,052 [RootCARotationManager] INFO
security.RootCARotationManager: SubCARotationPrepareTask[rootCertId = 4] -
started.
{code}
And later, it received these four requests,
{code:java}
2023-08-14 11:12:16,063
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater]
INFO security.RootCARotationHandlerImpl: Received rotation prepare ack of root
certificate 4 from scm 61b328f4-df7e-4623-a2fe-9ceb8ce81b7c
2023-08-14 11:12:16,067
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater]
INFO security.RootCARotationHandlerImpl: Received rotation prepare ack of root
certificate 4 from scm 85b682d7-8fe7-4853-84d4-764f5b803819
2023-08-14 11:12:16,094
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater]
INFO security.RootCARotationHandlerImpl: Received rotation prepare ack of root
certificate 4 from scm 24a88f8e-8ac3-49fd-946b-8313e286b62e
2023-08-14 11:12:16,098
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater]
INFO security.RootCARotationHandlerImpl: Received rotation commit command of
root certificate 4
{code}
And right after that, it failed during rotation commit command handling and
stopped itself.
{code:java}
2023-08-14 11:12:19,651 [RootCARotationManager] INFO
security.RootCARotationManager: SubCARotationPrepareTask[rootCertId = 4] - scm
key generated.
2023-08-14 11:12:19,653 [RootCARotationManager] INFO
security.RootCARotationManager: In-progress root CA directory
/data/metadata/scm/ca-next-progress is deleted for 'RootCARotationManager is
interrupted'
2023-08-14 11:12:19,654 [RootCARotationManager] INFO
client.SCMCertificateClient: Creating csr for
SCM->hostName:scm4.org,scmId:39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f,clusterId:CID-516c56b1-3e02-48ae-a9af-e8c128b9df30,subject:[email protected]
2023-08-14 11:12:19,694 [RootCARotationManager] INFO ozone.OzoneSecurityUtil:
Adding ip:172.25.0.120,host:scm4.org
2023-08-14 11:12:19,697 [RootCARotationManager] INFO ozone.OzoneSecurityUtil:
ip:127.0.0.1 not returned.
2023-08-14 11:12:19,868 [RootCARotationManager] ERROR
client.SCMCertificateClient: Error while fetching/storing SCM signed
certificate.
java.io.InterruptedIOException: Call interrupted
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1561)
at org.apache.hadoop.ipc.Client.call(Client.java:1513)
at org.apache.hadoop.ipc.Client.call(Client.java:1410)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:250)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:132)
at com.sun.proxy.$Proxy14.submitRequest(Unknown Source)
at
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)
at
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
at
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
at
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
at com.sun.proxy.$Proxy14.submitRequest(Unknown Source)
at
org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.submitRequest(SCMSecurityProtocolClientSideTranslatorPB.java:102)
at
org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.getSCMCertChain(SCMSecurityProtocolClientSideTranslatorPB.java:233)
at
org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient.signAndStoreCertificate(SCMCertificateClient.java:208)
at
org.apache.hadoop.hdds.scm.security.RootCARotationManager$SubCARotationPrepareTask.run(RootCARotationManager.java:587)
at
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at
java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
2023-08-14 11:12:16,102
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater]
ERROR security.RootCARotationHandlerImpl: Failed to move
/data/metadata/scm/sub-ca-next to /data/metadata/scm/sub-ca
java.nio.file.NoSuchFileException: /data/metadata/scm/sub-ca-next ->
/data/metadata/scm/sub-ca
at
java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
at
java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
at java.base/sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:417)
at
java.base/sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:267)
at java.base/java.nio.file.Files.move(Files.java:1422)
at
org.apache.hadoop.hdds.scm.security.RootCARotationHandlerImpl.rotationCommit(RootCARotationHandlerImpl.java:140)
at
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at
org.apache.hadoop.hdds.scm.ha.SCMStateMachine.process(SCMStateMachine.java:188)
at
org.apache.hadoop.hdds.scm.ha.SCMStateMachine.applyTransaction(SCMStateMachine.java:148)
at
org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1777)
at
org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:242)
at
org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:184)
at java.base/java.lang.Thread.run(Thread.java:829)
2023-08-14 11:12:16,103
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater]
INFO server.StorageContainerManager: Container Balancer is not running.
2023-08-14 11:12:16,109
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater]
INFO server.StorageContainerManager: Stopping Replication Manager Service.
2023-08-14 11:12:16,109
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater]
INFO replication.ReplicationManager: Stopping Replication Monitor Thread.
2023-08-14 11:12:16,109 [Under Replicated Processor] WARN
replication.UnhealthyReplicationProcessor: Under Replicated Processor
interrupted. Exiting...
2023-08-14 11:12:16,110 [Over Replicated Processor] WARN
replication.UnhealthyReplicationProcessor: Over Replicated Processor
interrupted. Exiting...
2023-08-14 11:12:16,112
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater]
INFO server.StorageContainerManager: Stopping the Datanode Admin Monitor.
2023-08-14 11:12:16,114 [ReplicationMonitor] INFO
replication.ReplicationManager: Replication Monitor Thread is stopped
2023-08-14 11:12:16,115
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater]
INFO server.StorageContainerManager: Stopping datanode service RPC server
2023-08-14 11:12:16,115
[39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f@group-E8C128B9DF30-StateMachineUpdater]
INFO server.SCMDatanodeProtocolServer: Stopping the RPC server for DataNodes
{code}
scm2.org was the leader when scm4.org was starting. Here is some logs from
scm2, it accepted scm4.org, and sent out the set configuration request on
11:12:05. Since scm4.org was added, there were 4 peers in the SCM HA raft ring
now. So in next root ca rotation, scm2 should expect 4 acks of rotation
preparation from all SCMs(including itself).
{code:java}
2023-08-14 11:12:05,833 [IPC Server handler 76 on default port 9863] INFO
ha.SCMRatisServerImpl: 61b328f4-df7e-4623-a2fe-9ceb8ce81b7c: Submitting
SetConfiguration request to Ratis server with new SCM peers list:
[61b328f4-df7e-4623-a2fe-9ceb8ce81b7c|rpc:scm2.org:9894|admin:|client:|dataStream:|priority:2|startupRole:FOLLOWER,
85b682d7-8fe7-4853-84d4-764f5b803819|rpc:scm3.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
24a88f8e-8ac3-49fd-946b-8313e286b62e|rpc:scm1.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f|rpc:scm4.org:9894|priority:0|startupRole:FOLLOWER]
2023-08-14 11:12:05,834 [IPC Server handler 76 on default port 9863] INFO
server.RaftServer$Division:
61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30: receive
setConfiguration
SetConfigurationRequest:client-E2E7B5FA32F0->61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30,
cid=21, seq=0, RW, null, SET_UNCONDITIONALLY,
servers:[61b328f4-df7e-4623-a2fe-9ceb8ce81b7c|rpc:scm2.org:9894|admin:|client:|dataStream:|priority:2|startupRole:FOLLOWER,
85b682d7-8fe7-4853-84d4-764f5b803819|rpc:scm3.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
24a88f8e-8ac3-49fd-946b-8313e286b62e|rpc:scm1.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f|rpc:scm4.org:9894|priority:0|startupRole:FOLLOWER],
listeners:[]
2023-08-14 11:12:05,834 [IPC Server handler 76 on default port 9863] INFO
server.RaftServer$Division:
61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30-LeaderStateImpl:
startSetConfiguration
SetConfigurationRequest:client-E2E7B5FA32F0->61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30,
cid=21, seq=0, RW, null, SET_UNCONDITIONALLY,
servers:[61b328f4-df7e-4623-a2fe-9ceb8ce81b7c|rpc:scm2.org:9894|admin:|client:|dataStream:|priority:2|startupRole:FOLLOWER,
85b682d7-8fe7-4853-84d4-764f5b803819|rpc:scm3.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
24a88f8e-8ac3-49fd-946b-8313e286b62e|rpc:scm1.org:9894|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER,
39a08bc6-78df-46ca-a6a4-77ae1a1aeb4f|rpc:scm4.org:9894|priority:0|startupRole:FOLLOWER],
listeners:[]
{code}
But on 11:12:10, it sent out the rotation commit request after received only 3
SCM's acks for rotation preparation.
{code:java}
2023-08-14 11:12:10,373 [RootCARotationManager-Active] INFO
security.RootCARotationManager: numFromHADetails 3, numFromRatisServer 3
2023-08-14 11:12:10,382
[61b328f4-df7e-4623-a2fe-9ceb8ce81b7c@group-E8C128B9DF30-StateMachineUpdater]
INFO security.RootCARotationHandlerImpl: Received rotation commit command of
root certificate 4
{code}
In the current implementation,
RaftServer.Division.getRaftConf().getCurrentPeers().size() is called to get the
current peers count in the ratis group. It looks like ratis doesn't count
scm4.org in.
> cert-rotation acceptance test is failing randomly
> -------------------------------------------------
>
> Key: HDDS-9156
> URL: https://issues.apache.org/jira/browse/HDDS-9156
> Project: Apache Ozone
> Issue Type: Bug
> Components: Certificates, test
> Reporter: Siyao Meng
> Assignee: Sammi Chen
> Priority: Major
> Labels: pull-request-available
>
> -Despite there being no "FAIL"ed test items:-
> https://github.com/apache/ozone/actions/runs/5825414635/job/15797755458?pr=5172#logs
> ah. {{ozone admin scm roles}} command timed out:
> {code}
> ozone admin scm roles | grep scm4.org hasn't succeed yet
> Timed out waiting on ozone admin scm roles | grep scm4.org to be successful
> {code}
> cc [~pifta]
> Another failure:
> https://github.com/apache/ozone/actions/runs/5827776199/job/15804891746
> {code}
> Creating ozonesecure-ha_scm4.org_1 ... done
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is not available on scm4.org yet
> Port 9894 is available on scm4.org
> ==============================================================================
> Kinit :: Kinit test user
> ==============================================================================
> Kinit | PASS |
> ------------------------------------------------------------------------------
> Kinit :: Kinit test user | PASS |
> 1 test, 1 passed, 0 failed
> ==============================================================================
> Output: /tmp/smoketest/ozonesecure-ha/result/robot-5.xml
> ozone admin scm roles | grep scm4.org hasn't succeed yet
> ozone admin scm roles | grep scm4.org hasn't succeed yet
> ozone admin scm roles | grep scm4.org hasn't succeed yet
> ozone admin scm roles | grep scm4.org hasn't succeed yet
> ozone admin scm roles | grep scm4.org hasn't succeed yet
> Timed out waiting on ozone admin scm roles | grep scm4.org to be successful
> Stopping ozonesecure-ha_scm4.org_1 ...
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]