duongkame opened a new pull request, #5068:
URL: https://github.com/apache/ozone/pull/5068
## What changes were proposed in this pull request?
When a Datanode or OM starts up before SCM has finished initializing its
secret keys, startup fails because the Datanode/OM must prefetch the current
active secret key.
```
2023-07-13 16:06:22,369 [main] ERROR org.apache.hadoop.ozone.HddsDatanodeService: Exception in HddsDatanodeService.
java.lang.RuntimeException: Can't start the HDDS datanode plugin
    at org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:361)
    at org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:235)
    at org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:203)
    at org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:93)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
    at picocli.CommandLine.access$1300(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
    at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
    at picocli.CommandLine.execute(CommandLine.java:2078)
    at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:100)
    at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:91)
    at org.apache.hadoop.ozone.HddsDatanodeService.main(HddsDatanodeService.java:185)
Caused by: org.apache.hadoop.hdds.security.exception.SCMSecretKeyException: Secret key initialization is not finished yet.
    at org.apache.hadoop.hdds.protocolPB.SecretKeyProtocolClientSideTranslatorPB.handleError(SecretKeyProtocolClientSideTranslatorPB.java:101)
    at org.apache.hadoop.hdds.protocolPB.SecretKeyProtocolClientSideTranslatorPB.submitRequest(SecretKeyProtocolClientSideTranslatorPB.java:89)
    at org.apache.hadoop.hdds.protocolPB.SecretKeyProtocolClientSideTranslatorPB.getCurrentSecretKey(SecretKeyProtocolClientSideTranslatorPB.java:127)
    at org.apache.hadoop.hdds.security.symmetric.DefaultSecretKeySignerClient.start(DefaultSecretKeySignerClient.java:72)
    at org.apache.hadoop.hdds.security.symmetric.DefaultSecretKeyClient.start(DefaultSecretKeyClient.java:50)
    at org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:312)
    ... 13 more
2023-07-13 16:06:22,382 [shutdown-hook-0] INFO org.apache.hadoop.ozone.HddsDatanodeService: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down HddsDatanodeService at ....
************************************************************/
```
The current active secret key is mandatory for Datanode/OM startup because
they need it in their background processes, e.g. EC reconstruction. The
solution is to apply retries to the active secret key prefetch if SCM has not
initialized secret keys yet.
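As a rough illustration of the retry idea (not the code in this patch), the sketch below retries a fetch call with a growing delay while a "not initialized yet" error is reported. The class `SecretKeyPrefetchRetry`, the exception `KeyNotInitializedException`, and all parameters are hypothetical names introduced here. The actual change may use different utilities and a different backoff policy.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.function.Predicate;

/** Hypothetical helper for illustration only; not an Ozone class. */
public final class SecretKeyPrefetchRetry {

  /** Stand-in for an "SCM has not initialized secret keys yet" error. */
  public static final class KeyNotInitializedException extends Exception {
    public KeyNotInitializedException(String message) {
      super(message);
    }
  }

  private SecretKeyPrefetchRetry() {
  }

  /**
   * Calls the task until it succeeds, a non-retriable exception is thrown,
   * or maxAttempts is reached. The delay doubles after each failed attempt
   * and is capped at maxDelay.
   */
  public static <T> T callWithRetry(Callable<T> task,
      Predicate<Exception> retriable, int maxAttempts,
      Duration initialDelay, Duration maxDelay) throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    long delayMs = initialDelay.toMillis();
    for (int attempt = 1; ; attempt++) {
      try {
        return task.call();
      } catch (Exception e) {
        // Give up on unexpected errors or when attempts are exhausted.
        if (!retriable.test(e) || attempt >= maxAttempts) {
          throw e;
        }
        Thread.sleep(delayMs);
        delayMs = Math.min(delayMs * 2, maxDelay.toMillis());
      }
    }
  }
}
```

A caller would wrap the prefetch in `callWithRetry` and pass a predicate that matches only the "not initialized" error, so that other failures (for example authentication errors) still fail startup immediately.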
## What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-9020
## How was this patch tested?
I added custom code to SCM to delay secret key initialization, started a
Docker cluster, and verified that datanodes/OMs retry on SCM secret key
exceptions until they succeed.
```
2023-07-14 17:45:18,748 [main] INFO symmetric.DefaultSecretKeyVerifierClient: Initializing secret key cache with size 16, TTL PT168H
2023-07-14 17:45:18,875 [main] INFO utils.RetriableTask: Execution of task getCurrentSecretKey failed, will be retried in 4000 ms
2023-07-14 17:45:22,882 [main] INFO utils.RetriableTask: Execution of task getCurrentSecretKey failed, will be retried in 6000 ms
2023-07-14 17:45:28,895 [main] INFO utils.RetriableTask: Execution of task getCurrentSecretKey failed, will be retried in 19000 ms
2023-07-14 17:45:47,938 [main] INFO utils.RetriableTask: Execution of task getCurrentSecretKey failed, will be retried in 32000 ms
2023-07-14 17:46:19,972 [main] INFO utils.RetriableTask: Execution of task getCurrentSecretKey failed, will be retried in 91000 ms
2023-07-14 17:47:51,007 [main] INFO symmetric.DefaultSecretKeySignerClient: Initial secret key fetched from SCM: SecretKey(id = f937b502-faca-4505-a894-72c19dde972b, creation at: 2023-07-14T17:47:16.371Z, expire at: 2023-07-21T17:47:16.371Z).
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]