[ https://issues.apache.org/jira/browse/HDDS-13234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wei-Chiu Chuang updated HDDS-13234:
-----------------------------------
Description:
Found a bug where an expired secret key can abort leader OM startup.
First, the leader OM crashed due to RATIS-1873.
Then the leader OM tried to restart but failed:
{noformat}
2025-06-06 06:59:44,499 ERROR [main]-org.apache.hadoop.ozone.om.OzoneManagerStarter: OM start failed with exception
java.lang.NullPointerException
    at org.apache.hadoop.ozone.security.OzoneDelegationTokenSecretManager.addPersistedDelegationToken(OzoneDelegationTokenSecretManager.java:575)
    at org.apache.hadoop.ozone.security.OzoneDelegationTokenSecretManager.loadTokenSecretState(OzoneDelegationTokenSecretManager.java:560)
    at org.apache.hadoop.ozone.security.OzoneDelegationTokenSecretManager.<init>(OzoneDelegationTokenSecretManager.java:112)
    at org.apache.hadoop.ozone.security.OzoneDelegationTokenSecretManager$Builder.build(OzoneDelegationTokenSecretManager.java:131)
    at org.apache.hadoop.ozone.om.OzoneManager.createDelegationTokenSecretManager(OzoneManager.java:1055)
    at org.apache.hadoop.ozone.om.OzoneManager.instantiateServices(OzoneManager.java:831)
    at org.apache.hadoop.ozone.om.OzoneManager.<init>(OzoneManager.java:674)
    at org.apache.hadoop.ozone.om.OzoneManager.createOm(OzoneManager.java:759)
    at org.apache.hadoop.ozone.om.OzoneManagerStarter$OMStarterHelper.start(OzoneManagerStarter.java:189)
    at org.apache.hadoop.ozone.om.OzoneManagerStarter.startOm(OzoneManagerStarter.java:86)
    at org.apache.hadoop.ozone.om.OzoneManagerStarter.call(OzoneManagerStarter.java:74)
    at org.apache.hadoop.hdds.cli.GenericCli.call(GenericCli.java:38)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
    at picocli.CommandLine.access$1300(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
    at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
    at picocli.CommandLine.execute(CommandLine.java:2078)
    at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:103)
    at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:94)
    at org.apache.hadoop.ozone.om.OzoneManagerStarter.main(OzoneManagerStarter.java:58)
2025-06-06 06:59:44,503 INFO [shutdown-hook-0]-org.apache.hadoop.ozone.om.OzoneManagerStarter: SHUTDOWN_MSG:
{noformat}
What happened:
1. OM loads delegation tokens at startup.
2. While loading delegation tokens, OM asks SCM for the secret keys associated
with those tokens.
3. If a secret key has already expired, SCM has removed it, so OM's request
returns null, which results in a NullPointerException that aborts OM startup.
What we should do:
1. If the secret key has expired, ignore the delegation token so that OM
startup can proceed.
2. The OM delegation token secret manager will remove the delegation token (dt)
later, when it expires.
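The proposed handling could look roughly like the sketch below. It is not the
actual Ozone code: TokenLoadSketch, loadTokens, and the keyLookup function
(which stands in for the OM-to-SCM secret key inquiry) are hypothetical names
used only for illustration. The point is simply that a null lookup result
leads to the token being skipped rather than dereferenced.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch; none of these names are real Ozone APIs. */
public class TokenLoadSketch {

  /**
   * Loads persisted delegation tokens, skipping any whose secret key is gone.
   *
   * @param tokenToKeyId persisted token ID mapped to the ID of its secret key
   * @param keyLookup    stand-in for the secret key inquiry; returns null when
   *                     the key has expired and been removed
   * @param skipped      collects the IDs of tokens that were ignored
   * @return token ID mapped to its secret key, for tokens that could be loaded
   */
  public static Map<String, byte[]> loadTokens(
      Map<String, String> tokenToKeyId,
      Function<String, byte[]> keyLookup,
      List<String> skipped) {
    Map<String, byte[]> loaded = new HashMap<>();
    for (Map.Entry<String, String> entry : tokenToKeyId.entrySet()) {
      byte[] key = keyLookup.apply(entry.getValue());
      if (key == null) {
        // Secret key no longer exists: ignore this token so startup proceeds.
        skipped.add(entry.getKey());
        continue;
      }
      loaded.put(entry.getKey(), key);
    }
    return loaded;
  }

  public static void main(String[] args) {
    // Only "key-1" is still known; "key-expired" has been removed.
    Map<String, byte[]> knownKeys = new HashMap<>();
    knownKeys.put("key-1", new byte[] {1, 2, 3});

    Map<String, String> persisted = new LinkedHashMap<>();
    persisted.put("dt-1", "key-1");
    persisted.put("dt-2", "key-expired");

    List<String> skipped = new ArrayList<>();
    Map<String, byte[]> loaded = loadTokens(persisted, knownKeys::get, skipped);
    System.out.println("loaded=" + loaded.keySet() + " skipped=" + skipped);
  }
}
```

Logging each skipped token at WARN level would probably be worthwhile in a
real fix, so that operators can tell why a token was not restored.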
was:
Found a bug where expired delegation tokens can abort leader OM startup.
First, Leader OM crashed due to RATIS-1873.
And then, leader OM tried to start but failed:
{noformat}
2025-06-06 06:59:44,499 ERROR [main]-org.apache.hadoop.ozone.om.OzoneManagerStarter: OM start failed with exception
java.lang.NullPointerException
    at org.apache.hadoop.ozone.security.OzoneDelegationTokenSecretManager.addPersistedDelegationToken(OzoneDelegationTokenSecretManager.java:575)
    at org.apache.hadoop.ozone.security.OzoneDelegationTokenSecretManager.loadTokenSecretState(OzoneDelegationTokenSecretManager.java:560)
    at org.apache.hadoop.ozone.security.OzoneDelegationTokenSecretManager.<init>(OzoneDelegationTokenSecretManager.java:112)
    at org.apache.hadoop.ozone.security.OzoneDelegationTokenSecretManager$Builder.build(OzoneDelegationTokenSecretManager.java:131)
    at org.apache.hadoop.ozone.om.OzoneManager.createDelegationTokenSecretManager(OzoneManager.java:1055)
    at org.apache.hadoop.ozone.om.OzoneManager.instantiateServices(OzoneManager.java:831)
    at org.apache.hadoop.ozone.om.OzoneManager.<init>(OzoneManager.java:674)
    at org.apache.hadoop.ozone.om.OzoneManager.createOm(OzoneManager.java:759)
    at org.apache.hadoop.ozone.om.OzoneManagerStarter$OMStarterHelper.start(OzoneManagerStarter.java:189)
    at org.apache.hadoop.ozone.om.OzoneManagerStarter.startOm(OzoneManagerStarter.java:86)
    at org.apache.hadoop.ozone.om.OzoneManagerStarter.call(OzoneManagerStarter.java:74)
    at org.apache.hadoop.hdds.cli.GenericCli.call(GenericCli.java:38)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
    at picocli.CommandLine.access$1300(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
    at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
    at picocli.CommandLine.execute(CommandLine.java:2078)
    at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:103)
    at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:94)
    at org.apache.hadoop.ozone.om.OzoneManagerStarter.main(OzoneManagerStarter.java:58)
2025-06-06 06:59:44,503 INFO [shutdown-hook-0]-org.apache.hadoop.ozone.om.OzoneManagerStarter: SHUTDOWN_MSG:
{noformat}
What happened was:
1. OM loads delegation tokens when startup.
2. When loading delegation tokens, OM sends an inquiry to SCM asking for the
secret keys associated with the delegation tokens.
3. If the secret key already expires, SCM removed it and OM's request returns a
null, which results in a NullPointerException that aborts OM.
What we should do:
1. if secret key expires, ignore the delegation token, so that OM startup can
proceed.
2. OM delegation token secret manager will remove the dt later because it
expires.
> Expired secret key can abort leader OM startup
> ----------------------------------------------
>
> Key: HDDS-13234
> URL: https://issues.apache.org/jira/browse/HDDS-13234
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Wei-Chiu Chuang
> Assignee: Wei-Chiu Chuang
> Priority: Major