[ 
https://issues.apache.org/jira/browse/HDDS-13234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDDS-13234:
-----------------------------------
    Description: 
Found a bug where an expired secret key can abort leader OM startup.

First, the leader OM crashed due to RATIS-1873.

Then the leader OM tried to start but failed:
{noformat}
2025-06-06 06:59:44,499 ERROR [main]-org.apache.hadoop.ozone.om.OzoneManagerStarter: OM start failed with exception
java.lang.NullPointerException
        at org.apache.hadoop.ozone.security.OzoneDelegationTokenSecretManager.addPersistedDelegationToken(OzoneDelegationTokenSecretManager.java:575)
        at org.apache.hadoop.ozone.security.OzoneDelegationTokenSecretManager.loadTokenSecretState(OzoneDelegationTokenSecretManager.java:560)
        at org.apache.hadoop.ozone.security.OzoneDelegationTokenSecretManager.<init>(OzoneDelegationTokenSecretManager.java:112)
        at org.apache.hadoop.ozone.security.OzoneDelegationTokenSecretManager$Builder.build(OzoneDelegationTokenSecretManager.java:131)
        at org.apache.hadoop.ozone.om.OzoneManager.createDelegationTokenSecretManager(OzoneManager.java:1055)
        at org.apache.hadoop.ozone.om.OzoneManager.instantiateServices(OzoneManager.java:831)
        at org.apache.hadoop.ozone.om.OzoneManager.<init>(OzoneManager.java:674)
        at org.apache.hadoop.ozone.om.OzoneManager.createOm(OzoneManager.java:759)
        at org.apache.hadoop.ozone.om.OzoneManagerStarter$OMStarterHelper.start(OzoneManagerStarter.java:189)
        at org.apache.hadoop.ozone.om.OzoneManagerStarter.startOm(OzoneManagerStarter.java:86)
        at org.apache.hadoop.ozone.om.OzoneManagerStarter.call(OzoneManagerStarter.java:74)
        at org.apache.hadoop.hdds.cli.GenericCli.call(GenericCli.java:38)
        at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
        at picocli.CommandLine.access$1300(CommandLine.java:145)
        at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
        at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
        at picocli.CommandLine.execute(CommandLine.java:2078)
        at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:103)
        at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:94)
        at org.apache.hadoop.ozone.om.OzoneManagerStarter.main(OzoneManagerStarter.java:58)
2025-06-06 06:59:44,503 INFO [shutdown-hook-0]-org.apache.hadoop.ozone.om.OzoneManagerStarter: SHUTDOWN_MSG: 
{noformat}

What happened was:
1. OM loads delegation tokens at startup.
2. While loading delegation tokens, OM asks SCM for the secret keys associated 
with them.
3. If a secret key has already expired, SCM has removed it, so OM's request 
returns null, which results in a NullPointerException that aborts OM startup.

What we should do:
1. If the secret key has expired, ignore the delegation token so that OM 
startup can proceed.
2. The OM delegation token secret manager will remove the delegation token 
(dt) later, once it expires.
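
The proposed handling can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual Ozone code: TokenLoadSketch, PersistedToken, SecretKeyLookup and loadTokens are hypothetical stand-ins, and the only behavior taken from this report is that the key lookup returns null when SCM has already removed an expired secret key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TokenLoadSketch {

    /** Hypothetical stand-in for a delegation token persisted by OM. */
    static final class PersistedToken {
        final String tokenId;
        final String secretKeyId;

        PersistedToken(String tokenId, String secretKeyId) {
            this.tokenId = tokenId;
            this.secretKeyId = secretKeyId;
        }
    }

    /**
     * Hypothetical stand-in for the secret key inquiry sent to SCM.
     * Assumption: returns null when the key has expired and was removed.
     */
    interface SecretKeyLookup {
        byte[] getSecretKey(String secretKeyId);
    }

    /**
     * Loads persisted tokens, skipping any token whose secret key is gone
     * instead of dereferencing the null key and aborting startup.
     */
    static List<String> loadTokens(List<PersistedToken> persisted,
                                   SecretKeyLookup lookup) {
        List<String> loaded = new ArrayList<>();
        for (PersistedToken token : persisted) {
            byte[] key = lookup.getSecretKey(token.secretKeyId);
            if (key == null) {
                // Secret key expired: ignore this token so startup proceeds.
                System.err.println("Skipping token " + token.tokenId
                    + ": secret key " + token.secretKeyId + " not found");
                continue;
            }
            loaded.add(token.tokenId);
        }
        return loaded;
    }

    public static void main(String[] args) {
        Map<String, byte[]> keys = new HashMap<>();
        keys.put("key-live", new byte[] {1});

        List<PersistedToken> persisted = Arrays.asList(
            new PersistedToken("dt-1", "key-live"),
            new PersistedToken("dt-2", "key-expired"));

        // Map.get returns null for the missing key, like the SCM lookup.
        System.out.println(loadTokens(persisted, keys::get)); // prints [dt-1]
    }
}
```

In this sketch the skipped token is only logged and left in place, which matches the expectation above that the secret manager removes it later on its own.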

  was:
Found a bug where expired delegation tokens can abort leader OM startup.


> Expired secret key can abort leader OM startup
> ----------------------------------------------
>
>                 Key: HDDS-13234
>                 URL: https://issues.apache.org/jira/browse/HDDS-13234
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>            Priority: Major


