Wei-Chiu Chuang created HDDS-13234:
--------------------------------------

             Summary: Expired delegation tokens can abort leader OM startup
                 Key: HDDS-13234
                 URL: https://issues.apache.org/jira/browse/HDDS-13234
             Project: Apache Ozone
          Issue Type: Bug
            Reporter: Wei-Chiu Chuang


Found a bug where expired delegation tokens can abort leader OM startup.

First, the leader OM crashed due to RATIS-1873.

Then the leader OM tried to restart but failed:
{noformat}
2025-06-06 06:59:44,499 ERROR [main]-org.apache.hadoop.ozone.om.OzoneManagerStarter: OM start failed with exception
java.lang.NullPointerException
        at org.apache.hadoop.ozone.security.OzoneDelegationTokenSecretManager.addPersistedDelegationToken(OzoneDelegationTokenSecretManager.java:575)
        at org.apache.hadoop.ozone.security.OzoneDelegationTokenSecretManager.loadTokenSecretState(OzoneDelegationTokenSecretManager.java:560)
        at org.apache.hadoop.ozone.security.OzoneDelegationTokenSecretManager.<init>(OzoneDelegationTokenSecretManager.java:112)
        at org.apache.hadoop.ozone.security.OzoneDelegationTokenSecretManager$Builder.build(OzoneDelegationTokenSecretManager.java:131)
        at org.apache.hadoop.ozone.om.OzoneManager.createDelegationTokenSecretManager(OzoneManager.java:1055)
        at org.apache.hadoop.ozone.om.OzoneManager.instantiateServices(OzoneManager.java:831)
        at org.apache.hadoop.ozone.om.OzoneManager.<init>(OzoneManager.java:674)
        at org.apache.hadoop.ozone.om.OzoneManager.createOm(OzoneManager.java:759)
        at org.apache.hadoop.ozone.om.OzoneManagerStarter$OMStarterHelper.start(OzoneManagerStarter.java:189)
        at org.apache.hadoop.ozone.om.OzoneManagerStarter.startOm(OzoneManagerStarter.java:86)
        at org.apache.hadoop.ozone.om.OzoneManagerStarter.call(OzoneManagerStarter.java:74)
        at org.apache.hadoop.hdds.cli.GenericCli.call(GenericCli.java:38)
        at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
        at picocli.CommandLine.access$1300(CommandLine.java:145)
        at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
        at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
        at picocli.CommandLine.execute(CommandLine.java:2078)
        at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:103)
        at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:94)
        at org.apache.hadoop.ozone.om.OzoneManagerStarter.main(OzoneManagerStarter.java:58)
2025-06-06 06:59:44,503 INFO [shutdown-hook-0]-org.apache.hadoop.ozone.om.OzoneManagerStarter: SHUTDOWN_MSG:
{noformat}

What happened:
1. OM loads the persisted delegation tokens at startup.
2. While loading them, OM asks SCM for the secret key associated with each
delegation token.
3. If a secret key has already expired, SCM has removed it, so OM's request
returns null. OM does not check for this, and the resulting
NullPointerException aborts OM startup.

What we should do:
1. If the secret key has expired, ignore the delegation token so that OM
startup can proceed.
2. The OM delegation token secret manager will remove the delegation token
later, because the token itself has expired.
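
The proposed handling can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual patch: TokenLoadSketch, PersistedToken, loadTokens and the secretKeyLookup function are hypothetical stand-ins, not real Ozone classes or methods. The lookup is assumed to return null when SCM no longer has the key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Illustrative sketch only; these are not the real Ozone classes. */
final class TokenLoadSketch {

    /** Hypothetical persisted token: its id and the id of the key that signed it. */
    static final class PersistedToken {
        final String tokenId;
        final String secretKeyId;

        PersistedToken(String tokenId, String secretKeyId) {
            this.tokenId = tokenId;
            this.secretKeyId = secretKeyId;
        }
    }

    private TokenLoadSketch() { }

    /**
     * Loads persisted tokens into {@code loaded}, skipping any token whose
     * secret key lookup returns null (assumed here to mean the key expired
     * and was removed). Returns the ids of the skipped tokens.
     */
    static List<String> loadTokens(List<PersistedToken> persisted,
                                   Function<String, byte[]> secretKeyLookup,
                                   Map<String, byte[]> loaded) {
        List<String> skipped = new ArrayList<>();
        for (PersistedToken token : persisted) {
            byte[] key = secretKeyLookup.apply(token.secretKeyId);
            if (key == null) {
                // Without this guard, key.clone() below would throw a
                // NullPointerException and abort the whole load.
                skipped.add(token.tokenId);
                continue;
            }
            loaded.put(token.tokenId, key.clone());
        }
        return skipped;
    }
}
```

In this sketch the skipped tokens are only reported, not deleted, on the assumption that the regular expired-token cleanup removes them later, as described in step 2 above.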



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
