[ 
https://issues.apache.org/jira/browse/HDDS-13886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chirag goyal reassigned HDDS-13886:
-----------------------------------

    Assignee: chirag goyal

> Improve log message for SCM bootstrap failure
> ---------------------------------------------
>
>                 Key: HDDS-13886
>                 URL: https://issues.apache.org/jira/browse/HDDS-13886
>             Project: Apache Ozone
>          Issue Type: Improvement
>            Reporter: Sammi Chen
>            Assignee: chirag goyal
>            Priority: Major
>
> Recently hit a SCM bootstrap failure case,  due to a SCM property 
> "ozone.scm.service.ids " is mis-spelled as "ozone.scm.service.id", here is 
> the "ozone scm --bootstrap" output, "--loglevel DEBUG" also didn't show more 
> info. 
> {code:java}
> ************************************************************/
> 2025-11-04 01:54:05,805 [main] INFO server.StorageContainerManagerStarter: 
> registered UNIX signal handlers for [SIGHUP, SIGINT, SIGTERM]
> 2025-11-04 01:54:05,935 [main] INFO reflections.Reflections: Reflections took 
> 88 ms to scan 3 urls, producing 131 keys and 298 values
> 2025-11-04 01:54:05,978 [main] INFO ha.SCMHANodeDetails: ServiceID for 
> StorageContainerManager is null
> 2025-11-04 01:54:05,978 [main] INFO ha.SCMHANodeDetails: 
> ozone.scm.default.service.id is not defined, falling back to 
> ozone.scm.service.ids to find serviceID for StorageContainerManager if it is 
> HA enabled cluster
> 2025-11-04 01:54:05,989 [main] DEBUG security.SecurityUtil: Setting 
> hadoop.security.token.service.use_ip to true
> 2025-11-04 01:54:06,212 [main] DEBUG lib.MutableMetricsFactory: field 
> org.apache.hadoop.metrics2.lib.MutableRate 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with 
> annotation @org.apache.hadoop.metrics2.annotation.Metric(sampleName=Ops, 
> always=false, valueName=Time, about=, interval=10, type=DEFAULT, 
> value=[GetGroups])
> 2025-11-04 01:54:06,215 [main] DEBUG lib.MutableMetricsFactory: field 
> org.apache.hadoop.metrics2.lib.MutableRate 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with 
> annotation @org.apache.hadoop.metrics2.annotation.Metric(sampleName=Ops, 
> always=false, valueName=Time, about=, interval=10, type=DEFAULT, value=[Rate 
> of failed kerberos logins and latency (milliseconds)])
> 2025-11-04 01:54:06,215 [main] DEBUG lib.MutableMetricsFactory: field 
> org.apache.hadoop.metrics2.lib.MutableRate 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with 
> annotation @org.apache.hadoop.metrics2.annotation.Metric(sampleName=Ops, 
> always=false, valueName=Time, about=, interval=10, type=DEFAULT, value=[Rate 
> of successful kerberos logins and latency (milliseconds)])
> 2025-11-04 01:54:06,215 [main] DEBUG lib.MutableMetricsFactory: field private 
> org.apache.hadoop.metrics2.lib.MutableGaugeInt 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.renewalFailures 
> with annotation @org.apache.hadoop.metrics2.annotation.Metric(sampleName=Ops, 
> always=false, valueName=Time, about=, interval=10, type=DEFAULT, 
> value=[Renewal failures since last successful login])
> 2025-11-04 01:54:06,215 [main] DEBUG lib.MutableMetricsFactory: field private 
> org.apache.hadoop.metrics2.lib.MutableGaugeLong 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.renewalFailuresTotal
>  with annotation 
> @org.apache.hadoop.metrics2.annotation.Metric(sampleName=Ops, always=false, 
> valueName=Time, about=, interval=10, type=DEFAULT, value=[Renewal failures 
> since startup])
> 2025-11-04 01:54:06,216 [main] DEBUG impl.MetricsSystemImpl: UgiMetrics, User 
> and group related metrics
> 2025-11-04 01:54:06,228 [main] DEBUG security.Groups:  Creating new Groups 
> object
> 2025-11-04 01:54:06,229 [main] DEBUG util.PerformanceAdvisory: Falling back 
> to shell based
> 2025-11-04 01:54:06,229 [main] DEBUG 
> security.JniBasedUnixGroupsMappingWithFallback: Group mapping 
> impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping
> 2025-11-04 01:54:06,230 [main] DEBUG security.Groups: Group mapping 
> impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback; 
> cacheTimeout=300000; warningDeltaMs=5000
> 2025-11-04 01:54:06,232 [main] DEBUG security.UserGroupInformation: Hadoop 
> login
> 2025-11-04 01:54:06,233 [main] DEBUG security.UserGroupInformation: hadoop 
> login commit
> 2025-11-04 01:54:06,234 [main] DEBUG security.UserGroupInformation: Using 
> local user: UnixPrincipal: root
> 2025-11-04 01:54:06,234 [main] DEBUG security.UserGroupInformation: Using 
> user: "UnixPrincipal: root" with name: root
> 2025-11-04 01:54:06,234 [main] DEBUG security.UserGroupInformation: User 
> entry: "root"
> 2025-11-04 01:54:06,234 [main] DEBUG security.UserGroupInformation: UGI 
> loginUser: root (auth:SIMPLE)
> Failed to get SCM info
> 2025-11-04 01:54:06,239 [shutdown-hook-0] INFO 
> server.StorageContainerManagerStarter: SHUTDOWN_MSG: 
> /************************************************************
> SHUTDOWN_MSG: Shutting down StorageContainerManager at 
> ozone-210-2.vpc.cloudera.com/10.65.55.1
> ************************************************************/
> 2025-11-04 01:54:06,240 [Thread-2] DEBUG util.ShutdownHookManager: Completed 
> shutdown in 0.001 seconds; Timeouts: 0
> 2025-11-04 01:54:06,249 [Thread-2] DEBUG util.ShutdownHookManager: 
> ShutdownHookManager completed shutdown.
> {code}
> The exception thrown by HAUtils.getScmInfo is not cached in 
> StorageContainerManagerStarter#bootStrapScm. After catch the exception and 
> this the stack, 
> {code:java}
> 2025-11-04 02:24:57,491 [main] INFO server.StorageContainerManagerStarter: 
> registered UNIX signal handlers for [SIGHUP, SIGINT, SIGTERM]
> 2025-11-04 02:24:57,619 [main] INFO reflections.Reflections: Reflections took 
> 88 ms to scan 3 urls, producing 131 keys and 298 values
> 2025-11-04 02:24:57,659 [main] INFO ha.SCMHANodeDetails: ServiceID for 
> StorageContainerManager is null
> 2025-11-04 02:24:57,659 [main] INFO ha.SCMHANodeDetails: 
> ozone.scm.default.service.id is not defined, falling back to 
> ozone.scm.service.ids to find serviceID for StorageContainerManager if it is 
> HA enabled cluster
> 2025-11-04 02:24:57,916 [main] ERROR server.StorageContainerManager: Could 
> not fetch SCM Info from existing SCM instances.
> java.io.IOException: Failed to get SCM info
>       at org.apache.hadoop.hdds.utils.HAUtils.getScmInfo(HAUtils.java:103)
>       at 
> org.apache.hadoop.hdds.scm.server.StorageContainerManager.scmBootstrap(StorageContainerManager.java:1174)
>       at 
> org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper.bootStrap(StorageContainerManagerStarter.java:188)
>       at 
> org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.bootStrapScm(StorageContainerManagerStarter.java:131)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at picocli.CommandLine.executeUserObject(CommandLine.java:2070)
>       at picocli.CommandLine.access$1500(CommandLine.java:148)
>       at 
> picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2469)
>       at picocli.CommandLine$RunLast.handle(CommandLine.java:2461)
>       at picocli.CommandLine$RunLast.handle(CommandLine.java:2423)
>       at 
> picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
>       at picocli.CommandLine$RunLast.execute(CommandLine.java:2425)
>       at picocli.CommandLine.execute(CommandLine.java:2174)
>       at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:89)
>       at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:80)
>       at 
> org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.main(StorageContainerManagerStarter.java:59)
> Caused by: java.lang.IllegalArgumentException: Does not contain a valid 
> host:port authority: 
> ozone-210-1.vpc.cloudera.com,ozone-210-2.vpc.cloudera.com,ozone-210-3.vpc.cloudera.com:9863
>       at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:253)
>       at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:226)
>       at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:203)
>       at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:182)
>       at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:163)
>       at 
> org.apache.hadoop.hdds.scm.proxy.SCMFailoverProxyProviderBase.loadConfigs(SCMFailoverProxyProviderBase.java:158)
>       at 
> org.apache.hadoop.hdds.scm.proxy.SCMFailoverProxyProviderBase.<init>(SCMFailoverProxyProviderBase.java:111)
>       at 
> org.apache.hadoop.hdds.scm.proxy.SCMBlockLocationFailoverProxyProvider.<init>(SCMBlockLocationFailoverProxyProvider.java:35)
>       at 
> org.apache.hadoop.hdds.utils.HAUtils.getScmBlockClient(HAUtils.java:132)
>       at org.apache.hadoop.hdds.utils.HAUtils.getScmInfo(HAUtils.java:99)
>       ... 18 more
> scm bootstrap failed
> 2025-11-04 02:24:57,920 [shutdown-hook-0] INFO 
> server.StorageContainerManagerStarter: SHUTDOWN_MSG: 
> /************************************************************
> SHUTDOWN_MSG: Shutting down StorageContainerManager at 
> ozone-210-2.vpc.cloudera.com/10.65.55.1
> ************************************************************/
> {code}
> Finally I figured out that is because SCM property "ozone.scm.service.ids " 
> is mis-spelled as "ozone.scm.service.id".  The current bootstrap failure 
> message is not clear enough for the issue investigation, we should improve 
> it. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to