[
https://issues.apache.org/jira/browse/HDDS-13886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
chirag goyal reassigned HDDS-13886:
-----------------------------------
Assignee: chirag goyal
> Improve log message for SCM bootstrap failure
> ---------------------------------------------
>
> Key: HDDS-13886
> URL: https://issues.apache.org/jira/browse/HDDS-13886
> Project: Apache Ozone
> Issue Type: Improvement
> Reporter: Sammi Chen
> Assignee: chirag goyal
> Priority: Major
>
> Recently hit a SCM bootstrap failure case, due to a SCM property
> "ozone.scm.service.ids " is mis-spelled as "ozone.scm.service.id", here is
> the "ozone scm --bootstrap" output, "--loglevel DEBUG" also didn't show more
> info.
> {code:java}
> ************************************************************/
> 2025-11-04 01:54:05,805 [main] INFO server.StorageContainerManagerStarter:
> registered UNIX signal handlers for [SIGHUP, SIGINT, SIGTERM]
> 2025-11-04 01:54:05,935 [main] INFO reflections.Reflections: Reflections took
> 88 ms to scan 3 urls, producing 131 keys and 298 values
> 2025-11-04 01:54:05,978 [main] INFO ha.SCMHANodeDetails: ServiceID for
> StorageContainerManager is null
> 2025-11-04 01:54:05,978 [main] INFO ha.SCMHANodeDetails:
> ozone.scm.default.service.id is not defined, falling back to
> ozone.scm.service.ids to find serviceID for StorageContainerManager if it is
> HA enabled cluster
> 2025-11-04 01:54:05,989 [main] DEBUG security.SecurityUtil: Setting
> hadoop.security.token.service.use_ip to true
> 2025-11-04 01:54:06,212 [main] DEBUG lib.MutableMetricsFactory: field
> org.apache.hadoop.metrics2.lib.MutableRate
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with
> annotation @org.apache.hadoop.metrics2.annotation.Metric(sampleName=Ops,
> always=false, valueName=Time, about=, interval=10, type=DEFAULT,
> value=[GetGroups])
> 2025-11-04 01:54:06,215 [main] DEBUG lib.MutableMetricsFactory: field
> org.apache.hadoop.metrics2.lib.MutableRate
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with
> annotation @org.apache.hadoop.metrics2.annotation.Metric(sampleName=Ops,
> always=false, valueName=Time, about=, interval=10, type=DEFAULT, value=[Rate
> of failed kerberos logins and latency (milliseconds)])
> 2025-11-04 01:54:06,215 [main] DEBUG lib.MutableMetricsFactory: field
> org.apache.hadoop.metrics2.lib.MutableRate
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with
> annotation @org.apache.hadoop.metrics2.annotation.Metric(sampleName=Ops,
> always=false, valueName=Time, about=, interval=10, type=DEFAULT, value=[Rate
> of successful kerberos logins and latency (milliseconds)])
> 2025-11-04 01:54:06,215 [main] DEBUG lib.MutableMetricsFactory: field private
> org.apache.hadoop.metrics2.lib.MutableGaugeInt
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.renewalFailures
> with annotation @org.apache.hadoop.metrics2.annotation.Metric(sampleName=Ops,
> always=false, valueName=Time, about=, interval=10, type=DEFAULT,
> value=[Renewal failures since last successful login])
> 2025-11-04 01:54:06,215 [main] DEBUG lib.MutableMetricsFactory: field private
> org.apache.hadoop.metrics2.lib.MutableGaugeLong
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.renewalFailuresTotal
> with annotation
> @org.apache.hadoop.metrics2.annotation.Metric(sampleName=Ops, always=false,
> valueName=Time, about=, interval=10, type=DEFAULT, value=[Renewal failures
> since startup])
> 2025-11-04 01:54:06,216 [main] DEBUG impl.MetricsSystemImpl: UgiMetrics, User
> and group related metrics
> 2025-11-04 01:54:06,228 [main] DEBUG security.Groups: Creating new Groups
> object
> 2025-11-04 01:54:06,229 [main] DEBUG util.PerformanceAdvisory: Falling back
> to shell based
> 2025-11-04 01:54:06,229 [main] DEBUG
> security.JniBasedUnixGroupsMappingWithFallback: Group mapping
> impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping
> 2025-11-04 01:54:06,230 [main] DEBUG security.Groups: Group mapping
> impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback;
> cacheTimeout=300000; warningDeltaMs=5000
> 2025-11-04 01:54:06,232 [main] DEBUG security.UserGroupInformation: Hadoop
> login
> 2025-11-04 01:54:06,233 [main] DEBUG security.UserGroupInformation: hadoop
> login commit
> 2025-11-04 01:54:06,234 [main] DEBUG security.UserGroupInformation: Using
> local user: UnixPrincipal: root
> 2025-11-04 01:54:06,234 [main] DEBUG security.UserGroupInformation: Using
> user: "UnixPrincipal: root" with name: root
> 2025-11-04 01:54:06,234 [main] DEBUG security.UserGroupInformation: User
> entry: "root"
> 2025-11-04 01:54:06,234 [main] DEBUG security.UserGroupInformation: UGI
> loginUser: root (auth:SIMPLE)
> Failed to get SCM info
> 2025-11-04 01:54:06,239 [shutdown-hook-0] INFO
> server.StorageContainerManagerStarter: SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down StorageContainerManager at
> ozone-210-2.vpc.cloudera.com/10.65.55.1
> ************************************************************/
> 2025-11-04 01:54:06,240 [Thread-2] DEBUG util.ShutdownHookManager: Completed
> shutdown in 0.001 seconds; Timeouts: 0
> 2025-11-04 01:54:06,249 [Thread-2] DEBUG util.ShutdownHookManager:
> ShutdownHookManager completed shutdown.
> {code}
> The exception thrown by HAUtils.getScmInfo is not cached in
> StorageContainerManagerStarter#bootStrapScm. After catch the exception and
> this the stack,
> {code:java}
> 2025-11-04 02:24:57,491 [main] INFO server.StorageContainerManagerStarter:
> registered UNIX signal handlers for [SIGHUP, SIGINT, SIGTERM]
> 2025-11-04 02:24:57,619 [main] INFO reflections.Reflections: Reflections took
> 88 ms to scan 3 urls, producing 131 keys and 298 values
> 2025-11-04 02:24:57,659 [main] INFO ha.SCMHANodeDetails: ServiceID for
> StorageContainerManager is null
> 2025-11-04 02:24:57,659 [main] INFO ha.SCMHANodeDetails:
> ozone.scm.default.service.id is not defined, falling back to
> ozone.scm.service.ids to find serviceID for StorageContainerManager if it is
> HA enabled cluster
> 2025-11-04 02:24:57,916 [main] ERROR server.StorageContainerManager: Could
> not fetch SCM Info from existing SCM instances.
> java.io.IOException: Failed to get SCM info
> at org.apache.hadoop.hdds.utils.HAUtils.getScmInfo(HAUtils.java:103)
> at
> org.apache.hadoop.hdds.scm.server.StorageContainerManager.scmBootstrap(StorageContainerManager.java:1174)
> at
> org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper.bootStrap(StorageContainerManagerStarter.java:188)
> at
> org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.bootStrapScm(StorageContainerManagerStarter.java:131)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at picocli.CommandLine.executeUserObject(CommandLine.java:2070)
> at picocli.CommandLine.access$1500(CommandLine.java:148)
> at
> picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2469)
> at picocli.CommandLine$RunLast.handle(CommandLine.java:2461)
> at picocli.CommandLine$RunLast.handle(CommandLine.java:2423)
> at
> picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
> at picocli.CommandLine$RunLast.execute(CommandLine.java:2425)
> at picocli.CommandLine.execute(CommandLine.java:2174)
> at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:89)
> at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:80)
> at
> org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.main(StorageContainerManagerStarter.java:59)
> Caused by: java.lang.IllegalArgumentException: Does not contain a valid
> host:port authority:
> ozone-210-1.vpc.cloudera.com,ozone-210-2.vpc.cloudera.com,ozone-210-3.vpc.cloudera.com:9863
> at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:253)
> at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:226)
> at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:203)
> at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:182)
> at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:163)
> at
> org.apache.hadoop.hdds.scm.proxy.SCMFailoverProxyProviderBase.loadConfigs(SCMFailoverProxyProviderBase.java:158)
> at
> org.apache.hadoop.hdds.scm.proxy.SCMFailoverProxyProviderBase.<init>(SCMFailoverProxyProviderBase.java:111)
> at
> org.apache.hadoop.hdds.scm.proxy.SCMBlockLocationFailoverProxyProvider.<init>(SCMBlockLocationFailoverProxyProvider.java:35)
> at
> org.apache.hadoop.hdds.utils.HAUtils.getScmBlockClient(HAUtils.java:132)
> at org.apache.hadoop.hdds.utils.HAUtils.getScmInfo(HAUtils.java:99)
> ... 18 more
> scm bootstrap failed
> 2025-11-04 02:24:57,920 [shutdown-hook-0] INFO
> server.StorageContainerManagerStarter: SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down StorageContainerManager at
> ozone-210-2.vpc.cloudera.com/10.65.55.1
> ************************************************************/
> {code}
> Finally I figured out that is because SCM property "ozone.scm.service.ids "
> is mis-spelled as "ozone.scm.service.id". The current bootstrap failure
> message is not clear enough for the issue investigation, we should improve
> it.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]