[
https://issues.apache.org/jira/browse/HDDS-13353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18003711#comment-18003711
]
Ivan Andika edited comment on HDDS-13353 at 7/8/25 9:35 AM:
------------------------------------------------------------
[~nanda] This happens in one of our internal 1.2 clusters (our internal 1.4
clusters have never encountered this issue). The Hadoop version is based on the
3.3 release.
{quote}It looks like ScriptBasedMapping$RawScriptBasedMapping.runResolveCommand
already handles Exception and returns null, which is handled in
SCMNodeManager.{quote}
Yes, this should already have been handled properly. Previously I thought it
was because runResolveCommand was throwing an exception and terminating the
SCM handler thread, but after delving into the code, that does not seem to be
the case. I haven't been able to reproduce this issue, so we can put it on the
back burner for now. I'll take a closer look when I have time.
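For reference, the contract discussed above can be sketched as follows. This is a minimal, hypothetical stand-in (not the actual Hadoop/Ozone source; class and method names here are illustrative only): the resolve-command runner swallows any Exception and returns null, so the caller must null-check rather than expect the handler thread to die on a resolver failure.

```java
import java.util.Arrays;
import java.util.List;

public class ResolveSketch {

    // Stand-in for RawScriptBasedMapping.runResolveCommand: on any failure
    // (e.g. the script is not executable), catch the Exception and return null.
    static String runResolveCommand(List<String> hosts, String scriptPath) {
        try {
            if (scriptPath == null) {
                // Simulates "Cannot run program ... error=13, Permission denied"
                throw new java.io.IOException("error=13, Permission denied");
            }
            return "/rack-1"; // placeholder for the topology script's stdout
        } catch (Exception e) {
            return null; // "handles Exception and returns null" behavior
        }
    }

    // Stand-in for the SCMNodeManager-side caller: null-check the resolver
    // result and fall back, instead of failing the registration path.
    static String nodeResolve(String host, String scriptPath) {
        String location = runResolveCommand(Arrays.asList(host), scriptPath);
        return location != null ? location : "/default-rack";
    }

    public static void main(String[] args) {
        // Resolver failure degrades to the default rack; the caller survives.
        System.out.println(nodeResolve("dn1.example.com", null));
    }
}
```

Under this contract, a permission-denied script alone should not hang the register RPC, which is consistent with the conclusion above that the root cause must lie elsewhere.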
> SCM stuck in safe mode due to exceptions in node resolver
> ---------------------------------------------------------
>
> Key: HDDS-13353
> URL: https://issues.apache.org/jira/browse/HDDS-13353
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
>
> Our cluster uses org.apache.hadoop.net.ScriptBasedMapping as our
> net.topology.node.switch.mapping.impl implementation.
> However, we encountered an issue in which, when net.topology.script.file.name
> points to a file that the SCM cannot access, the SCM does not seem to respond
> to the datanode's register request. This causes the SCM to be stuck in safe
> mode indefinitely, since datanodes cannot (re-)register and therefore cannot
> send the subsequent container reports, etc. Furthermore, the datanode simply
> reports a SocketTimeoutException, which can be misleading: the actual cause is
> that the SCM does not respond at all, not a network issue.
> Note that the root cause is not fully confirmed yet.
> The SCM exception stack looks like:
> {code:java}
> java.io.IOException: Cannot run program "/path/to/script.py" (in directory "/<redacted>"): error=13, Permission denied
>     at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:938)
>     at org.apache.hadoop.util.Shell.run(Shell.java:901)
>     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
>     at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.runResolveCommand(ScriptBasedMapping.java:273)
>     at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.resolve(ScriptBasedMapping.java:208)
>     at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:119)
>     at org.apache.hadoop.hdds.scm.node.SCMNodeManager.nodeResolve(SCMNodeManager.java:1283)
>     at org.apache.hadoop.hdds.scm.node.SCMNodeManager.register(SCMNodeManager.java:397)
>     at org.apache.hadoop.hdds.scm.server.SCMDatanodeProtocolServer.register(SCMDatanodeProtocolServer.java:231)
>     at org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolServerSideTranslatorPB.register(StorageContainerDatanodeProtocolServerSideTranslatorPB.java:85)
>     at org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolServerSideTranslatorPB.processMessage(StorageContainerDatanodeProtocolServerSideTranslatorPB.java:119)
>     at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
>     at org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolServerSideTranslatorPB.submitRequest(StorageContainerDatanodeProtocolServerSideTranslatorPB.java:92)
>     at org.apache.hadoop.hdds.protocol.proto.StorageContainerDatanodeProtocolProtos$StorageContainerDatanodeProtocolService$2.callBlockingMethod(StorageContainerDatanodeProtocolProtos.java:43636)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:491)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:611)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1146)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1300)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1193)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2031)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3587)
> Caused by: java.io.IOException: error=13, Permission denied
>     at java.lang.UNIXProcess.forkAndExec(Native Method)
>     at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
>     at java.lang.ProcessImpl.start(ProcessImpl.java:134)
>     at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
>     ... 24 more {code}
> The DN exception stack looks like:
> {code:java}
> 2025-06-30 16:26:21,795 [EndpointStateMachine task thread for <redacted>/<redacted> - 0 ] WARN org.apache.hadoop.ozone.container.common.statemachine.EndpointStateMachine: Unable to communicate to SCM server at <redacted>:9861 for past 0 seconds.
> java.net.SocketTimeoutException: Call From <redacted>/<redacted> to <redacted>:9861 failed on socket timeout exception: java.net.SocketTimeoutException: 5000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/<redacted>:38530 remote=<redacted>/<redacted>:9861]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>     at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:931)
>     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:866)
>     at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1583)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1511)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1402)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:255)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:135)
>     at com.sun.proxy.$Proxy38.submitRequest(Unknown Source)
>     at org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolClientSideTranslatorPB.submitRequest(StorageContainerDatanodeProtocolClientSideTranslatorPB.java:117)
>     at org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolClientSideTranslatorPB.sendHeartbeat(StorageContainerDatanodeProtocolClientSideTranslatorPB.java:149)
>     at org.apache.hadoop.ozone.container.common.states.endpoint.HeartbeatEndpointTask.call(HeartbeatEndpointTask.java:184)
>     at org.apache.hadoop.ozone.container.common.states.endpoint.HeartbeatEndpointTask.call(HeartbeatEndpointTask.java:86)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.SocketTimeoutException: 5000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/<redacted> remote=<redacted>/<redacted>:9861]
>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>     at java.io.FilterInputStream.read(FilterInputStream.java:133)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>     at java.io.FilterInputStream.read(FilterInputStream.java:83)
>     at java.io.FilterInputStream.read(FilterInputStream.java:83)
>     at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:524)
>     at java.io.DataInputStream.readInt(DataInputStream.java:387)
>     at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1908)
>     at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1182)
>     at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1071)
> {code}
> Ideally, the SCM should return an exception to the DN so that the DN can log
> the actual error message and possibly retry.
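That suggested fix direction could be sketched roughly as below. This is a hypothetical, self-contained illustration only (the class and method names are made up and the real SCMDatanodeProtocolServer register path differs): catch the resolver failure in the register handler and send an explicit error reply, rather than leaving the DN to time out with a misleading SocketTimeoutException.

```java
public class RegisterSketch {

    // Stand-in resolver that fails the way the permission-denied topology
    // script does in the stack trace above.
    static String nodeResolve(String host) throws java.io.IOException {
        throw new java.io.IOException(
            "Cannot run program \"/path/to/script.py\": error=13, Permission denied");
    }

    // Stand-in register handler: convert the failure into a reply the DN can
    // log, instead of letting the request go unanswered.
    static String register(String host) {
        try {
            return "registered: " + host + " at " + nodeResolve(host);
        } catch (java.io.IOException e) {
            // The DN sees the real cause and can retry, rather than a timeout.
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(register("dn1.example.com"));
    }
}
```

With a reply like this, the DN-side log would point at the misconfigured topology script instead of suggesting a network problem.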
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]