[
https://issues.apache.org/jira/browse/HDDS-13353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18003711#comment-18003711
]
Ivan Andika edited comment on HDDS-13353 at 7/8/25 9:35 AM:
------------------------------------------------------------
[~nanda] This happens in one of our internal 1.2 clusters (our internal 1.4
clusters have never encountered this issue). The Hadoop version is based on the
3.3 release.
{quote}It looks like ScriptBasedMapping$RawScriptBasedMapping.runResolveCommand
already handles Exception and returns null, which is handled in
SCMNodeManager.{quote}
Yes, this should already have been handled properly. Previously I thought it
was because runResolveCommand was throwing an exception and terminating the
SCM handler thread, but after delving into the code, that does not seem to be
the case. I haven't been able to reproduce this issue, so we can put it on the
back burner for now. I'll take a closer look when I have time.
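For reference, the contract discussed above can be sketched as follows. This is a minimal, hypothetical stand-in (not the actual Hadoop/Ozone source; class and method names here are illustrative only): the resolve-command runner swallows any Exception and returns null, so the caller must null-check rather than expect the handler thread to die on a resolver failure.

```java
import java.util.Arrays;
import java.util.List;

public class ResolveSketch {

    // Stand-in for RawScriptBasedMapping.runResolveCommand: on any failure
    // (e.g. the script is not executable), catch the Exception and return null.
    static String runResolveCommand(List<String> hosts, String scriptPath) {
        try {
            if (scriptPath == null) {
                // Simulates "Cannot run program ... error=13, Permission denied"
                throw new java.io.IOException("error=13, Permission denied");
            }
            return "/rack-1"; // placeholder for the topology script's stdout
        } catch (Exception e) {
            return null; // "handles Exception and returns null" behavior
        }
    }

    // Stand-in for the SCMNodeManager-side caller: null-check the resolver
    // result and fall back, instead of failing the registration path.
    static String nodeResolve(String host, String scriptPath) {
        String location = runResolveCommand(Arrays.asList(host), scriptPath);
        return location != null ? location : "/default-rack";
    }

    public static void main(String[] args) {
        // Resolver failure degrades to the default rack; the caller survives.
        System.out.println(nodeResolve("dn1.example.com", null));
    }
}
```

Under this contract, a permission-denied script alone should not hang the register RPC, which is consistent with the conclusion above that the root cause must lie elsewhere.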
> SCM stuck in safe mode due to exceptions in node resolver
> ---------------------------------------------------------
>
> Key: HDDS-13353
> URL: https://issues.apache.org/jira/browse/HDDS-13353
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
>
> Our cluster uses org.apache.hadoop.net.ScriptBasedMapping as our
> net.topology.node.switch.mapping.impl implementation.
> However, we encountered an issue in which, when net.topology.script.file.name
> points to a file that the SCM cannot access, the SCM does not seem to respond
> to the datanode's register request. This causes the SCM to be stuck in safe
> mode indefinitely, since datanodes cannot (re-)register and therefore cannot
> send the subsequent container reports, etc. Furthermore, the datanode simply
> reports a SocketTimeoutException, which can be misleading: the actual cause is
> that the SCM does not respond at all, not a network issue.
> Note that the root cause is not fully confirmed yet.
> The SCM exception stack looks like:
> {code:java}
> java.io.IOException: Cannot run program "/path/to/script.py" (in directory "/<redacted>"): error=13, Permission denied
>     at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:938)
>     at org.apache.hadoop.util.Shell.run(Shell.java:901)
>     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
>     at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.runResolveCommand(ScriptBasedMapping.java:273)
>     at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.resolve(ScriptBasedMapping.java:208)
>     at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:119)
>     at org.apache.hadoop.hdds.scm.node.SCMNodeManager.nodeResolve(SCMNodeManager.java:1283)
>     at org.apache.hadoop.hdds.scm.node.SCMNodeManager.register(SCMNodeManager.java:397)
>     at org.apache.hadoop.hdds.scm.server.SCMDatanodeProtocolServer.register(SCMDatanodeProtocolServer.java:231)
>     at org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolServerSideTranslatorPB.register(StorageContainerDatanodeProtocolServerSideTranslatorPB.java:85)
>     at org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolServerSideTranslatorPB.processMessage(StorageContainerDatanodeProtocolServerSideTranslatorPB.java:119)
>     at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
>     at org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolServerSideTranslatorPB.submitRequest(StorageContainerDatanodeProtocolServerSideTranslatorPB.java:92)
>     at org.apache.hadoop.hdds.protocol.proto.StorageContainerDatanodeProtocolProtos$StorageContainerDatanodeProtocolService$2.callBlockingMethod(StorageContainerDatanodeProtocolProtos.java:43636)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:491)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:611)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1146)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1300)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1193)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2031)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3587)
> Caused by: java.io.IOException: error=13, Permission denied
>     at java.lang.UNIXProcess.forkAndExec(Native Method)
>     at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
>     at java.lang.ProcessImpl.start(ProcessImpl.java:134)
>     at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
>     ... 24 more {code}
> The DN exception stack looks like:
> {code:java}
> 2025-06-30 16:26:21,795 [EndpointStateMachine task thread for <redacted>/<redacted> - 0 ] WARN org.apache.hadoop.ozone.container.common.statemachine.EndpointStateMachine: Unable to communicate to SCM server at <redacted>:9861 for past 0 seconds.
> java.net.SocketTimeoutException: Call From <redacted>/<redacted> to <redacted>:9861 failed on socket timeout exception: java.net.SocketTimeoutException: 5000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/<redacted>:38530 remote=<redacted>/<redacted>:9861]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>     at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:931)
>     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:866)
>     at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1583)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1511)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1402)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:255)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:135)
>     at com.sun.proxy.$Proxy38.submitRequest(Unknown Source)
>     at org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolClientSideTranslatorPB.submitRequest(StorageContainerDatanodeProtocolClientSideTranslatorPB.java:117)
>     at org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolClientSideTranslatorPB.sendHeartbeat(StorageContainerDatanodeProtocolClientSideTranslatorPB.java:149)
>     at org.apache.hadoop.ozone.container.common.states.endpoint.HeartbeatEndpointTask.call(HeartbeatEndpointTask.java:184)
>     at org.apache.hadoop.ozone.container.common.states.endpoint.HeartbeatEndpointTask.call(HeartbeatEndpointTask.java:86)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.SocketTimeoutException: 5000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/<redacted> remote=<redacted>/<redacted>:9861]
>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>     at java.io.FilterInputStream.read(FilterInputStream.java:133)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>     at java.io.FilterInputStream.read(FilterInputStream.java:83)
>     at java.io.FilterInputStream.read(FilterInputStream.java:83)
>     at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:524)
>     at java.io.DataInputStream.readInt(DataInputStream.java:387)
>     at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1908)
>     at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1182)
>     at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1071)
> {code}
> Ideally, the SCM should return an exception to the DN so that the DN can log
> the actual error message and possibly retry.
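That suggested fix direction could be sketched roughly as below. This is a hypothetical, self-contained illustration only (the class and method names are made up and the real SCMDatanodeProtocolServer register path differs): catch the resolver failure in the register handler and send an explicit error reply, rather than leaving the DN to time out with a misleading SocketTimeoutException.

```java
public class RegisterSketch {

    // Stand-in resolver that fails the way the permission-denied topology
    // script does in the stack trace above.
    static String nodeResolve(String host) throws java.io.IOException {
        throw new java.io.IOException(
            "Cannot run program \"/path/to/script.py\": error=13, Permission denied");
    }

    // Stand-in register handler: convert the failure into a reply the DN can
    // log, instead of letting the request go unanswered.
    static String register(String host) {
        try {
            return "registered: " + host + " at " + nodeResolve(host);
        } catch (java.io.IOException e) {
            // The DN sees the real cause and can retry, rather than a timeout.
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(register("dn1.example.com"));
    }
}
```

With a reply like this, the DN-side log would point at the misconfigured topology script instead of suggesting a network problem.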
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]