[
https://issues.apache.org/jira/browse/HDFS-11845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16044094#comment-16044094
]
Weiwei Yang edited comment on HDFS-11845 at 6/9/17 7:41 AM:
------------------------------------------------------------
This issue occurs because the RPC timeout was too small (100ms); the first RPC
call cannot complete within 100ms on my cluster. Printing the stack trace, I see
the following error on the client side, in
{{StorageContainerDatanodeProtocolClientSideTranslatorPB}}:
{noformat}
com.google.protobuf.ServiceException: java.net.SocketTimeoutException: Call From ozone1.fyre.ibm.com/172.16.165.133 to ozone1.fyre.ibm.com:9861 failed on socket timeout exception: java.net.SocketTimeoutException: 100 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.16.165.133:40202 remote=ozone1.fyre.ibm.com/172.16.165.133:9861]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:241)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:115)
    at com.sun.proxy.$Proxy76.getVersion(Unknown Source)
    at org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolClientSideTranslatorPB.getVersion(StorageContainerDatanodeProtocolClientSideTranslatorPB.java:108)
    at org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:52)
    at org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:30)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
{noformat}
This caused the warning and the output error on the server side. Increasing the
RPC timeout from 100ms to 1000ms fixed this issue. I think we should increase
the default timeout value for {{OZONE_SCM_HEARTBEAT_RPC_TIMEOUT}}; {{100ms}} is
just too aggressive. I have uploaded a simple patch to fix this.
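
To illustrate the mechanism outside of Hadoop, here is a minimal sketch using plain {{java.net}} sockets. It is not the Ozone code path: the class name {{RpcTimeoutSketch}}, the method {{callWithTimeout}}, and the 300ms server delay are all made up for illustration. It only shows that a read timeout shorter than the server's response time fails with {{SocketTimeoutException}}, while a larger one lets the same call succeed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class RpcTimeoutSketch {

    /**
     * Connects to a local server that waits serverDelayMs before replying,
     * using readTimeoutMs as the client's socket read timeout.
     * Returns true if the reply arrived in time, false on a read timeout.
     * (Hypothetical helper, not part of Hadoop or Ozone.)
     */
    public static boolean callWithTimeout(int serverDelayMs, int readTimeoutMs)
            throws IOException, InterruptedException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            Thread responder = new Thread(() -> {
                try (Socket peer = server.accept()) {
                    // Simulate a slow first call on the server side.
                    Thread.sleep(serverDelayMs);
                    OutputStream out = peer.getOutputStream();
                    out.write(1);
                    out.flush();
                } catch (IOException | InterruptedException e) {
                    // The client may already have timed out and closed the
                    // connection, so a failed write is expected here.
                }
            });
            responder.start();

            boolean ok = false;
            try (Socket client = new Socket(loopback, server.getLocalPort())) {
                // The read below fails if no byte arrives within this window.
                client.setSoTimeout(readTimeoutMs);
                InputStream in = client.getInputStream();
                ok = in.read() == 1;
            } catch (SocketTimeoutException e) {
                ok = false; // reply did not arrive within readTimeoutMs
            }
            responder.join();
            return ok;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("100 ms timeout ok?  " + callWithTimeout(300, 100));
        System.out.println("1000 ms timeout ok? " + callWithTimeout(300, 1000));
    }
}
```

With a simulated 300ms server delay, the 100ms call should time out and the 1000ms call should succeed. In the timed-out case the server then tries to write to a connection the client has already closed.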
> Ozone: Output error when DN handshakes with SCM
> -----------------------------------------------
>
> Key: HDFS-11845
> URL: https://issues.apache.org/jira/browse/HDFS-11845
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: ozone
> Reporter: Weiwei Yang
> Assignee: Weiwei Yang
> Priority: Minor
>
> When starting SCM and DN, there is always an error in the SCM log:
> {noformat}
> 17/05/17 15:19:59 WARN ipc.Server: IPC Server handler 9 on 9861, call Call#4 Retry#0 org.apache.hadoop.ozone.protocol.StorageContainerDatanodeProtocol.getVersion from 172.16.165.133:44824: output error
> 17/05/17 15:19:59 INFO ipc.Server: IPC Server handler 9 on 9861 caught an exception
> java.nio.channels.ClosedChannelException
>     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
>     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
>     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:3216)
>     at org.apache.hadoop.ipc.Server.access$1600(Server.java:135)
>     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1463)
>     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1533)
>     at org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2581)
>     at org.apache.hadoop.ipc.Server$Connection.access$300(Server.java:1605)
>     at org.apache.hadoop.ipc.Server$RpcCall.doResponse(Server.java:931)
>     at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:765)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:813)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1965)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2659)
> {noformat}