[jira] [Comment Edited] (HDFS-11845) Ozone: Output error when DN handshakes with SCM

Weiwei Yang (JIRA) Fri, 09 Jun 2017 00:42:33 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-11845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16044094#comment-16044094
 ]


Weiwei Yang edited comment on HDFS-11845 at 6/9/17 7:41 AM:
------------------------------------------------------------

This issue occurs because the RPC timeout is too small (100ms); the first RPC 
call cannot complete within 100ms on my cluster. Printing the stack trace, I see 
the following error on the client side, in 
{{StorageContainerDatanodeProtocolClientSideTranslatorPB}}:

{noformat}
com.google.protobuf.ServiceException: java.net.SocketTimeoutException: Call 
From ozone1.fyre.ibm.com/172.16.165.133 to ozone1.fyre.ibm.com:9861 failed on 
socket timeout exception: java.net.SocketTimeoutException: 100 millis timeout 
while waiting for channel to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/172.16.165.133:40202 
remote=ozone1.fyre.ibm.com/172.16.165.133:9861]; For more details see:  
http://wiki.apache.org/hadoop/SocketTimeout
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:241)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:115)
        at com.sun.proxy.$Proxy76.getVersion(Unknown Source)
        at 
org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolClientSideTranslatorPB.getVersion(StorageContainerDatanodeProtocolClientSideTranslatorPB.java:108)
        at 
org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:52)
        at 
org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:30)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{noformat}

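This client-side timeout caused the warning and the output error on the server 
side: the client gives up after 100ms, so the connection is presumably already 
closed by the time SCM tries to write its response. Increasing the RPC timeout 
from 100ms to 1000ms fixed the issue. I think we should increase the default 
value of {{OZONE_SCM_HEARTBEAT_RPC_TIMEOUT}}; {{100ms}} is just too aggressive. 
I have uploaded a simple patch to fix this.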
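
The failure mode can be illustrated outside Hadoop. The following is a minimal sketch, assuming plain {{java.net}} sockets rather than the Hadoop IPC layer; {{RpcTimeoutDemo}}, {{callWithTimeout}} and the 300ms server delay are hypothetical and chosen only for illustration. A server that needs longer than the client's read timeout makes the client fail with a {{SocketTimeoutException}} and close the connection, while a larger timeout lets the same call succeed.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class RpcTimeoutDemo {

    /**
     * Starts a one-shot loopback server that waits serverDelayMs before
     * replying, then reads the reply with a client read timeout of
     * clientTimeoutMs. Returns true if the reply arrived in time and
     * false if the client hit its read timeout.
     */
    public static boolean callWithTimeout(int serverDelayMs, int clientTimeoutMs)
            throws IOException, InterruptedException {
        try (ServerSocket server =
                 new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread handler = new Thread(() -> {
                try (Socket conn = server.accept()) {
                    conn.getInputStream().read();   // wait for the request byte
                    Thread.sleep(serverDelayMs);    // simulate a slow first call
                    OutputStream out = conn.getOutputStream();
                    out.write(42);                  // may fail if the client has left
                    out.flush();
                } catch (IOException | InterruptedException e) {
                    // The client may already have closed the connection.
                }
            });
            handler.start();

            boolean ok;
            try (Socket client = new Socket(InetAddress.getLoopbackAddress(),
                                            server.getLocalPort())) {
                client.setSoTimeout(clientTimeoutMs);   // read timeout in milliseconds
                client.getOutputStream().write(1);
                client.getOutputStream().flush();
                ok = client.getInputStream().read() == 42;
            } catch (SocketTimeoutException e) {
                ok = false;                             // reply did not arrive in time
            }
            handler.join();
            return ok;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("timeout=100ms:  reply received = " + callWithTimeout(300, 100));
        System.out.println("timeout=1000ms: reply received = " + callWithTimeout(300, 1000));
    }
}
```

With the 300ms delay assumed here, the 100ms call should time out and the 1000ms call should receive its reply, which mirrors the effect of raising the timeout described above.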


> Ozone: Output error when DN handshakes with SCM
> -----------------------------------------------
>
>                 Key: HDFS-11845
>                 URL: https://issues.apache.org/jira/browse/HDFS-11845
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ozone
>            Reporter: Weiwei Yang
>            Assignee: Weiwei Yang
>            Priority: Minor
>
> When starting SCM and DN, there is always an error in the SCM log:
> {noformat}
> 17/05/17 15:19:59 WARN ipc.Server: IPC Server handler 9 on 9861, call Call#4 
> Retry#0 
> org.apache.hadoop.ozone.protocol.StorageContainerDatanodeProtocol.getVersion 
> from 172.16.165.133:44824: output error
> 17/05/17 15:19:59 INFO ipc.Server: IPC Server handler 9 on 9861 caught an 
> exception
> java.nio.channels.ClosedChannelException
>       at 
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
>       at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
>       at org.apache.hadoop.ipc.Server.channelWrite(Server.java:3216)
>       at org.apache.hadoop.ipc.Server.access$1600(Server.java:135)
>       at 
> org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1463)
>       at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1533)
>       at 
> org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2581)
>       at org.apache.hadoop.ipc.Server$Connection.access$300(Server.java:1605)
>       at org.apache.hadoop.ipc.Server$RpcCall.doResponse(Server.java:931)
>       at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:765)
>       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:813)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1965)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2659)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
