[ 
https://issues.apache.org/jira/browse/HDDS-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792181#comment-17792181
 ] 

Tsz-wo Sze commented on HDDS-9814:
----------------------------------

[~adoroszlai], thanks a lot for testing it!  The server shutdown happens first 
so that the clients try to reconnect.  We should shut down the clients before 
the servers.  Also, it should also shut down the executors

> NettyClientStreamRpc trying to establish connection during shutdown
> -------------------------------------------------------------------
>
>                 Key: HDDS-9814
>                 URL: https://issues.apache.org/jira/browse/HDDS-9814
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Attila Doroszlai
>            Priority: Major
>         Attachments: 2023-05-05T05-46-34_028-jvmRun1.dump, 
> TestOzoneFileSystemWithStreaming-output-RATIS-1948-ca54cdaf.txt, 
> TestOzoneFileSystemWithStreaming-output-pass.txt, 
> org.apache.hadoop.fs.ozone.TestOzoneFileSystemWithStreaming-output.txt
>
>
> {{TestOzoneFileSystemWithStreaming}} logs lots of {{"connection refused"}} 
> and {{"schedule reconnecting"}} messages during shutdown.
> {code:title=sample}
> 2023-11-30 18:39:00,039 [main] INFO  ozone.MiniOzoneClusterImpl 
> (MiniOzoneClusterImpl.java:shutdown(449)) - Shutting down the Mini Ozone 
> Cluster
> ...
> 2023-11-30 18:39:01,126 [Thread-1140] DEBUG client.NettyClientStreamRpc 
> (NettyClientStreamRpc.java:close(452)) - flush? true
> 2023-11-30 18:39:01,122 [Thread-1133] WARN  util.PeerProxyMap 
> (PeerProxyMap.java:closeProxy(170)) - 
> 2ad30381-d2ef-4801-b4d0-3965e2d4accc-NettyServerStreamRpc: Failed to close 
> proxy for peer 3c309c61-fe70-460a-bb5b-539134755074|127.0.0.1:15014, proxy 
> class: class org.apache.ratis.client.impl.DataStreamClientImpl
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  finishConnect(..) failed: Connection refused: /127.0.0.1:15015
> Caused by: java.net.ConnectException: finishConnect(..) failed: Connection 
> refused
>       at 
> org.apache.ratis.thirdparty.io.netty.channel.unix.Errors.newConnectException0(Errors.java:166)
>       at 
> org.apache.ratis.thirdparty.io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:131)
>       at 
> org.apache.ratis.thirdparty.io.netty.channel.unix.Socket.finishConnect(Socket.java:359)
>       at 
> org.apache.ratis.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710)
>       at 
> org.apache.ratis.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:687)
>       at 
> org.apache.ratis.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567)
>       at 
> org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:499)
>       at 
> org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:407)
>       at 
> org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
>       at 
> org.apache.ratis.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>       at java.lang.Thread.run(Thread.java:750)
> {code}
> This is during a normal, successful test run (logs: 
> [^TestOzoneFileSystemWithStreaming-output-pass.txt]).
> On a related note, test run occasionally times out during shutdown:
> {code}
> [INFO] Running org.apache.hadoop.fs.ozone.TestOzoneFileSystemWithStreaming
> ...
> [WARNING] Tests run: 1030, Failures: 0, Errors: 0, Skipped: 24
> ...
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M5:test (default-test) 
> on project ozone-integration-test: There was a timeout in the fork -> [Help 1]
> {code}
> (logs: 
> [^org.apache.hadoop.fs.ozone.TestOzoneFileSystemWithStreaming-output.txt],  
> [^2023-05-05T05-46-34_028-jvmRun1.dump])
> More recently I noticed job runner timeout.  Unfortunately we don't have logs 
> for that:
> {code}
> Thu, 30 Nov 2023 13:38:47 GMT
> [INFO] Running org.apache.hadoop.fs.ozone.TestOzoneFileSystemWithStreaming
> Thu, 30 Nov 2023 15:58:17 GMT
> Error: The operation was canceled.
> {code}
> CC [~szetszwo]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to