[
https://issues.apache.org/jira/browse/RATIS-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17294732#comment-17294732
]
Chris Nauroth commented on RATIS-1147:
--------------------------------------
I've been investigating this. Here are my observations.
* This reproduces consistently when running the test on Mac.
* I cannot reproduce when running on Linux (the Docker container started by
start-build-env.sh).
* It seems to be triggered by tests allocating a large number of servers (e.g.
{{TestMultiRaftGroup}} starting 15 peers).
* After debugging, I've traced it to the
{{NetUtils#createLocalServerAddress()}} method, which operates by binding a
socket to an ephemeral port identified by the OS, then immediately closing it,
and returning the resulting address for the mini-cluster to use. (This
technique always has a risk of a time-of-check/time-of-use race condition, but
it's the best we can do, and it should be rare.) Under tests like
{{TestMultiRaftGroup}} allocating a large volume of ports, this method
occasionally fails to close a socket silently. {{netstat}} continues to report
a socket bound on the address in {{LISTEN}} state, even after the code closed
it.
* This is very strange, so it made me suspect a Mac-specific JDK bug. I traced
all the way through the JDK code and confirmed that it's going all the way down
to the native {{socketClose0}} function successfully. Yet, somehow, the socket
remains! I don't see significant differences between the Linux and Mac native
code implementation (linux_close.c vs. bsd_close.c in the OpenJDK tree). I
couldn't find any relevant OpenJDK bug reports.
The good news though is that I found a workaround. If we bind explicitly to the
loopback interface (instead of wildcard), then I no longer see these sockets
leaking. This must be some kind of edge case specific to Mac and sockets bound
to wildcard. Binding to loopback is probably the better thing to do anyway,
because there is no reason for these test sockets to be externally accessible.
I submitted a pull request. With this patch, I have all tests passing on Mac.
> TestMultiRaftGroup on MiniRaftClusterWithNetty is failing consistently
> ----------------------------------------------------------------------
>
> Key: RATIS-1147
> URL: https://issues.apache.org/jira/browse/RATIS-1147
> Project: Ratis
> Issue Type: Bug
> Affects Versions: 2.0.0
> Reporter: Hanisha Koneru
> Assignee: Chris Nauroth
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The TestMultiRaftGroup test on MiniRaftClusterWithNetty fails with following
> exception:
> {code:java}
> [INFO] Running org.apache.ratis.TestMultiRaftGroup
> [ERROR] Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed:
> 19.134 s <<< FAILURE! - in org.apache.ratis.TestMultiRaftGroup
> [ERROR] testMultiRaftGroup[2](org.apache.ratis.TestMultiRaftGroup) Time
> elapsed: 1.136 s <<< ERROR!
> java.io.IOException: s11: Failed to start NettyRpcService
> at
> org.apache.ratis.netty.server.NettyRpcService.startImpl(NettyRpcService.java:141)
> at
> org.apache.ratis.util.LifeCycle.startAndTransition(LifeCycle.java:222)
> at
> org.apache.ratis.server.impl.RaftServerRpcWithProxy.start(RaftServerRpcWithProxy.java:70)
> at
> org.apache.ratis.server.impl.RaftServerProxy.lambda$start$3(RaftServerProxy.java:323)
> at
> org.apache.ratis.util.LifeCycle.startAndTransition(LifeCycle.java:222)
> at
> org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:321)
> at
> org.apache.ratis.MiniRaftCluster.startServers(MiniRaftCluster.java:426)
> at org.apache.ratis.MiniRaftCluster.start(MiniRaftCluster.java:294)
> at
> org.apache.ratis.server.impl.GroupManagementBaseTest.runMultiGroupTest(GroupManagementBaseTest.java:256)
> at
> org.apache.ratis.TestMultiRaftGroup.runTestMultiRaftGroup(TestMultiRaftGroup.java:73)
> at
> org.apache.ratis.TestMultiRaftGroup.runTestMultiRaftGroup(TestMultiRaftGroup.java:59)
> at
> org.apache.ratis.TestMultiRaftGroup.testMultiRaftGroup(TestMultiRaftGroup.java:55)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> at
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> at
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> at
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
> at
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.BindException: Address already in use
> at sun.nio.ch.Net.bind0(Native Method)
> at sun.nio.ch.Net.bind(Net.java:433)
> at sun.nio.ch.Net.bind(Net.java:425)
> at
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
> at
> org.apache.ratis.thirdparty.io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134)
> at
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:550)
> at
> org.apache.ratis.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
> at
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:506)
> at
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:491)
> at
> org.apache.ratis.thirdparty.io.netty.handler.logging.LoggingHandler.bind(LoggingHandler.java:221)
> at
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:506)
> at
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:491)
> at
> org.apache.ratis.thirdparty.io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
> at
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannel.bind(AbstractChannel.java:248)
> at
> org.apache.ratis.thirdparty.io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
> at
> org.apache.ratis.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
> at
> org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
> at
> org.apache.ratis.thirdparty.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
> at
> org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
> at
> org.apache.ratis.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> at
> org.apache.ratis.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> ... 1 more
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)