[
https://issues.apache.org/jira/browse/ZOOKEEPER-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Nauroth updated ZOOKEEPER-2183:
-------------------------------------
Attachment: ZOOKEEPER-2183.003.patch
There wasn't anything wrong with the code of {{LocalSessionRequestTest}}. I
had assumed that concurrent test processes wouldn't bind and release
ephemeral ports often enough for a port to be recycled quickly and cause a
collision. Unfortunately, that assumption turned out to be false. I
analyzed the console output and saw a few instances of the same port getting
handed out at roughly the same time. This is made worse by the fact that
several tests stop and restart a server while continuing to use the same port,
which leaves it unbound for a few seconds. (This is by design, for covering
things like client reconnect logic.) During that downtime, the OS sees it as a
free ephemeral port, and we might hand it out elsewhere.
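To make the race concrete, here is a minimal sketch of the general "bind to port 0, read the port number, release" pattern. It is not taken from any of the attached patches; the class and method names are hypothetical and only illustrate why the returned port is not reserved for the caller.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ServerSocket;

/** Hypothetical illustration only; not code from the ZooKeeper patches. */
class EphemeralPortSketch {
    /** Asks the OS for a free ephemeral port, then releases it before returning. */
    static int pickFreePort() {
        // Port 0 tells the OS to choose any free ephemeral port.
        try (ServerSocket socket = new ServerSocket(0)) {
            return socket.getLocalPort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Once the socket is closed, nothing reserves the port for the caller.
        // Another process, or a test that has temporarily stopped its server,
        // can be given the same number before the caller binds to it.
    }
}
```

The window between closing the probe socket and the test's real bind is where two concurrent processes can end up with the same port.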
I'm attaching patch v003 with a different port assignment algorithm. This is
like the old monotonically increasing strategy, but it's aware of how many
total test threads are running concurrently and of the thread index (e.g.
3 of 8) of the current test process. It uses that information to split the
available ports into N disjoint ranges, and each concurrent test process only
assigns ports from its own range. For extra resiliency, we still try to bind
to the port and keep retrying until we find a good one. Unfortunately, the
total threads and the thread index are not accessible through easy APIs, so we
need to jump through some hoops to get them. (See comments in the code for
details.)
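The range-splitting idea can be sketched roughly as follows. This is not the code from the attached patch: the class name, the global port bounds, and the constructor that takes the thread index and total thread count directly are all assumptions made for illustration. How the real patch discovers those two values is not shown here.

```java
import java.io.IOException;
import java.net.ServerSocket;

/** Hypothetical sketch of per-process port ranges; not the patch itself. */
class PortRangeSketch {
    // Assumed overall range of ports available to tests (illustrative values).
    static final int GLOBAL_MIN = 11221;
    static final int GLOBAL_MAX = 32767;

    private final int min;  // first port of this process's range, inclusive
    private final int max;  // last port of this process's range, inclusive
    private int next;       // next candidate, monotonically increasing

    /** threadIndex is 1-based, e.g. 3 of 8. */
    PortRangeSketch(int threadIndex, int totalThreads) {
        if (totalThreads < 1 || threadIndex < 1 || threadIndex > totalThreads) {
            throw new IllegalArgumentException("invalid thread index or count");
        }
        // Split the global range into totalThreads equally sized, disjoint slices.
        int size = (GLOBAL_MAX - GLOBAL_MIN + 1) / totalThreads;
        this.min = GLOBAL_MIN + (threadIndex - 1) * size;
        this.max = this.min + size - 1;
        this.next = this.min;
    }

    int min() { return min; }
    int max() { return max; }

    /** Returns the next port in this range that can currently be bound. */
    synchronized int unique() {
        for (int attempts = 0; attempts <= max - min; attempts++) {
            int candidate = next;
            next = (next == max) ? min : next + 1;  // wrap within our own range
            // Probe the candidate by binding; skip it if the bind fails.
            try (ServerSocket probe = new ServerSocket(candidate)) {
                return candidate;
            } catch (IOException e) {
                // Port is in use or otherwise unavailable; try the next one.
            }
        }
        throw new IllegalStateException("no free port in " + min + "-" + max);
    }

    public static void main(String[] args) {
        PortRangeSketch ports = new PortRangeSketch(3, 8);
        System.out.println("range " + ports.min() + "-" + ports.max()
                + ", assigned " + ports.unique());
    }
}
```

Because each process only walks its own slice, two concurrent processes should never propose the same port to each other, and the bind probe only has to guard against unrelated software on the host.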
I'm still getting some intermittent failures in reconfig related tests, like
{{ReconfigRecoveryTest}} and {{StandaloneDisabledTest}}. The problems don't
appear to be related to network ports. I see stuff like this in the logs:
{code}
[junit] 2015-05-12 15:42:54,246 [myid:] - ERROR [Thread-7:QuorumPeerTestBase$MainThread@243] - unexpected exception in run
[junit] org.apache.zookeeper.server.quorum.QuorumPeerConfig$ConfigException: Error processing /home/cnauroth/git/zookeeper/build/test/tmp/test940684720756473639.junit.dir/zoo.cfg.dynamic
[junit]     at org.apache.zookeeper.server.quorum.QuorumPeerConfig.parse(QuorumPeerConfig.java:162)
[junit]     at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:110)
[junit]     at org.apache.zookeeper.server.quorum.QuorumPeerTestBase$MainThread.run(QuorumPeerTestBase.java:240)
[junit]     at java.lang.Thread.run(Thread.java:745)
[junit] Caused by: java.lang.IllegalArgumentException: standaloneEnabled = false then number of participants should be >0
[junit]     at org.apache.zookeeper.server.quorum.QuorumPeerConfig.parseDynamicConfig(QuorumPeerConfig.java:537)
[junit]     at org.apache.zookeeper.server.quorum.QuorumPeerConfig.setupQuorumPeerConfig(QuorumPeerConfig.java:504)
[junit]     at org.apache.zookeeper.server.quorum.QuorumPeerConfig.parse(QuorumPeerConfig.java:157)
[junit]     ... 3 more
{code}
That makes me think the reconfig tests are hitting a conflict on some other
shared resource in the runtime environment, like maybe the config directories.
I'm still investigating.
> Change test port assignments to improve uniqueness of ports for multiple
> concurrent test processes on the same host.
> --------------------------------------------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-2183
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2183
> Project: ZooKeeper
> Issue Type: Improvement
> Components: tests
> Reporter: Chris Nauroth
> Assignee: Chris Nauroth
> Attachments: ZOOKEEPER-2183.001.patch, ZOOKEEPER-2183.002.patch,
> ZOOKEEPER-2183.003.patch, threads-change.patch
>
>
> Tests use {{PortAssignment#unique}} for assignment of the ports to bind
> during tests. Currently, this method works by using a monotonically
> increasing counter from a static starting point. Generally, this is
> sufficient to achieve uniqueness within a single JVM process, but it does not
> achieve uniqueness across multiple processes on the same host. This can
> cause tests to get bind errors if there are multiple pre-commit jobs running
> concurrently on the same Jenkins host. This also prevents running tests in
> parallel to improve the speed of pre-commit runs.