GitHub user neykov opened a pull request:
https://github.com/apache/incubator-brooklyn/pull/662
Fix for the "unable to find a free port" Jenkins error
The fix
----------
If the free port check fails for a specific interface, do an additional
check if able to bind to an OS picked port on the same IP. If it fails, then
there's something wrong with the interface, not just a port already bound error.
Chasing the problem
-----------------
Used https://github.com/apache/incubator-brooklyn/pull/657 as a test bed to
figure out what could be wrong when the error occurs. After lost of rebuilds
(sorry if you get the jenkins emails) got lucky and hit a case wherre
`isPortAvailable` [was failing
[1]](https://builds.apache.org/job/incubator-brooklyn-pull-requests/1238/console):
```
java.net.SocketException: No such device
at java.net.PlainSocketImpl.socketBind(Native Method) ~[na:1.6.0_45]
```
when trying to bind to IP `fe80:0:0:0:e833:4cff:fed4:1108%181`. The test
was being executed continuously for half an hour during which the error was
consistent (because `invocationCount=1000` and `nextAvailablePort` enumeration
all ports up to 65535). This shows that it's not just a momentary glitch but a
persistent state of the network stack.
Digging in
-------------------
As the message is not really helpful and it was not clear what is causing
it, I went to the Java sources to find out. As seen from the above stack trace,
Java eventually calls into the POSIX API when dealing with sockets. [This
[2]](http://hg.openjdk.java.net/jdk6/jdk6/jdk/file/ca2c9f498b70/src/solaris/native/java/net/PlainSocketImpl.c#l643)
is where the exception is thrown from. It will essentially use `strerror`
after `bind` failure to build the exception message (see
[[3]](http://hg.openjdk.java.net/jdk6/jdk6/jdk/file/ca2c9f498b70/src/share/native/common/jni_util.c#l160),
[[4]](http://hg.openjdk.java.net/jdk6/jdk6/hotspot/file/1f963de41717/src/share/vm/prims/jvm.cpp#l658),
[[5]](http://hg.openjdk.java.net/jdk6/jdk6/hotspot/file/1f963de41717/src/os/linux/vm/os_linux.cpp#l1608)).
Now we are getting somewhere, "No such device" [is equivalent
[6]](https://github.com/lattera/glibc/blob/a7a93d5086103f52367d3e9776976eb0b0bc6c7b/sysdeps/gnu/errlist.c#L220)
to `ENODEV` error, which is returned from `bind`. But what does it actually
mean? From the [man page on IPv6
[7]](http://man7.org/linux/man-pages/man7/ipv6.7.html):
```
ENODEV The user tried to bind(2) to a link-local IPv6 address, but
the sin6_scope_id in the supplied sockaddr_in6 structure is
not a valid interface index.
```
And indeed [this is the case
[8]](https://github.com/torvalds/linux/blob/master/net/ipv6/af_inet6.c#L335).
Looking for what `sin6_scope_id` means I found out that local-link IPv6
addresses (beginning with `fe80::`) have an additional identifier to
distinguish between IPs on different interfaces. All link-local IPs of the
interfaces live in the same subnet which makes it impossible to distinguish to
which interface it belongs just by the IP. To solve this one can add the
interface index (or on some newer distributions the interface name) separated
by `%` to the IP. Looking at the failing IP, it has this identifier -
`fe80:0:0:0:e833:4cff:fed4:1108%181` - `181`, which means that it belongs to
interface with index 181, unusually high.
Now it's time to find out where does Java get the IP+index from. On one
side attempts to bind to the IP fail, but on the other Java includes it in the
interface list. We get the interface list and assigned IPs from
`NetworkInterface.getNetworkInterfaces()`. Going back to the jdk sources
eventually (and not to bog you down with even more details) we get to [this
piece of code
[9]](http://hg.openjdk.java.net/jdk6/jdk6/jdk/file/ca2c9f498b70/src/solaris/native/java/net/NetworkInterface.c#l1101).
It's parsing the contents of `/proc/net/if_inet6` to get the list. On my
system it looks like this:
```
00000000000000000000000000000001 01 80 10 80 lo
fe800000000000000a0027fffeb8efed 02 40 20 80 eth0
fe800000000000000a0027fffea3192c 03 40 20 80 eth1
```
So `/proc/net/if_inet6` contains a record for
`fe80:0:0:0:e833:4cff:fed4:1108%181`, but `bind` says that interface `181`
doesn't exist (and this goes for at least half an hour, before the test is
aborted). Weird. Could be some misconfiguration? Probably caused by some other
tests running in parallel doing something to the system.
Comparing different build runs I noticed something interesting. In [some of
the logs, i.e.
[10]](https://builds.apache.org/job/incubator-brooklyn-pull-requests/1247/console)
`ifconfig` returned (amongst the standard stuff):
```
docker0 Link encap:Ethernet HWaddr 56:84:7a:fe:97:99
inet addr:172.17.42.1 Bcast:0.0.0.0 Mask:255.255.0.0
inet6 addr: fe80::5484:7aff:fefe:9799/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:183134 errors:0 dropped:0 overruns:0 frame:0
TX packets:324280 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:39149132 (39.1 MB) TX bytes:277258190 (277.2 MB)
...
vethc78eeb0 Link encap:Ethernet HWaddr d6:88:a9:b6:be:5d
inet6 addr: fe80::d488:a9ff:feb6:be5d/64 Scope:Link
UP BROADCAST RUNNING MTU:1500 Metric:1
RX packets:40283 errors:0 dropped:0 overruns:0 frame:0
TX packets:193760 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:3764512 (3.7 MB) TX bytes:256855395 (256.8 MB)
```
`vethc78eeb0` looks interesting - like it's being auto generated. This
could explain the large interface index we have. This run was successful
though, so something more going on there. Time to find out what vethc78eeb0
means and who is creating it. The docker interface gives some hints - LXC has a
type of network called [`veth` (Virtual Ethernet)
[11]](https://wiki.debian.org/LXC/SimpleBridge). One can create a pair of
interface with LXC and the convention is to call them vethXXX. Apparently the
interface above has the id `c78eeb0` auto generated. So looks like some test is
creating a ton of virtual interfaces and then destroying them (or more likely
putting at least one in a strange state). There was a mesos test bing run
around the same time which supports this.
I wanted to replicate the exact same error locally to be sure that the
changes will solve the problem. Followed [this tutorial
[12]](http://blog.scottlowe.org/2013/09/04/introducing-linux-network-namespaces/)
to create a veth pair:
```
sudo ip netns add blue
sudo ip link add veth0 type veth peer name veth1
sudo ip link set veth1 netns blue
sudo ip netns exec blue ifconfig veth1 10.1.1.1/24 up
```
Tried to give it an IPv6 address as well, but eventually it generated its
own link-local one. Binding to the new IP works great, everything in order in
`/proc/net/if_inet6`. Then I let the interface go on an up-down loop with the
following and started the `isPortAvailable` test.
```
while true; do sudo ifconfig veth0 down; sleep 5; sudo ifconfig veth0 up;
sleep 5; done
```
I started getting `java.net.BindException: Cannot assign requested address`
exceptions for a second after the interface is upped. Once it settles the test
is working, but there is a short period after up where the interface exists
with an IP, but denies binds. It's not quite what I was after, but close enough
for tests.
First tried to work around it by using `NetworkInterface.isUp` but didn't
help. That's why I settled on trying a second bind but this time letting the OS
choose a free port. If the normal bind fails and the second one succeeds, then
obviously the port is taken. But if both fail then there's something fishy
going on.
And finally one caveat taken from the commit comments:
```
* If there's port exhaustion on a single interface we won't catch it
* and declare the port is free. Doesn't matter really because the
* subsequent bind of the caller will fail anyway and nextAvailablePort
* wouldn't be able to find a free one either.
```
PS: Looks like this commit
https://github.com/pushtechnology/incubator-brooklyn/commit/34439f4a2ac0c13781a8485c4d6d308fb744de47
removed part of the check wrongly assuming it's unnecessary. Not sure if worth
adding back...
[1]
https://builds.apache.org/job/incubator-brooklyn-pull-requests/1238/console
(will be gone soon)
[2]
http://hg.openjdk.java.net/jdk6/jdk6/jdk/file/ca2c9f498b70/src/solaris/native/java/net/PlainSocketImpl.c#l643
[3]
http://hg.openjdk.java.net/jdk6/jdk6/jdk/file/ca2c9f498b70/src/share/native/common/jni_util.c#l160
[4]
http://hg.openjdk.java.net/jdk6/jdk6/hotspot/file/1f963de41717/src/share/vm/prims/jvm.cpp#l658
[5]
http://hg.openjdk.java.net/jdk6/jdk6/hotspot/file/1f963de41717/src/os/linux/vm/os_linux.cpp#l1608
[6]
https://github.com/lattera/glibc/blob/a7a93d5086103f52367d3e9776976eb0b0bc6c7b/sysdeps/gnu/errlist.c#L220
[7] http://man7.org/linux/man-pages/man7/ipv6.7.html
[8] https://github.com/torvalds/linux/blob/master/net/ipv6/af_inet6.c#L335
[9]
http://hg.openjdk.java.net/jdk6/jdk6/jdk/file/ca2c9f498b70/src/solaris/native/java/net/NetworkInterface.c#l1101
[10]
https://builds.apache.org/job/incubator-brooklyn-pull-requests/1247/console
[11] https://wiki.debian.org/LXC/SimpleBridge
[12]
http://blog.scottlowe.org/2013/09/04/introducing-linux-network-namespaces/
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/neykov/incubator-brooklyn
fix-port-not-available
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-brooklyn/pull/662.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #662
----
commit 72571f2c65bf2cca16520cc525e6b61812e51158
Author: Svetoslav Neykov <[email protected]>
Date: 2015-05-27T11:51:27Z
Fix for the "unable to find a free port" error that has been plaguing the
jenkins builds
If the free port check fails for a specific interface, do an additional
check if able to bind to an OS picked port on the same IP. If it fails, then
there's something wrong with the interface, not just a port already bound error.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---