GitHub user neykov opened a pull request:

    https://github.com/apache/incubator-brooklyn/pull/662

    Fix for the "unable to find a free port" Jenkins error

    The fix
    ----------
    If the free port check fails for a specific interface, do an additional 
check if able to bind to an OS picked port on the same IP. If it fails, then 
there's something wrong with the interface, not just a port already bound error.
    
    Chasing the problem
    -----------------
    Used https://github.com/apache/incubator-brooklyn/pull/657 as a test bed to 
figure out what could be wrong when the error occurs. After lost of rebuilds 
(sorry if you get the jenkins emails) got lucky and hit a case wherre 
`isPortAvailable` [was failing 
[1]](https://builds.apache.org/job/incubator-brooklyn-pull-requests/1238/console):
    ```
    java.net.SocketException: No such device
        at java.net.PlainSocketImpl.socketBind(Native Method) ~[na:1.6.0_45]
    ```
    when trying to bind to IP `fe80:0:0:0:e833:4cff:fed4:1108%181`.  The test 
was being executed continuously for half an hour during which the error was 
consistent (because `invocationCount=1000` and `nextAvailablePort` enumeration 
all ports up to 65535). This shows that it's not just a momentary glitch but a 
persistent state of the network stack.
    
    Digging in
    -------------------
    As the message is not really helpful and it was not clear what is causing 
it, I went to the Java sources to find out. As seen from the above stack trace, 
Java eventually calls into the POSIX API when dealing with sockets. [This 
[2]](http://hg.openjdk.java.net/jdk6/jdk6/jdk/file/ca2c9f498b70/src/solaris/native/java/net/PlainSocketImpl.c#l643)
 is where the exception is thrown from. It will essentially use `strerror` 
after `bind` failure to build the exception message (see 
[[3]](http://hg.openjdk.java.net/jdk6/jdk6/jdk/file/ca2c9f498b70/src/share/native/common/jni_util.c#l160),
 
[[4]](http://hg.openjdk.java.net/jdk6/jdk6/hotspot/file/1f963de41717/src/share/vm/prims/jvm.cpp#l658),
  
[[5]](http://hg.openjdk.java.net/jdk6/jdk6/hotspot/file/1f963de41717/src/os/linux/vm/os_linux.cpp#l1608)).
    
    Now we are getting somewhere, "No such device" [is equivalent 
[6]](https://github.com/lattera/glibc/blob/a7a93d5086103f52367d3e9776976eb0b0bc6c7b/sysdeps/gnu/errlist.c#L220)
 to `ENODEV` error, which is returned from `bind`. But what does it actually 
mean? From the [man page on IPv6 
[7]](http://man7.org/linux/man-pages/man7/ipv6.7.html):
    ```
           ENODEV The user tried to bind(2) to a link-local IPv6 address, but
                  the sin6_scope_id in the supplied sockaddr_in6 structure is
                  not a valid interface index.
    ```
    And indeed [this is the case 
[8]](https://github.com/torvalds/linux/blob/master/net/ipv6/af_inet6.c#L335). 
Looking for what `sin6_scope_id` means I found out that local-link IPv6 
addresses (beginning with `fe80::`) have an additional identifier to 
distinguish between IPs on different interfaces. All link-local IPs of the 
interfaces live in the same subnet which makes it impossible to distinguish to 
which interface it belongs just by the IP. To solve this one can add the 
interface index (or on some newer distributions the interface name) separated 
by `%` to the IP. Looking at the failing IP, it has this identifier - 
`fe80:0:0:0:e833:4cff:fed4:1108%181` - `181`, which means that it belongs to 
interface with index 181, unusually high.
    
    Now it's time to find out where does Java get the IP+index from. On one 
side attempts to bind to the IP fail, but on the other Java includes it in the 
interface list. We get the interface list and assigned IPs from 
`NetworkInterface.getNetworkInterfaces()`. Going back to the jdk sources 
eventually (and not to bog you down with even more details) we get to [this 
piece of code 
[9]](http://hg.openjdk.java.net/jdk6/jdk6/jdk/file/ca2c9f498b70/src/solaris/native/java/net/NetworkInterface.c#l1101).
 It's parsing the contents of `/proc/net/if_inet6` to get the list. On my 
system it looks like this:
    ```
    00000000000000000000000000000001 01 80 10 80       lo
    fe800000000000000a0027fffeb8efed 02 40 20 80     eth0
    fe800000000000000a0027fffea3192c 03 40 20 80     eth1
    ```
    
    So `/proc/net/if_inet6` contains a record for 
`fe80:0:0:0:e833:4cff:fed4:1108%181`, but `bind` says that interface `181` 
doesn't exist (and this goes for at least half an hour, before the test is 
aborted). Weird. Could be some misconfiguration? Probably caused by some other 
tests running in parallel doing something to the system.
    
    Comparing different build runs I noticed something interesting. In [some of 
the logs, i.e. 
[10]](https://builds.apache.org/job/incubator-brooklyn-pull-requests/1247/console)
 `ifconfig` returned (amongst the standard stuff):
    
    ```
    docker0   Link encap:Ethernet  HWaddr 56:84:7a:fe:97:99  
              inet addr:172.17.42.1  Bcast:0.0.0.0  Mask:255.255.0.0
              inet6 addr: fe80::5484:7aff:fefe:9799/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:183134 errors:0 dropped:0 overruns:0 frame:0
              TX packets:324280 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:0 
              RX bytes:39149132 (39.1 MB)  TX bytes:277258190 (277.2 MB)
    
    ...
    
    vethc78eeb0 Link encap:Ethernet  HWaddr d6:88:a9:b6:be:5d  
              inet6 addr: fe80::d488:a9ff:feb6:be5d/64 Scope:Link
              UP BROADCAST RUNNING  MTU:1500  Metric:1
              RX packets:40283 errors:0 dropped:0 overruns:0 frame:0
              TX packets:193760 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:0 
              RX bytes:3764512 (3.7 MB)  TX bytes:256855395 (256.8 MB)
    ```
    
    
    `vethc78eeb0` looks interesting - like it's being auto generated. This 
could explain the large interface index we have. This run was successful 
though, so something more going on there. Time to find out what vethc78eeb0 
means and who is creating it. The docker interface gives some hints - LXC has a 
type of network called [`veth` (Virtual Ethernet) 
[11]](https://wiki.debian.org/LXC/SimpleBridge). One can create a pair of 
interface with LXC and the convention is to call them vethXXX. Apparently the 
interface above has the id `c78eeb0` auto generated. So looks like some test is 
creating a ton of virtual interfaces and then destroying them (or more likely 
putting at least one in a strange state). There was a mesos test bing run 
around the same time which supports this.
    
    I wanted to replicate the exact same error locally to be sure that the 
changes will solve the problem. Followed [this tutorial 
[12]](http://blog.scottlowe.org/2013/09/04/introducing-linux-network-namespaces/)
  to create a veth pair:
    ```
    sudo ip netns add blue
    sudo ip link add veth0 type veth peer name veth1
    sudo ip link set veth1 netns blue
    sudo ip netns exec blue ifconfig veth1 10.1.1.1/24 up
    ```
    
    Tried to give it an IPv6 address as well, but eventually it generated its 
own link-local one. Binding to the new IP works great, everything in order in 
`/proc/net/if_inet6`. Then I let the interface go on an up-down loop with the 
following and started the `isPortAvailable` test. 
    ```
    while true; do sudo ifconfig veth0 down; sleep 5; sudo ifconfig veth0 up; 
sleep 5; done
    ```
    
    I started getting `java.net.BindException: Cannot assign requested address` 
exceptions for a second after the interface is upped. Once it settles the test 
is working, but there is a short period after up where the interface exists 
with an IP, but denies binds. It's not quite what I was after, but close enough 
for tests.
    
    First tried to work around it by using `NetworkInterface.isUp` but didn't 
help. That's why I settled on trying a second bind but this time letting the OS 
choose a free port. If the normal bind fails and the second one succeeds, then 
obviously the port is taken. But if both fail then there's something fishy 
going on.
    
    And finally one caveat taken from the commit comments:
    ```
         * If there's port exhaustion on a single interface we won't catch it
         * and declare the port is free. Doesn't matter really because the
         * subsequent bind of the caller will fail anyway and nextAvailablePort
         * wouldn't be able to find a free one either.
    ```
    
    
    PS: Looks like this commit 
https://github.com/pushtechnology/incubator-brooklyn/commit/34439f4a2ac0c13781a8485c4d6d308fb744de47
 removed part of the check wrongly assuming it's unnecessary. Not sure if worth 
adding back...
    
    
    [1] 
https://builds.apache.org/job/incubator-brooklyn-pull-requests/1238/console 
(will be gone soon)
    [2] 
http://hg.openjdk.java.net/jdk6/jdk6/jdk/file/ca2c9f498b70/src/solaris/native/java/net/PlainSocketImpl.c#l643
    [3] 
http://hg.openjdk.java.net/jdk6/jdk6/jdk/file/ca2c9f498b70/src/share/native/common/jni_util.c#l160
    [4] 
http://hg.openjdk.java.net/jdk6/jdk6/hotspot/file/1f963de41717/src/share/vm/prims/jvm.cpp#l658
    [5] 
http://hg.openjdk.java.net/jdk6/jdk6/hotspot/file/1f963de41717/src/os/linux/vm/os_linux.cpp#l1608
    [6] 
https://github.com/lattera/glibc/blob/a7a93d5086103f52367d3e9776976eb0b0bc6c7b/sysdeps/gnu/errlist.c#L220
    [7] http://man7.org/linux/man-pages/man7/ipv6.7.html
    [8] https://github.com/torvalds/linux/blob/master/net/ipv6/af_inet6.c#L335
    [9] 
http://hg.openjdk.java.net/jdk6/jdk6/jdk/file/ca2c9f498b70/src/solaris/native/java/net/NetworkInterface.c#l1101
    [10] 
https://builds.apache.org/job/incubator-brooklyn-pull-requests/1247/console
    [11] https://wiki.debian.org/LXC/SimpleBridge
    [12] 
http://blog.scottlowe.org/2013/09/04/introducing-linux-network-namespaces/

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/neykov/incubator-brooklyn 
fix-port-not-available

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-brooklyn/pull/662.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #662
    
----
commit 72571f2c65bf2cca16520cc525e6b61812e51158
Author: Svetoslav Neykov <[email protected]>
Date:   2015-05-27T11:51:27Z

    Fix for the "unable to find a free port" error that has been plaguing the 
jenkins builds
    
    If the free port check fails for a specific interface, do an additional 
check if able to bind to an OS picked port on the same IP. If it fails, then 
there's something wrong with the interface, not just a port already bound error.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

Reply via email to