On Thu, May 21, 2009 at 11:05:40AM +0100, Dan Carley wrote:
> 
> We've been playing with relayd recently - both from 4.5 and the latest
> snapshot.
> 
> Approximately every hour we are seeing one or two state changes logged. But
> I can't see reason for the change of state and there doesn't appear to be a
> pattern in the way that the hosts are failed.

We just happened to notice the same thing here.

Here's the info I could gather on this, but I suspect the
problem might not be relayd itself.

My relayd configuration is as follows:

relayd.conf:
----
interval 5
log updates
timeout 3000

table <floods> {
        10.0.1.10
        10.0.2.10
        10.0.10.10
}

redirect test2 {
        listen on 10.0.1.15 port 30099
        forward to <floods> check tcp
}

redirect test {
        listen on 10.137.16.192 port 30100
        forward to <floods> check tcp
}
----

# relayctl show summary
Id      Type            Name                            Avlblty Status
1       redirect        test2                                   active
1       table           floods:30099                            active (3 hosts)
1       host            10.0.1.10                       100.00% up
2       host            10.0.2.10                       100.00% up
3       host            10.0.10.10                      100.00% up
2       redirect        test                                    active
2       table           floods:30100                            active (3 hosts)
4       host            10.0.1.10                       100.00% up
5       host            10.0.2.10                       100.00% up
6       host            10.0.10.10                      100.00% up


At random times (once or twice an hour on average), we get the following
in the logs:

May 26 18:00:31 testfw1 relayd[25554]: host 10.0.1.10,
        check tcp (0ms), state up -> down, availability 99.92%
May 26 18:00:36 testfw1 relayd[25554]: host 10.0.1.10,
        check tcp (0ms), state down -> up, availability 99.92%

But we can confirm that the service does not actually go down. The
firewalls are redundant and run the same relayd config. Both of them log
these spurious up/down transitions, but never at the same moment, which
is what we would expect to see if the service had really failed.

After adding some debugging code to relayd, I found that connect() returns
EADDRINUSE at check_tcp.c:87. This seemed strange at first, since
SO_REUSEPORT is set on the socket a few lines above. Also, the firewalls
used for this test are nearly idle, with fewer than 100 sockets open at a
time, most of them used by relayd for its TCP checks. So we're clearly not
running out of ephemeral ports.

Just to see what would happen, I took the CVS source for relayd,
commented out the SO_REUSEPORT option, recompiled, and restarted it.
Strangely, the up/down transitions are now gone. I would expect
SO_REUSEPORT to prevent EADDRINUSE errors, not cause them, so I'm a bit
puzzled...
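
For anyone who wants to poke at this outside relayd, here is a minimal
sketch of the sequence described above: create a socket, optionally set
SO_REUSEPORT, switch to non-blocking mode, call connect(), and look at
errno. This is not relayd's code. The names probe_tcp,
classify_connect_errno and the probe_result enum are made up for
illustration, and treating EADDRINUSE as a local problem rather than a
dead host is only an assumption about how one might want to log it.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical result codes, not taken from relayd. */
enum probe_result {
	PROBE_OK,		/* connect() succeeded or is in progress */
	PROBE_LOCAL_ERROR,	/* failure on our side, e.g. EADDRINUSE */
	PROBE_HOST_DOWN		/* any other connect() failure */
};

/* Map the errno left by a non-blocking connect() to a result code. */
enum probe_result
classify_connect_errno(int err)
{
	switch (err) {
	case 0:
	case EINPROGRESS:
		return PROBE_OK;
	case EADDRINUSE:
		return PROBE_LOCAL_ERROR;
	default:
		return PROBE_HOST_DOWN;
	}
}

/*
 * Start one TCP probe against addr:port (IPv4 only). If use_reuseport is
 * non-zero and the platform defines SO_REUSEPORT, the option is set
 * before connect(). The errno from connect() (0 on immediate success)
 * is stored in *err_out. The socket is closed right away, so this only
 * shows the immediate result; a real check would wait for writability.
 */
enum probe_result
probe_tcp(const char *addr, unsigned short port, int use_reuseport,
    int *err_out)
{
	struct sockaddr_in sin;
	int s, flags, one = 1, err = 0;

	assert(addr != NULL && err_out != NULL);
	(void)one;
	(void)use_reuseport;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(port);
	if (inet_pton(AF_INET, addr, &sin.sin_addr) != 1) {
		*err_out = EINVAL;
		return PROBE_LOCAL_ERROR;
	}

	if ((s = socket(AF_INET, SOCK_STREAM, 0)) == -1) {
		*err_out = errno;
		return PROBE_LOCAL_ERROR;
	}
#ifdef SO_REUSEPORT
	if (use_reuseport &&
	    setsockopt(s, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) == -1) {
		*err_out = errno;
		close(s);
		return PROBE_LOCAL_ERROR;
	}
#endif
	if ((flags = fcntl(s, F_GETFL)) == -1 ||
	    fcntl(s, F_SETFL, flags | O_NONBLOCK) == -1) {
		*err_out = errno;
		close(s);
		return PROBE_LOCAL_ERROR;
	}

	if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) == -1)
		err = errno;
	close(s);

	*err_out = err;
	return classify_connect_errno(err);
}
```

Calling probe_tcp("10.0.1.10", 30099, 1, &err) in a loop every few
seconds, and again with use_reuseport set to 0, should show whether
EADDRINUSE only turns up when the option is set. I have not verified
that this reproduces the problem.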

Could anyone help shed light on this?

Thanks,
-- 
Pascal
