On Thu, May 21, 2009 at 11:05:40AM +0100, Dan Carley wrote:
>
> We've been playing with relayd recently - both from 4.5 and the latest
> snapshot.
>
> Approximately every hour we are seeing one or two state changes logged. But
> I can't see reason for the change of state and there doesn't appear to be a
> pattern in the way that the hosts are failed.

We just happened to notice the same thing here. Here's the info I could
gather on this, but I suspect the problem might not be relayd itself.

My relayd configuration is as follows:

relayd.conf:
----
interval 5
log updates
timeout 3000

table <floods> { 10.0.1.10 10.0.2.10 10.0.10.10 }

redirect test2 {
        listen on 10.0.1.15 port 30099
        forward to <floods> check tcp
}

redirect test {
        listen on 10.137.16.192 port 30100
        forward to <floods> check tcp
}
----

# relayctl show summary
Id   Type       Name                 Avlblty  Status
1    redirect   test2                         active
1    table      floods:30099                  active (3 hosts)
1    host       10.0.1.10            100.00%  up
2    host       10.0.2.10            100.00%  up
3    host       10.0.10.10           100.00%  up
2    redirect   test                          active
2    table      floods:30100                  active (3 hosts)
4    host       10.0.1.10            100.00%  up
5    host       10.0.2.10            100.00%  up
6    host       10.0.10.10           100.00%  up

Now, at random times (1-2 per hour on average), we get the following
entries in the logs:

May 26 18:00:31 testfw1 relayd[25554]: host 10.0.1.10, check tcp (0ms), state up -> down, availability 99.92%
May 26 18:00:36 testfw1 relayd[25554]: host 10.0.1.10, check tcp (0ms), state down -> up, availability 99.92%

But we can confirm that the service does not actually go down. The
firewalls are redundant with the same relayd config, and they don't see
the service going down at the same time (both do, however, show the same
up/down behavior).

After adding some debugging code to relayd, I found that connect() returns
EADDRINUSE at check_tcp.c:87. This seemed strange at first, since
SO_REUSEPORT is set on the socket a few lines above. Also, the firewalls
used to test this are almost idle, with fewer than 100 sockets open at a
time, mostly used by relayd performing TCP checks. So we're clearly not
running out of ephemeral ports.

Just for the sake of trying, I took the CVS source for relayd, commented
out the SO_REUSEPORT option, recompiled and restarted it. Strangely, the
up/down's are now gone. I would expect SO_REUSEPORT to prevent EADDRINUSE
errors, so I'm a bit puzzled...
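
To illustrate why this looks like a false "down", here is a minimal sketch
of a TCP check that sets SO_REUSEPORT before connect() and reports which
errno the connect failed with. It is written in Python for brevity and is
not relayd's actual code from check_tcp.c; the names check_tcp() and
is_local_failure() are hypothetical, and the 3 second default only mirrors
the "timeout 3000" setting above. As far as I understand, EADDRINUSE from
connect() describes the local end of the connection, so it says nothing
about whether the backend is reachable.

```python
import errno
import socket


def check_tcp(host, port, timeout=3.0, reuseport=True):
    """Run one TCP connect probe; return (is_up, errno_name_or_None).

    Hypothetical helper, not relayd's implementation.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Set SO_REUSEPORT before connect(), as the post describes relayd
        # doing. The option is not available on every platform.
        if reuseport and hasattr(socket, "SO_REUSEPORT"):
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        s.settimeout(timeout)
        s.connect((host, port))
        return True, None
    except OSError as e:
        # A timeout carries no errno, so fall back to a generic label.
        return False, errno.errorcode.get(e.errno, "timeout/unknown")
    finally:
        s.close()


def is_local_failure(errno_name):
    """True for errors about the local address rather than the remote host."""
    return errno_name in ("EADDRINUSE", "EADDRNOTAVAIL")


# Example use (addresses taken from the config above):
#   up, err = check_tcp("10.0.1.10", 30099)
#   if not up and is_local_failure(err):
#       print("local socket problem, backend may still be up:", err)
```

A check written this way could log the errno next to the state change,
which would make it obvious in the logs when a transition was triggered by
a local socket error rather than by the backend refusing or timing out.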

Could anyone help shed light on this?

Thanks,

-- 
Pascal