On 2023-05-12, Nick Holland <n...@holland-consulting.net> wrote:
> On 5/12/23 03:28, Stuart Henderson wrote:
>> On 2023-05-12, Nick Holland <n...@holland-consulting.net> wrote:
>>> Here's the problem I've seen: I have my two machines flipping state
>>> randomly(?). This bothers me because that means it is breaking people's
>>> downloads. Longest period between flips was less than two weeks.
>>>
>>> So ... I cranked up the carp logging to 5 and then 7 to see what it had
>>> to say about why...and it had almost nothing to say.
>>
>> Does netstat -s -p carp give any enlightenment?
>
>
> ok, I just skewed the stats by taking the opportunity to bring the now
> backup up to -current, so node1 does not have the most recent flap:
>
> node1 $ uptime
>  7:18AM  up  8:22, 1 user, load averages: 0.00, 0.05, 0.08
>
> node1 $ doas netstat -s -p carp
> carp:
>     29981 packets received (IPv4)
>     0 packets received (IPv6)
>     0 packets discarded for bad interface
>     0 packets discarded for wrong TTL
>     0 packets shorter than header
>     0 discarded for bad checksums
>     0 discarded packets with a bad version
>     0 discarded because packet too short
>     0 discarded for bad authentication
>     0 discarded for unknown vhid
>     0 discarded because of a bad address list
>     0 packets sent (IPv4)
>     0 packets sent (IPv6)
>     0 send failed due to mbuf memory error
>     0 transitions to master
>
> node2 $ uptime
>  7:19AM  up 4 days, 20:58, 2 users, load averages: 0.83, 0.78, 0.73
>
> $ ] netstat -s -p carp
> carp:
>     367836 packets received (IPv4)
>     0 packets received (IPv6)
>     0 packets discarded for bad interface
>     0 packets discarded for wrong TTL
>     0 packets shorter than header
>     0 discarded for bad checksums
>     0 discarded packets with a bad version
>     0 discarded because packet too short
>     0 discarded for bad authentication
>     0 discarded for unknown vhid
>     0 discarded because of a bad address list
>     52806 packets sent (IPv4)
>     0 packets sent (IPv6)
>     0 send failed due to mbuf memory error
>     2 transitions to master
>
>
> Will monitor going forward, though.
>
>
> I had several other people suggest network problems. I'm not going to
> say "impossible" or even "unlikely", but my understanding is that the
> two machines are both plugged into the same switch, in the same rack.
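
For the monitoring, a small helper can pull the flap counter out of the
netstat output quoted above. This is only a sketch: the function name
carp_transitions is made up for illustration, and it assumes the counter
line keeps the "N transitions to master" wording shown in the output.

```shell
# carp_transitions: read `netstat -s -p carp` output on stdin and
# print the number from the "transitions to master" line.
# (Hypothetical helper name; the line format is taken from the
# output quoted in this thread.)
carp_transitions() {
    awk '/transitions to master/ { print $1 }'
}

# Example use on a live system (not run here):
#   netstat -s -p carp | carp_transitions
```

Logging that number with a timestamp from cron would at least narrow down
when a flip happened, even if it says nothing about why.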

You can also look at

netstat -ni -I ixl0
netstat -ni -I ixl0 -e
kstat ixl0:::

which may give some other clues

even pfctl -si might have something relevant

> Several people pointed out I was using the default advskew of 1 second,
> which means a small network glitch (or system load? maybe I'm all wrong
> about this system never breaking a sweat, at least when it comes to
> network traffic) would flip it, so I've increased it to 10 on both
> machines (and apparently just induced a flip of my own. oops). By the
> nature of this system, some people will be annoyed by any flip, so it
> really doesn't matter if it was a 1 second outage or a 30 second outage,
> I just want the system available again after an unhappy event (or
> routine maintenance).

the coarse adjustment in seconds is advbase, advskew is a much smaller
delay meant for a config with primary/backup where the backup advertises
just slightly less frequently.
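
For a sense of scale, here is a rough sketch of the arithmetic. It assumes
the advertisement interval is advbase seconds plus advskew/256 seconds,
which is how ifconfig(8) describes the two options. The function name
carp_interval is made up for illustration.

```shell
# carp_interval ADVBASE ADVSKEW
# Print the approximate CARP advertisement interval in seconds,
# assuming interval = advbase + advskew/256 (advskew is in 1/256 s).
# (Hypothetical helper, not part of any OpenBSD tool.)
carp_interval() {
    awk -v base="$1" -v skew="$2" \
        'BEGIN { printf "%.3f\n", base + skew / 256 }'
}

carp_interval 1 0     # advbase 1, no skew
carp_interval 1 10    # advbase 1, advskew 10 as set in this thread
```

With advbase 1, an advskew of 10 adds only about 0.039 seconds to the
interval, so advbase is the setting that moves things by whole seconds.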