Re: carp flapping

Stuart Henderson Fri, 12 May 2023 05:18:55 -0700

On 2023-05-12, Nick Holland <n...@holland-consulting.net> wrote:
> On 5/12/23 03:28, Stuart Henderson wrote:
>> On 2023-05-12, Nick Holland <n...@holland-consulting.net> wrote:
>>> Here's the problem I've seen:  I have my two machines flipping state
>>> randomly(?).  This bothers me because that means it is breaking  people's
>>> downloads.  Longest period between flips was less than two weeks.
>>>
>>> So ... I cranked up the carp logging to 5 and then 7 to see what it had
>>> to say about why...and it had almost nothing to say.
>> 
>> Does netstat -s -p carp give any enlightenment?
>
>
> ok, I just skewed the stats by taking the opportunity to bring the now
> backup up to -current, so node1 does not have the most recent flap:
>
> node1 $ uptime
>   7:18AM  up  8:22, 1 user, load averages: 0.00, 0.05, 0.08
>
> node1 $ doas netstat -s -p carp
> carp:
>          29981 packets received (IPv4)
>          0 packets received (IPv6)
>                  0 packets discarded for bad interface
>                  0 packets discarded for wrong TTL
>                  0 packets shorter than header
>                  0 discarded for bad checksums
>                  0 discarded packets with a bad version
>                  0 discarded because packet too short
>                  0 discarded for bad authentication
>                  0 discarded for unknown vhid
>                  0 discarded because of a bad address list
>          0 packets sent (IPv4)
>          0 packets sent (IPv6)
>                  0 send failed due to mbuf memory error
>          0 transitions to master
>
>   node2 $ uptime
>   7:19AM  up 4 days, 20:58, 2 users, load averages: 0.83, 0.78, 0.73
>
> $ ] netstat -s -p carp
> carp:
>          367836 packets received (IPv4)
>          0 packets received (IPv6)
>                  0 packets discarded for bad interface
>                  0 packets discarded for wrong TTL
>                  0 packets shorter than header
>                  0 discarded for bad checksums
>                  0 discarded packets with a bad version
>                  0 discarded because packet too short
>                  0 discarded for bad authentication
>                  0 discarded for unknown vhid
>                  0 discarded because of a bad address list
>          52806 packets sent (IPv4)
>          0 packets sent (IPv6)
>                  0 send failed due to mbuf memory error
>          2 transitions to master
>
>
> Will monitor going forward, though.
>
>
> I had several other people suggest network problems.  I'm not going to
> say "impossible" or even "unlikely", but my understanding is that the
> two machines are both plugged into the same switch, in the same rack.


You can also look at

netstat -ni -I ixl0
netstat -ni -I ixl0 -e
kstat ixl0:::

which may give some other clues.

Even pfctl -si might show something relevant.
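
To tie the next flap to a timestamp, the "transitions to master" counter
from netstat -s -p carp can be sampled periodically. A minimal sketch,
assuming the output format shown above; the function names are made up
for illustration:

```shell
# Extract the "transitions to master" count from `netstat -s -p carp`
# output supplied on stdin.
carp_transitions() {
    awk '/transitions to master/ { print $1 }'
}

# Print one timestamped sample of the counter (hypothetical helper).
log_carp_transitions() {
    printf '%s transitions=%s\n' \
        "$(date +%Y-%m-%dT%H:%M:%S)" \
        "$(netstat -s -p carp | carp_transitions)"
}

# Example use, one sample a minute:
#   while :; do log_carp_transitions; sleep 60; done >> carp-flaps.log
```

A change in the logged count between two samples narrows down when the
flap happened, which can then be compared against the interface counters.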

> Several people pointed out I was using the default advskew of 1 second,
> which means a small network glitch (or system load?  maybe I'm all wrong
> about this system never breaking a sweat, at least when it comes to
> network traffic) would flip it, so I've increased it to 10 on both
> machines (and apparently just induced a flip of my own. oops).  By the
> nature of this system, some people will be annoyed by any flip, so it
> really doesn't matter if it was a 1 second outage or a 30 second outage,
> I just want the system available again after an unhappy event (or
> routine maintenance).

The coarse adjustment in seconds is advbase; advskew is a much smaller
delay, meant for a primary/backup config where the backup advertises
just slightly less frequently.
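
As a rough illustration, assuming the units documented in carp(4)
(advbase in seconds, advskew in 1/256ths of a second, 0 to 254), the
advertisement interval works out to advbase + advskew/256 seconds. The
helper name carp_interval is made up for this sketch:

```shell
# Advertisement interval in seconds: advbase + advskew/256.
# Arguments: $1 = advbase, $2 = advskew.
carp_interval() {
    awk -v base="$1" -v skew="$2" \
        'BEGIN { printf "%.3f\n", base + skew / 256 }'
}

carp_interval 1 0      # defaults: prints 1.000
carp_interval 1 100    # backup skewed by 100: prints 1.391
carp_interval 10 0     # advbase raised to 10: prints 10.000
```

So even the largest advskew adds less than a second; a longer tolerance
for missed advertisements comes from advbase.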
