Daniel Hartmeier wrote:
I'd first make sure it's not CARP related (i.e. all packets always pass
through one box), by (temporarily) turning off the backup box. If, for
some reason, packets would flow through both boxes (some through the
master, some through the backup), things would break in funny ways.
Done. The simplest way to do this is to bring down the two carp interfaces on the slave box. As far as I can see, this
makes the slave nonexistent in the carp world. Correct me if I'm wrong.
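For reference, taking the backup out of CARP this way can be sketched roughly as follows. The interface names carp0 and carp1 used in the usage note are assumptions (substitute the real ones), and the DRYRUN switch is a hypothetical convenience for previewing the commands, not something ifconfig itself provides.

```shell
#!/bin/sh
# Bring the given carp interfaces down on the backup box.
# With DRYRUN set to a non-empty value, only print the commands
# instead of running them (hypothetical helper, not part of ifconfig).
carp_down() {
    for ifc in "$@"; do
        ${DRYRUN:+echo} ifconfig "$ifc" down
    done
}
```

Calling `DRYRUN=1 carp_down carp0 carp1` only prints the two ifconfig commands. Without DRYRUN the same call runs them, which needs root. `ifconfig carp0 up` (and likewise for the second interface) should bring the backup back afterwards.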
Now that everything must pass through the master, enable debug logging
(pfctl -xm), note counters (pfctl -si), and reproduce the problem once.
If you can, tcpdump one faulty connection (from the initial SYN to where
the problem shows) on all relevant interfaces (two, I assume).
Check /var/log/messages for lines from pf, especially "BAD state". Note
updated counters (pfctl -si again), and diff old vs. new. Which counters
are increasing?
I did get a lot of "/bsd: pf: BAD state:" lines in the log from this. Each line includes the endpoints for the state,
but none of these "BAD state" lines contains even one of the IPs in a faulty connection. So it seems to me that no
"BAD state" is logged when this disconnection occurs.
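The check described above (do any "BAD state" lines mention an address from the faulty connection?) can be sketched as a small filter. The function name is made up for illustration, and the default log path is simply the /var/log/messages mentioned earlier. It assumes IPv4 addresses written in dotted form.

```shell
#!/bin/sh
# Print pf "BAD state" log lines that mention the given IPv4 address.
# usage: bad_state_for_ip <address> [logfile]
bad_state_for_ip() {
    # Escape the dots so they match literally in the regex below.
    ip_re=$(printf '%s' "$1" | sed 's/\./\\./g')
    # Require a non-digit (or a line edge) on both sides, so that
    # 10.0.0.1 does not also match 10.0.0.10 or 110.0.0.1.
    grep -F 'pf: BAD state' "${2:-/var/log/messages}" |
        grep -E "(^|[^0-9.])${ip_re}([^0-9]|\$)"
}
```

Running it once for the client address and once for the server address of a faulty connection, and getting no output both times, would match the observation above.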
As to which counters are increasing, I see this now:
State Table                          Total             Rate
  current entries                    24918
  searches                       318334943        12907.4/s
  inserts                          2353683           95.4/s
  removals                         2347284           95.2/s
Counters
  match                          189879481         7699.0/s
  bad-offset                             0            0.0/s
  fragment                               7            0.0/s
  short                                  2            0.0/s
  normalize                             79            0.0/s
  memory                                 0            0.0/s
  bad-timestamp                          0            0.0/s
  congestion                         91183            3.7/s
  ip-option                              0            0.0/s
  proto-cksum                         5816            0.2/s
  state-mismatch                    108359            4.4/s
  state-insert                           0            0.0/s
  state-limit                            0            0.0/s
  src-limit                              0            0.0/s
  synproxy                               0            0.0/s
It actually seems to me that all 'active' counters are increasing, except perhaps the total number of states, which is
slightly lower.
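Since most counters grow all the time on a busy firewall, the useful figure is how much each one moved across a single reproduction of the problem. A minimal sketch of that "diff old vs. new" step, assuming two saved copies of the `pfctl -si` output in the layout shown above (the function name and the file names are hypothetical):

```shell
#!/bin/sh
# Print every counter whose total differs between two saved
# "pfctl -si" outputs, together with the signed difference.
# usage: pf_counter_delta <before-file> <after-file>
pf_counter_delta() {
    awk '
        # Split a line into name (may contain spaces) and total.
        # Returns 0 for headers and other lines without a numeric total.
        function parse(    n, i) {
            n = NF
            if ($n ~ /\/s$/) n--            # drop the trailing rate column
            if (n < 2 || $n !~ /^[0-9]+$/) return 0
            name = $1
            for (i = 2; i < n; i++) name = name " " $i
            total = $n
            return 1
        }
        # First file: remember each total.
        NR == FNR { if (parse()) before[name] = total; next }
        # Second file: report totals that changed.
        parse() && (name in before) && (total + 0 != before[name] + 0) {
            printf "%-20s %+d\n", name, total - before[name]
        }
    ' "$1" "$2"
}
```

Typical use would be `pfctl -si > before.txt`, reproduce the problem once, `pfctl -si > after.txt`, then `pf_counter_delta before.txt after.txt`. Counters that stayed the same are not printed, so anything other than the ordinary traffic counters (searches, inserts, removals, match) stands out.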
In your previous tcpdump, the client starts to use SACK after one packet
from the server is lost. Maybe that is what distinguishes the clients
(some use SACK, some don't). You could confirm this theory by
(temporarily) disabling SACK on the server (net.inet.tcp.sack=0 on
OpenBSD).
Turning net.inet.tcp.sack off did not change the problem for better or worse,
so I turned it back on.
--
Per Gøtterup <[EMAIL PROTECTED]> · Systems Administrator & Support
WebHotel.net · INFORCE A/S · Sydvestvej 100 · DK-2600 Glostrup · Denmark
Phone: +45 70232490 · Fax: +45 70232480 · Web: www.webhotel.net