Daniel Hartmeier wrote:
I'd first make sure it's not CARP related (i.e. all packets always pass
through one box), by (temporarily) turning off the backup box. If, for
some reason, packets would flow through both boxes (some through the
master, some through the backup), things would break in funny ways.

Done. The simplest way to do this is to take down the two carp interfaces on the slave box. As far as I can see, this makes the slave non-existent in the CARP world. Correct me if I'm wrong.

Now that everything must pass through the master, enable debug logging
(pfctl -xm), note counters (pfctl -si), and reproduce the problem once.
If you can, tcpdump one faulty connection (from the initial SYN to where
the problem shows) on all relevant interfaces (two, I assume).

Check /var/log/messages for lines from pf, especially "BAD state". Note
updated counters (pfctl -si again), and diff old vs. new. Which counters
are increasing?

I did get a lot of "/bsd: pf: BAD state:" lines in the log from this. Each line includes the endpoints of the state, though, and none of these "BAD state" lines contains even one of the IPs in a faulty connection, so it seems to me that no "BAD state" is logged when this disconnection occurs.
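
For anyone who wants to repeat this check mechanically, here is a minimal sketch in Python. It assumes only that the relevant lines contain the text "pf: BAD state:" and that the endpoints appear as IPv4 dotted quads, optionally followed by ":port". The function name bad_state_lines is made up for illustration, and the rest of the log line layout is not assumed.

```python
import re

# IPv4 dotted quad; the word boundaries keep "192.0.2.10" from
# matching inside "192.0.2.100", and a trailing ":port" is ignored.
IPV4_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

def bad_state_lines(log_text, addresses):
    """Return the "BAD state" log lines that mention any of the given IPs."""
    wanted = set(addresses)
    hits = []
    for line in log_text.splitlines():
        if "pf: BAD state:" not in line:
            continue
        # Compare whole addresses, not substrings.
        if set(IPV4_RE.findall(line)) & wanted:
            hits.append(line)
    return hits
```

Feeding it the contents of /var/log/messages and the two endpoints of a faulty connection should return an empty list if, as above, none of the "BAD state" lines belong to that connection.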

As to which counters are increasing, I see this now:

State Table                          Total             Rate
  current entries                    24918
  searches                       318334943        12907.4/s
  inserts                          2353683           95.4/s
  removals                         2347284           95.2/s
Counters
  match                          189879481         7699.0/s
  bad-offset                             0            0.0/s
  fragment                               7            0.0/s
  short                                  2            0.0/s
  normalize                             79            0.0/s
  memory                                 0            0.0/s
  bad-timestamp                          0            0.0/s
  congestion                         91183            3.7/s
  ip-option                              0            0.0/s
  proto-cksum                         5816            0.2/s
  state-mismatch                    108359            4.4/s
  state-insert                           0            0.0/s
  state-limit                            0            0.0/s
  src-limit                              0            0.0/s
  synproxy                               0            0.0/s

It actually seems to me that all 'active' counters are increasing, except perhaps the total number of states, which is slightly lower.
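
The old-vs-new comparison of two "pfctl -si" snapshots can be scripted. The following is a rough sketch in Python, assuming the output has the layout shown above (indented lines with a name, an integer total and an optional rate). The function names parse_pf_info and diff_counters are invented for this example, and other sections that pfctl may print are not handled.

```python
import re

# Matches indented lines such as
#   "  state-mismatch    108359    4.4/s"   or
#   "  current entries   24918"
# Unindented headers ("State Table ...", "Counters") do not match.
LINE_RE = re.compile(r"^\s+(\S.*?)\s+(\d+)(?:\s+[\d.]+/s)?\s*$")

def parse_pf_info(text):
    """Return {counter name: total} from saved `pfctl -si` output."""
    totals = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            totals[m.group(1)] = int(m.group(2))
    return totals

def diff_counters(before, after):
    """Return {name: delta} for every counter whose total changed."""
    old = parse_pf_info(before)
    new = parse_pf_info(after)
    return {name: new[name] - old.get(name, 0)
            for name in new
            if new[name] != old.get(name, 0)}
```

Saving the output once before and once after reproducing the problem, then passing both texts to diff_counters, lists only the counters that moved in between, which is easier to read than two full tables.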

In your previous tcpdump, the client starts to use SACK after one packet
from the server is lost. Maybe that is what distinguishes the clients
(some use SACK, some don't). You could confirm this theory by
(temporarily) disabling SACK on the server (net.inet.tcp.sack=0 on
OpenBSD).

Turning net.inet.tcp.sack off did not change the problem for better or worse, so I turned it back on.

--
Per Gøtterup <[EMAIL PROTECTED]> · Systems Administrator & Support
WebHotel.net · INFORCE A/S · Sydvestvej 100 · DK-2600 Glostrup · Denmark
Phone: +45 70232490 · Fax: +45 70232480 · Web: www.webhotel.net
