Hi Paul,

On Fri, Feb 11, 2011 at 11:53:44PM +0000, [email protected] wrote:
> I have a test system consisting of Apache JMeter -> haproxy (1.4.11) ->
> Apache 2.2 on a solaris 10 zone.
> The issue I am facing is that, whilst at the start of a load test the
> response profile looks fine, it then falls apart with long responses.
> 
> The issue can be repeated easily, with condition marked by a few hundred
> SYN_SENT lines in netstat.
> The onset of the SYN_SENT is preceeded by the number of TIME_WAIT in
> netstat hitting ~32000.
> 
> Now the thing that is confusing me :- take haproxy out of the equation and
> the system behaves.
> So exactly the same test - which in this case results in ~18,000 TIME_WAIT
> connections - but solid and consistent.

There are several possible explanations for this.

The first one is that you don't have enough source ports to prevent
reallocating a socket which is still in TIME_WAIT state. On Solaris,
if my memory serves me right, you can set that with tcp_smallest_anon_port
and tcp_largest_anon_port with "ndd /dev/tcp". I have memories of figures
around 49152-61000. At least extending the range to 1024-65535 will let
you see if it still happens, if it happens later or if it does not change
anything. To be honest, I doubt this could be the reason because you're
saying that you observe SYN_SENT sockets, so that means the system managed
to find a free port.
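
As a rough illustration of the port arithmetic, here is a minimal Python
sketch. It estimates how many new connections per second a source-port range
can sustain toward a single destination ip:port before a port still in
TIME_WAIT has to be reused. The 60-second TIME_WAIT duration is an assumption
for illustration, not a value taken from this thread, and the function names
are hypothetical. The two ranges are the ones mentioned above.

```python
# Sketch only: TIME_WAIT_SECONDS is an assumed value, check the real
# setting on the system under test.
TIME_WAIT_SECONDS = 60


def port_count(smallest, largest):
    """Number of usable source ports in an inclusive range."""
    return largest - smallest + 1


def max_rate(smallest, largest, time_wait=TIME_WAIT_SECONDS):
    """Connections per second sustainable without reusing a TIME_WAIT port.

    Each closed connection holds its source port for `time_wait` seconds,
    so at most port_count / time_wait new connections per second can be
    opened toward one destination ip:port.
    """
    return port_count(smallest, largest) / time_wait


if __name__ == "__main__":
    for label, lo, hi in (("remembered range", 49152, 61000),
                          ("extended range", 1024, 65535)):
        print(f"{label}: {port_count(lo, hi)} ports, "
              f"about {max_rate(lo, hi):.0f} conn/s")
```

If the load test opens connections faster than the estimated rate, the first
explanation becomes more plausible; if it stays well below, it is less likely.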

The second possible explanation is that the system uses overly strong random
values for its initial sequence numbers. A client may reuse the same source
ip:port only if the sequence number is above the end of the previous window.
Solaris has a setting "tcp_strong_iss" which lets you decide the strength of
the randomness. If the randomization is too strong and does not take the
previous socket into account, the SYN will be sent from the haproxy socket to
the apache socket, and the system will simply ignore it, believing it's a
retransmit from the previous socket. That's quite common when you have to go
through firewalls which blindly enforce randomness. This issue could happen
if you have tcp_strong_iss set to 2. I don't remember if it can happen with 1,
and it definitely will not at 0.
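
The sequence number rule can be sketched in Python as follows. This is a
simplified model of the comparison described above, not the code of any real
TCP stack: it only checks whether the new initial sequence number is "after"
the previous connection's last sequence number in 32-bit wrap-around
arithmetic. The function names are hypothetical, and the simulation assumes
a fully random ISS that ignores the previous connection.

```python
import random

SEQ_MOD = 2 ** 32  # TCP sequence numbers are 32 bits and wrap around


def seq_after(a, b):
    """True if sequence number a is strictly after b, modulo 2**32."""
    diff = (a - b) % SEQ_MOD
    return 0 < diff < SEQ_MOD // 2


def syn_accepted(new_iss, prev_next_seq):
    """Simplified rule: a SYN reusing a TIME_WAIT ip:port pair is accepted
    only if its ISS is beyond what the previous connection used."""
    return seq_after(new_iss, prev_next_seq)


def acceptance_ratio(trials=100_000, seed=1):
    """Fraction of fully random ISS values that pass the check."""
    rng = random.Random(seed)
    prev = rng.randrange(SEQ_MOD)
    accepted = sum(syn_accepted(rng.randrange(SEQ_MOD), prev)
                   for _ in range(trials))
    return accepted / trials


if __name__ == "__main__":
    print(f"accepted fraction with random ISS: {acceptance_ratio():.3f}")
```

Under this model, a purely random ISS passes the check only about half of the
time, whereas an ISS that always increases relative to the previous connection
would always pass.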

One last possibility would be that the apache sockets are not correctly
released (for whatever reason) and that the listen queue is full. When
this happens, Solaris emits something like "dropping packet, possible
SYN flood" in the system logs. This is easy to check; however, fixing it
could be tedious if we don't find out why it happens.

Regards,
Willy

