Hi Paul,

On Fri, Feb 11, 2011 at 11:53:44PM +0000, [email protected] wrote:
> I have a test system consisting of Apache JMeter -> haproxy (1.4.11) ->
> Apache 2.2 on a solaris 10 zone.
> The issue I am facing is that, whilst at the start of a load test the
> response profile looks fine, it then falls apart with long responses.
>
> The issue can be repeated easily, with condition marked by a few hundred
> SYN_SENT lines in netstat.
> The onset of the SYN_SENT is preceded by the number of TIME_WAIT in
> netstat hitting ~32000.
>
> Now the thing that is confusing me :- take haproxy out of the equation and
> the system behaves.
> So exactly the same test - which in this case results in ~18,000 TIME_WAIT
> connections - but solid and consistent.

There are several possible explanations for this.

The first one is that you don't have enough source ports to avoid
reallocating a socket which is still in TIME_WAIT state. On Solaris, if my
memory serves me right, you can set the range with tcp_smallest_anon_port and
tcp_largest_anon_port using "ndd /dev/tcp". I have memories of figures around
49152-61000. At the very least, extending the range to 1024-65535 will let you
see whether the problem still happens, happens later, or does not change at
all. To be honest, I doubt this is the reason, because you say that you
observe SYN_SENT sockets, which means the system managed to find a free port.

The second possible explanation is that the system uses too strong randoms
for its initial sequence numbers. A client may reuse the same source ip:port
only if the sequence number is above the end of the previous window. Solaris
has a setting, "tcp_strong_iss", which lets you decide the strength of the
randoms. If the random is too strong and does not take the previous socket
into account, the SYN will be sent from the haproxy socket to the Apache
socket, and the receiving system will simply ignore it, believing it's a
retransmit from the previous socket. That's quite common when you have to go
through firewalls which blindly enforce randomness. This issue could happen
if you have tcp_strong_iss set to 2. I don't remember if it can happen with 1,
and it will definitely not happen at zero.

One last possibility is that the Apache sockets are not correctly released
(for whatever reason) and that the listen queue is full. When this happens,
Solaris emits something like "dropping packet, possible SYN flood" in the
system logs. This is easy to check, but fixing it could be tedious if we
don't find out why it happens.

Regards,
Willy
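
A minimal sketch for watching the TIME_WAIT and SYN_SENT counts discussed
above while a load test runs. It is not part of haproxy or Solaris: the
function name count_tcp_states is hypothetical, it needs Python 3, and it
assumes the TCP state is the last column of each "netstat -an" line, which
should be checked against the actual output on the box being tested.

```python
from collections import Counter

# State names as commonly printed by netstat on Solaris and Linux.
# Assumption: this set covers the states of interest; extend it if needed.
TCP_STATES = {
    "ESTABLISHED", "SYN_SENT", "SYN_RCVD", "SYN_RECV",
    "FIN_WAIT_1", "FIN_WAIT_2", "TIME_WAIT", "CLOSE_WAIT",
    "CLOSING", "LAST_ACK", "LISTEN", "IDLE", "BOUND", "CLOSED",
}


def count_tcp_states(netstat_output):
    """Count sockets per TCP state in the text output of 'netstat -an'.

    Assumes the state is the last whitespace-separated field on a line;
    header lines and lines without a known state are skipped.
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if fields and fields[-1] in TCP_STATES:
            counts[fields[-1]] += 1
    return counts
```

Feeding it the captured output of "netstat -an" every few seconds, once on
the haproxy side and once on the Apache side, shows whether SYN_SENT starts
to climb as TIME_WAIT nears ~32000, and whether widening the port range or
changing tcp_strong_iss moves that point.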

