That's good to know. While 2,000 concurrent connections is what we do
right now, it will be closer to 10,000 concurrent connections come the
holiday season, which works out to about 2.5 GB of RAM (still less than
what's on the server).
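For anyone following along, the sizing above falls out of haproxy
keeping two buffers per connection. A quick sketch of the math, using
the 131072-byte tune.bufsize from this thread (the variable names are
mine, just for illustration):

```shell
# Two buffers per connection, so each connection costs 2 * tune.bufsize.
bufsize=131072
connections=10000
per_conn=$((2 * bufsize))                  # 262144 bytes = 256 KiB
total=$((per_conn * connections))          # total buffer memory in bytes
echo "$((per_conn / 1024)) KiB per connection"
echo "$((total / 1024 / 1024)) MiB total"  # 2500 MiB, i.e. ~2.5 GB
```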

One thought I have: our requests can be very large at times (big
headers, super huge cookies), so it may not be packet loss that the
bigger buffer is fixing, but rather a better ability to buffer our
large requests. That might explain why nginx wasn't showing this issue
whereas haproxy was.
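For the archives, this is roughly the setting under discussion; a
minimal sketch of the global section, with the 131072-byte value that
worked in this thread (not a general recommendation):

```
global
    # Larger buffers let a big request (huge headers/cookies) be
    # absorbed in fewer reads; value in bytes, from this thread.
    tune.bufsize 131072
```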

We don't have any HP servers or Broadcom NICs (all Intel). I too have
had a lot of issues in general with both HP and Broadcom, and chose
hardware for our LBs that didn't have those NICs.

Our switches are new but not super high quality (Netgears); it's
possible they are not performing as well as we would like. I'll have to
do some more tests on them.

I'm working on creating a more production-like lab where I can test a
number of different aspects of the LB to see what else I can do in
terms of performance. I will make heavy use of halog -srv along with
other tools to measure performance and to see if I can track down any
issues in our current H/W setup.

Thanks for all the help,

Matt C

On Thu, Jun 9, 2011 at 10:20 PM, Willy Tarreau <[email protected]> wrote:
> On Thu, Jun 09, 2011 at 04:04:26PM -0700, Matt Christiansen wrote:
>> I added in tune.bufsize 65536 and right away things got better. I
>> doubled that to 131072 and all of the outliers went away. Set at
>> that, my tests show haproxy is faster than nginx on 95% of responses
>> and on par with nginx for the last 5%, which is fine with me =).
>
> Nice, at least we have a good indication of what may be wrong. I'm
> pretty sure you're seeing a significant packet loss rate.
>
>> What is the negative to setting this as high as that? If it's just
>> RAM usage, all of our LBs have 16GB of RAM (don't ask why), so if
>> that's all, I don't think having it that high will be an issue.
>
> Yes it's just an impact on RAM. There are two buffers per connection,
> so each connection consumes 256kB of RAM in your case. If you do that
> times 2000 concurrent connections, that's 512MB, which is still small
> compared to what is present in the machine :-)
>
> However, you should *really* try to spot what is causing the issue,
> because right now you're just hiding it under the carpet, and it's not
> completely hidden as retransmits still take some time to be sent.
>
> Many people have encountered the same problem with Broadcom NetXtreme2
> network cards, which was particularly pronounced on those shipped with
> many HP machines (firmware 1.9.6). The issue was a huge Tx drop rate
> (which is not reported in netstat). A tcpdump on the machine and another
> one on the next hop can show that some outgoing packets never reach their
> destination.
>
> It is also possible that a piece of equipment is dying (eg: a switch
> port) and that the issue will get worse with time.
>
> You should pass "halog -srv" on your logs which exhibit the varying
> times. It will output the average connection times and response times
> per server. If you see that all servers are affected, you'll conclude
> that the issue is closer to haproxy. If you see that just a group of
> servers is affected, you'll conclude that the issue only lies around
> them (maybe you'll identify a few older servers too).
>
> Regards,
> Willy
>
>
