Re: HAProxy Slows At 1500+ connections Really Need some help to figure out why

joris dedieu Sun, 04 Oct 2015 07:31:39 -0700

Hi,
Just a few Linux -> FreeBSD translations, as pfSense is FreeBSD-based.


2015-10-04 10:56 GMT+02:00 Willy Tarreau <[email protected]>:
> On Sat, Oct 03, 2015 at 12:55:33AM -0700, Daren Sefcik wrote:
>> > Is there some kernel messages
>> > Load, swap usage, disk space
>> >
>> again, according to my limited know how, top and other built in utilities
>> all report the system is barely doing anything and there is tons of memory
>> and disk space
>
> Just run "free" after a test and "vmstat 1 10" during a test.
>
>> > During stress :
>> > Is there more sys/interrupt than user cpu usage
>> > Link saturation
>> > Packet lost
>> >
>> I am not sure how to check this, I will try and figure this out but if you
>> have any advice that would be appreciated.
>> The LAN interface is a bonded interface with (3) 1000mb NIC cards so I am
>> doubtful it is being saturated from this simple apache bench test.
>> Here is what the Interfaces status shows me:
>>
>> *Status up*
>> MTU 1500
>> Media autoselect
>> LAGG Protocol lacp lagghash l2,l3,l4
>
> That's interesting. Keep in mind that different aggregation algorithms
> exist, and that hashing on l2+l3+l4 will spread different connections to
> different ports. As long as you have enough connections (which seems to
> be your case) your traffic should be evenly spread. But on low connections
> it can happen that you saturate one link without traffic on the other ones.
> So for now let's consider this not a problem.
>
>> LAGG Ports bge3 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
>> bge2 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
>> bge1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
>> In/out packets 248989670/305051696 (77.73 GB/88.68 GB)
>> In/out packets (pass) 248989670/305051696 (77.73 GB/88.68 GB)
>> In/out packets (block) 4130394/147 (4.75 GB/70 KB)
>> In/out errors 0/608
>> Collisions 0

sysctl net.link.lagg.lacp.debug=1 should provide some interesting information.

Broadcom NICs : you should check man 4 bge and
https://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards

>>
>> Suboptimal firewall rules : replay stress packet filter unloaded.
>> >
>> There are only two simple allow firewall rules for LAN access, nothing
>> complicated at all.
>
> No but very likely you're running with conntrack. If it's not properly
> tuned you can quickly end up with a conntrack table full. Please run
> "lsmod" to see the loaded modules, and "dmesg | grep -i conntrack" to
> see if any such message has appeared, as well as "dmesg | grep -i drop"
> to see if the kernel complained it was forced to drop anything. The
> best thing to try to be sure is to unload all firewall modules,
> especially conntrack.

I had a look at the pfSense kernel config. It seems that pf, pflog,
pfsync and all the netgraph and altq stuff are not built as loadable
modules (the output of kldstat should confirm that), so you can't
unload them. This is not ideal, as simply loading a module could enable
some features in the network stack. You can first test with pfctl -d to
disable pf (better to have console access when doing this). Also make
sure you don't have any QoS enabled.
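
On pf, the closest thing to a full conntrack table is a full state
table. A minimal sketch to compare the current number of states with
the configured limit follows. The helper name pf_state_usage is made
up for illustration, and it assumes that "pfctl -si" prints a "current
entries" line and that "pfctl -sm" prints a "states hard limit N"
line, so check the real output on your box first.

```shell
#!/bin/sh
# Sketch: report pf state-table usage as "current/limit (pct%)".
# Assumption: $1 holds the text of `pfctl -si`, $2 the text of `pfctl -sm`.
pf_state_usage() {
    # "  current entries   1500 ..." -> third field is the state count
    current=$(printf '%s\n' "$1" | awk '/current entries/ { print $3; exit }')
    # "states   hard limit   10000"  -> last field is the limit
    limit=$(printf '%s\n' "$2" | awk '$1 == "states" { print $NF; exit }')
    if [ -z "$current" ] || [ -z "$limit" ] || [ "$limit" -eq 0 ]; then
        echo "could not parse pfctl output" >&2
        return 1
    fi
    echo "$current/$limit ($((current * 100 / limit))%)"
}

# On the firewall itself (needs root):
#   pf_state_usage "$(pfctl -si)" "$(pfctl -sm)"
```

If the percentage gets close to 100 during the stress test, the state
limit is a likely suspect.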

Run grep kernel /var/log/messages (or check the dmesg output) to see
whether the kernel logged anything.

You could also watch how the TCP states evolve during the test (with a
Bourne shell):

clear; while : ; do netstat -anp tcp | awk '$6 ~ /^[A-Z]/ && $6 !~ /Foreign|LISTEN/ {print $6}' | sort | uniq -c | sort -g ; sleep 2 ; clear ; done

or with csh:

clear
while ( 1 == 1 )
    netstat -anp tcp | awk '$6 ~ /^[A-Z]/ && $6 !~ /Foreign|LISTEN/ {print $6}' | sort | uniq -c | sort -g ; sleep 2 ; clear
end
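
If you would rather keep a record than watch the screen, the same
filter can be wrapped in a small Bourne shell function and logged with
a timestamp. This is only a sketch: the function name tcp_state_counts
and the log path are made up for illustration, and it assumes the
state is the 6th column of the netstat output, as in the loops above.

```shell
#!/bin/sh
# tcp_state_counts: read `netstat -anp tcp` output on stdin and print one
# "STATE COUNT" line per TCP state (LISTEN excluded), sorted by state name.
tcp_state_counts() {
    awk '$6 ~ /^[A-Z]/ && $6 !~ /Foreign|LISTEN/ { n[$6]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Hypothetical usage during a test run: one timestamped snapshot every 2 s.
#   while : ; do
#       date '+%H:%M:%S'; netstat -anp tcp | tcp_state_counts
#       sleep 2
#   done >> /tmp/tcp_states.log
```

A steadily growing TIME_WAIT or SYN_SENT count in the log is easier to
spot afterwards than on a screen that is cleared every two seconds.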

Best regards
Joris
>
>> I am really stumped by this problem and am hoping you guys can help me get
>> this figured out. If there are any commands I can run to get info that
>> would be helpful please let me know.
>
> In general the situation you describe is observed in a few cases :
>   - too low file descriptor limits. A non-root user is limited to 1024
>     hence about 512 end-to-end connections, but I'm assuming you started
>     haproxy as a root user to get enough connections ;
>
>   - improperly tuned firewall : this is the most common case. Each end
>     to end connection uses two conntrack entries, one from the client
>     to the proxy and one from the proxy to the server. Connections remain
>     for some time after they are closed due to the TIME_WAIT state and add
>     to the count.
>
>   - communications in virtualized environments being limited by improper
>     configuration of the hypervisor. We've got a number of reports, some
>     even public on the list here where hosting providers were unable to
>     configure their hypervisor to stand at least the load of a single VM,
>     so packets were dropped by the hypervisor.
>
>   - bogus NIC firmware. We used to face this situation for a few years
>     about 5 years ago, some NICs (netxtreme 2 found on a lot of Proliant
>     servers) were losing up to 12% of the packets, so I let you imagine
>     how TCP performed... We haven't had such a report for the last 2 years
>     so I consider this issue fixed by now.
>
> HAProxy's logs are designed to find exactly what is happening and to dig
> into the problem. So you'll have to post some logs so that we can see if
> there are connection retries, long connection times, or maybe almost no
> request received (in case it blocks upfront).
>
> Hoping this helps,
> Willy
>
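
On the file descriptor point above: the "1024 hence about 512"
arithmetic can be checked quickly from a shell. This is a rough sketch
that assumes two descriptors per end-to-end connection (client side
plus server side) and ignores listeners and log sockets. The function
name end_to_end_capacity is made up for illustration. On FreeBSD the
system-wide caps are in sysctl kern.maxfiles and kern.maxfilesperproc.

```shell
#!/bin/sh
# end_to_end_capacity FD_LIMIT: rough upper bound on proxied connections,
# assuming each one consumes two file descriptors.
end_to_end_capacity() {
    echo $(( $1 / 2 ))
}

# Example with the default non-root limit quoted above:
#   end_to_end_capacity 1024
# With the limit of the current shell (which may differ from the limit
# haproxy actually runs with, and fails if it reports "unlimited"):
#   end_to_end_capacity "$(ulimit -n)"
```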
