Hi Samuele,

On Mon, Apr 23, 2012 at 03:31:53PM +0200, Samuele Giovanni Tonon wrote:
> > These checks are extremely long for a 302 ! On local networks, it's common
> > to see "0ms" because the total check is below 1ms.
>
> yes i agree, i tried setting up an http mode directly on an apache
> and those check were max 26 ms.

Still extremely large !

> >> then we started seeing problems: with the "old" architecture ( with
> >> just iptables forwarding to one varnish ) we were running
> >> webpagetest.org at 7 seconds first view 1.5 repeated view.
> >>
> >> with haproxy we went to 15 - 20 seconds for first view 7-8 seconds
> >> repeated views.
> >
> > These times are quite long in my opinion. How many objects do you have
> > on your page ? This looks like TCP retransmits.
>
> hmmm i'm not familiar with what you could mean with tcp retransmit how
> could i see on the system, with just an netstat ?

Yes, "netstat -s" will report this. You need to check "uptime" too when
running "netstat -s", because some 32-bit counters wrap: counters
accumulated over 1 day do not mean the same thing as counters accumulated
over 300 days.

> >> At the moment we are a bit stuck as we can't understand what is wrong:
> >> dns is fine, network is fine, we don't see high load average or memory
> >> exhaustion... everything seems ok; btw we are using vmware vm machines,
> >> we even switched to e1000 ethernet cards to avoid some problems with
> >> vmxnet .
> >
> > Wait a minute, you're saying the most important thing at the last moment !
> > Are there other VMs on the same hypervisor ? It's very common to see huge
> > network latency degradation when this is the case (this should not be as
> > bad as you're observing unless your VMs are saturated of course). Also,
> > when you say e1000 cards, you mean that you're using physical NICs from
> > the VM or that you're using the e1000 emulation ? Also, when you do your
> > tests, what does the CPU load on the vm look like, and is there additional
> > traffic on this VM ? could you run "vmstat 1 20" during the test ?
>
> well i guess i need to give you some more information: the whole
> infrastructure is under vmware;

Reminds me of a total site failure a few years ago...

> i don't know if they are all under the
> same hypervisor but i'm not sure the MAIN cause it's cpu context
> switching due to the fact that by just redirecting the port (that is
> what pulse does) we didn't see those numbers, however i'll try to see
> on which hypervisor they are and try to split them to avoid cpu context
> switching .

OK then you absolutely need to ensure that the machines where you run your
components are not overloaded. People who deploy vmware everywhere tend to
consider that racking a brand new machine is just a matter of clicking,
without any consideration for shared CPU resources. When you have more
virtual CPUs than real ones, some VMs will necessarily have to wait their
turn to get the CPU. The problem is how long they have to wait.

> as for the rest, e1000 is as emulation nic on vm guest;

OK, it's clearly not the fastest one, and it significantly increases CPU
usage since it has to emulate the hardware. Vmxnet is much more efficient.

> do you want vmstat on just the haproxy server or on the varnish too ?

Please run it on both machines.

> > BTW, I don't know if you pinned your VMs to physical CPUs, but one of the
> > worst things to do is to have two VMs sharing the same CPU and communicating
> > together. This creates high latencies and context switching rates because
> > only one at a time can work on the CPU. This should be observable in the
> > network capture.
>
> unfortunately i'm working on a large environment, where infrastructure
> is not done by me and i have no idea if they are pinned or not; i'll
> try to investigate .

OK. Keep in mind that in virtual environments, behaviours are very
different between idle systems and loaded systems. The virtualization
overhead seems very low when you're doing almost nothing, but it's common
to find that some workloads cannot scale simply because of extra latencies
between some guests. And unfortunately these are the hardest issues to
track down because the wasted time is not accounted for anywhere.

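As a rough illustration of the retransmit check discussed above
("netstat -s" read together with "uptime"), here is a minimal Python
sketch. It assumes a Linux guest where /proc/net/snmp exposes the OutSegs
and RetransSegs counters. The function names are made up for this example,
and the 32-bit wrap handling is an assumption about counter width, not
something verified on your kernel.

```python
def parse_tcp_counters(snmp_text):
    """Return the Tcp counters from /proc/net/snmp-style text as a dict."""
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Tcp:")]
    if len(rows) < 2:
        raise ValueError("expected a Tcp header line and a Tcp value line")
    names, values = rows[0][1:], rows[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def counter_delta(before, after, bits=32):
    """Difference between two samples of a counter that may wrap.

    Assumption: the counter wrapped at most once between the two samples.
    """
    return (after - before) % (1 << bits)


def retransmit_summary(snmp_text, uptime_seconds):
    """Relate retransmitted segments to sent segments and to uptime."""
    tcp = parse_tcp_counters(snmp_text)
    out_segs = tcp["OutSegs"]
    retrans_segs = tcp["RetransSegs"]
    percent = 100.0 * retrans_segs / out_segs if out_segs else 0.0
    return {
        "out_segs": out_segs,
        "retrans_segs": retrans_segs,
        "retrans_percent": percent,
        # Counters are cumulative since boot, hence the uptime context.
        "uptime_days": uptime_seconds / 86400.0,
    }


def read_live_summary():
    """Read the counters and the uptime from /proc (Linux only)."""
    with open("/proc/net/snmp") as snmp_file:
        snmp_text = snmp_file.read()
    with open("/proc/uptime") as uptime_file:
        uptime_seconds = float(uptime_file.read().split()[0])
    return retransmit_summary(snmp_text, uptime_seconds)
```

Taking two snapshots a few seconds apart during a test and comparing them
with counter_delta() avoids mixing in months of history. Running
read_live_summary() on both the haproxy and the varnish guests gives the
since-boot view.
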
Regards,
Willy

