Hey Willy,

On 2019-07-09 08:09, Willy Tarreau wrote:
What's your CPU like between the peaks ? 1%, 10%, 50% ? Just to get a rough
estimate of whether it's something reaching a critical point or if it's
something doing its mess alone in its corner.

In between the spikes it's about 7% System, 11% User, 6% Softirq, 76% Idle. Bandwidth is then about 500Mbit/s, mostly outbound.

What I didn't notice before, but just saw while staring at my graphs, is that I get more incoming traffic during the CPU spikes. The sequence is: I'm doing about 500Mbit/s, then the incoming traffic rises to about 100Mbit/s (probably an HTTP POST), the CPU spikes, total traffic drops to about 200Mbit/s, and everything starts getting slow.

I had HAProxy running on physical hardware with an E5-2407 and a 1Gbit NIC. Now it is running as a VM on an E5-2650 with a 10Gbit NIC, with the same issues.

Are you using threads ? I'm asking because I'm currently working on an
issue which I found could cause exactly this behaviour. I'm fairly certain
we've met it in the past without being able to attribute it to exactly
this.

Yes, I'm using threads.

If you're using threads, attaching gdb to the process and issuing "info
threads" will tell us where they are. If many of them are in
fd_update_events() or fd_may_recv(), you're likely on the one I've been
working on.
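
A minimal sketch of how that thread list could be captured and summarised in one go. It assumes gdb is installed and allowed to attach to the process; the helper names (gdb_info_threads, count_suspects) are made up for illustration, and the match is a plain substring search on each line of gdb's output, not a real backtrace parser.

```python
import subprocess
from collections import Counter

# The two functions named in the message above; other frames are ignored.
SUSPECTS = ("fd_update_events", "fd_may_recv")

def count_suspects(info_threads_output):
    """Count thread lines that mention one of the suspect functions."""
    counts = Counter()
    for line in info_threads_output.splitlines():
        for name in SUSPECTS:
            if name in line:
                counts[name] += 1
    return counts

def gdb_info_threads(pid):
    """Attach gdb to `pid` in batch mode, run "info threads", then detach."""
    result = subprocess.run(
        ["gdb", "-p", str(pid), "-batch", "-ex", "info threads"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout

# Usage (the PID is a placeholder, use the real haproxy PID):
#   out = gdb_info_threads(1234)
#   print(out)
#   print(dict(count_suspects(out)))
```

Attaching gdb pauses the process while the command runs, so this is best done briefly during a spike.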

Other possibilities (due to the regularity of your observation) are :
  - timeouts (check in your conf if a 10s timeout appears somewhere,
    maybe it triggers and is improperly caught)

I have the following timeouts in defaults:
        timeout client          60s
        timeout connect         10s
        timeout http-keep-alive 4s
        timeout http-request    15s
        timeout queue           30s
        timeout server          60s
        timeout tarpit          120s

Looking at the spikes again, it's more like 20 seconds up, 20 seconds down. But that probably has more to do with the POST taking that long.
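
To double-check which timeouts line up with a given period, here is a small sketch that scans the timeout lines and compares them against a period in seconds. The function names are made up, the parser only understands simple "timeout <name> <value>" lines (no trailing comments, no per-section handling), and the unit table reflects my understanding of HAProxy's time suffixes, with a bare number meaning milliseconds.

```python
import re

# HAProxy time suffixes expressed in milliseconds; no suffix means ms.
UNIT_MS = {"us": 0.001, "ms": 1, "s": 1000, "m": 60_000,
           "h": 3_600_000, "d": 86_400_000}

TIMEOUT_RE = re.compile(r"^\s*timeout\s+(\S+)\s+(\d+)(us|ms|s|m|h|d)?\s*$")

def timeouts_ms(config_text):
    """Return {timeout name: value in ms} for each simple 'timeout' line."""
    found = {}
    for line in config_text.splitlines():
        m = TIMEOUT_RE.match(line)
        if m:
            unit = m.group(3) or "ms"
            found[m.group(1)] = int(m.group(2)) * UNIT_MS[unit]
    return found

def matching(config_text, period_s):
    """Names of timeouts equal to `period_s` seconds."""
    return sorted(name for name, ms in timeouts_ms(config_text).items()
                  if ms == period_s * 1000)

# The defaults quoted above.
CONF = """
        timeout client          60s
        timeout connect         10s
        timeout http-keep-alive 4s
        timeout http-request    15s
        timeout queue           30s
        timeout server          60s
        timeout tarpit          120s
"""

print(matching(CONF, 10))  # ['connect']
print(matching(CONF, 20))  # []
```

So only the connect timeout sits at 10s, and nothing in defaults matches a 20s period.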

  - health checks (maybe you have 10s checks, or 2s checks with 4
    retries or I don't know what, which causes a special event to
    occur after 10s)

Checks run every 2s, with a rise of 3 and a fall of 3.

In any case you're clearly facing a bug, but it's always difficult to
tell.

It could be useful to issue "show activity" twice 1 second apart when
this happens, and maybe even "show fd" and "show sess all" if you don't
have too many connections.

Right, I will do the above steps. But since this only happens on Mondays, we have to wait a bit ;-)
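
For the "show activity" part, a rough sketch of taking two samples one second apart over the stats socket and printing what changed. The socket path is an assumption (use whatever the "stats socket" line in the global section says), the helper names are made up, and the parser simply collects every integer after the colon on each line, since the exact layout of "show activity" differs between versions.

```python
import re
import socket
import time

def cli_command(sock_path, command):
    """Send one command to the HAProxy stats socket and return the reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        s.sendall(command.encode() + b"\n")
        chunks = []
        while True:
            data = s.recv(65536)
            if not data:  # HAProxy closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode(errors="replace")

def parse_counters(text):
    """Map each 'name: ...' line to the integers found after the colon."""
    counters = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep:
            counters[name.strip()] = [int(n) for n in re.findall(r"\d+", rest)]
    return counters

def deltas(before, after):
    """Element-wise difference for counters with the same shape in both samples."""
    out = {}
    for name, old in before.items():
        new = after.get(name)
        if new is not None and len(new) == len(old):
            out[name] = [b - a for a, b in zip(old, new)]
    return out

def sample_activity(sock_path="/var/run/haproxy.sock", interval=1.0):
    """Take two 'show activity' samples `interval` seconds apart and diff them."""
    first = parse_counters(cli_command(sock_path, "show activity"))
    time.sleep(interval)
    second = parse_counters(cli_command(sock_path, "show activity"))
    return deltas(first, second)

# Usage (path is a placeholder):
#   for name, diff in sample_activity("/var/run/haproxy.sock").items():
#       print(name, diff)
```

The same cli_command helper can fetch "show fd" and "show sess all" for the raw dumps.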

Regards,

Sander
