We're using HAProxy in several different scenarios and recently upgraded
from 1.8 to 2.0 (2.0.3 at the moment) to benefit from all its nice improvements.
However, we've also noticed some strange behaviour we don't understand yet.
After upgrading, we see an increased number of "Connection reset by peer"
error logs on the application servers (Java 8 applications) that HAProxy is
sending the traffic to.

We're currently running four instances of HAProxy in parallel, each on its
own VM. The configuration is identical on all of them. There is one TLS
frontend, which also uses client certificates. Clients are not browsers, but
network devices sending telemetry data at fixed intervals over Thrift in
HTTP, so client behaviour is very uniform and predictable. Clients try to
use keep-alive to keep the overhead to a minimum. Spread over all
instances, we serve around 1700 req/s.
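
For reference, the rough shape of that frontend is sketched below. The
names, cert paths, and bind port here are placeholders of my own, not the
actual values; the real ones are in the attached haproxy.cfg:

```
# Sketch only -- names and paths are placeholders, see attached config
frontend telemetry_in
    bind :443 ssl crt /etc/haproxy/server.pem ca-file /etc/haproxy/client-ca.pem verify required
    mode http
    option http-server-close
    default_backend CONTROL_BACKEND
```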

After upgrading from 1.8.20 to 2.0.2, we saw an increase in "Connection
reset by peer" errors on the application servers, going up from around 12
errors per hour to around 300 per hour, but couldn't see any corresponding
change in the HAProxy metrics we scrape from the stats socket and store
using the standalone Prometheus exporter and Prometheus.

We captured and analyzed a tcpdump from one of the application servers and
noticed TCP RSTs often (but not always) being sent from HAProxy to the
backend server in response to the server's FIN/ACK at the end of the
response (we're using option http-server-close). However, with the RST
packet, HAProxy still ACKs the last seq of the response, so it looks like
nothing is actually lost. On connections where HAProxy initiated the
connection close, everything looked fine.
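
In case anyone wants to look at the same thing: the capture was a plain
tcpdump on the application server, then filtered for resets afterwards,
roughly like this (interface and port are placeholders for our setup):

```shell
# Capture traffic between HAProxy and the app server (eth0/8080 are placeholders)
tcpdump -i eth0 -w haproxy.pcap port 8080
# Afterwards, show only segments with the RST flag set
tcpdump -r haproxy.pcap 'tcp[tcpflags] & tcp-rst != 0'
```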

In an attempt to mitigate this, we tried changing to option
http-pretend-keepalive. This, however, led to all servers reaching their
maxconn, HAProxy starting to queue, and all instances capping at 100% CPU.
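
For completeness, the mitigation attempt was just swapping the close mode
in the config (shown here out of context; the actual section placement is
in the attached haproxy.cfg):

```
# Before (our normal setting):
#   option http-server-close
# Mitigation attempt that led to the maxconn/CPU problem:
    option http-pretend-keepalive
```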

A day later, HAProxy 2.0.3 was released, and we upgraded to it, thinking
"mux-h1: Trim excess server data ..." might have something to do with this,
but it actually hasn't changed anything on that front. We haven't tried
option http-pretend-keepalive again, though.

We also have another scenario with automated clients where the clients do
not use keep-alive. We're seeing the same increase in connection resets
there as well. In that scenario, however, changing to option
http-pretend-keepalive didn't change anything (as in: it didn't trigger the
failure described above, but it also had no impact on the connection resets).

I've attached the config from our test environment, which is identical to
our production config except for hostnames and IP addresses.
CONTROL_BACKEND is the high-volume backend at about 1500 req/s;
MONITORING_BACKEND is about 200 req/s (summed over all 4 instances).

If you have any ideas on how we could tackle this, where we could dig
further, or if you see anything wrong in our config, please let me know.

Kind regards,
  Julian Poschmann

(See attached file: haproxy.cfg)

LANCOM Systems GmbH
Adenauerstr. 20 / B2
52146 Würselen

Tel: +49 2405 49936-0
Fax: +49 2405 49936-99

Web: https://www.lancom-systems.de

Geschäftsführer: Ralf Koenzen, Stefan Herrlich
Sitz der Gesellschaft: Aachen, Amtsgericht Aachen, HRB 16976
