Hi All

Apologies for the delay in response; I've been out of the country for the
last week.

Mark, my gut feeling is that it is network-related in some way, so I
thought we could compare the networking setup of our systems.

You mentioned you see the hang across geo locations, so I assume there
isn't layer 2 connectivity between all of the hosts? Is there any back-end
connectivity between the haproxy hosts?

Ours are all layer 2 but fairly complex. We have 6 connected NICs, bonded
into 3 LACP groups. On top of the LACP we have a number of VLAN
interfaces, and on top of those a couple of normal IP aliases and a
number of CARP IPs.
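For concreteness, the layering is roughly of this shape in rc.conf (this is
an illustrative sketch only; the interface names, VLAN ID, password and
addresses are placeholders, not our real config):

```sh
# /etc/rc.conf fragment -- minimal sketch of the LACP/VLAN/CARP layering
# described above; igb0/igb1, vlan 100 and all addresses are placeholders
ifconfig_igb0="up"
ifconfig_igb1="up"
cloned_interfaces="lagg0 vlan100"
# two NICs bonded into one LACP group
ifconfig_lagg0="laggproto lacp laggport igb0 laggport igb1"
# VLAN interface on top of the lagg, with its primary address
ifconfig_vlan100="vlan 100 vlandev lagg0 inet 192.0.2.10/24"
# a normal IP alias, plus a CARP VIP, on top of the VLAN interface
ifconfig_vlan100_alias0="inet 192.0.2.11/32"
ifconfig_vlan100_alias1="inet vhid 1 pass examplepass alias 192.0.2.100/32"
```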

One commonality is NTP, as they all sync from our own upstream NTP
services, but having looked through the logs, there isn't a recent NTP
update when the hang occurs and I can't see any time jump.
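In case it helps you run the same check: ntpd records a "time reset" entry
whenever it steps the clock rather than slewing it, so the absence of such
lines around a hang rules out a visible jump. A rough sketch (the log line
and path here are illustrative):

```sh
# ntpd logs "time reset" when it steps the clock; grepping syslog around
# the hang time should show nothing if no jump occurred.
# Hypothetical log line for illustration:
logline='Mar  6 04:12:33 lb1 ntpd[811]: time reset +2.718281 s'
printf '%s\n' "$logline" | grep -c 'time reset'
# in practice, something like: grep 'time reset' /var/log/messages
```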

Other things which are set up on the hosts:
local rsyslog, which sends logs to a centralised host
a cron every minute for each jail (4 jails) to monitor the health of the
haproxy service
a cron every minute for each jail (4 jails) to gather stats from haproxy
using the haproxy stats frontend
pf running on the host
Chef runs every 30 mins, and these times are splayed
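The per-jail crons are roughly of this shape (the script names, jail name
and stats URL are placeholders for illustration, not our actual paths):

```sh
# illustrative crontab fragment; one pair of entries per jail
* * * * * /usr/local/bin/haproxy_health_check.sh jail1
* * * * * fetch -qo - 'http://127.0.0.1:8404/haproxy?stats;csv' >> /var/log/haproxy_stats_jail1.csv
```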

Does anything match up on these which could cause these issues?

Thanks

Dave



On 6 March 2017 at 20:28, Mark S <[email protected]> wrote:

> On Mon, 06 Mar 2017 15:02:43 -0500, Willy Tarreau <[email protected]> wrote:
>
>> OK so that means that haproxy could have hung in a day or two, then your
>> case is much more common than one of the other reports. If your front LB
>> is fair between the 6 servers, that could be related to a total number of
>> requests or connections or something like this.
>>
>
> Another relevant point is that these servers are tied together using
> upstream, GeoIP-based DNS load balancing.  So the request rate across
> servers varies quite a bit depending on the location.  This would make a
> synchronized failure based on total requests less likely.
>
>> I'm thinking about other things:
>>   - if you're doing a lot of SSL we could imagine an issue with random
>>     generation using /dev/random instead of /dev/urandom. I've met this
>>     issue a long time ago on some apache servers where all the entropy
>>     was progressively consumed until it was not possible anymore to get
>>     a connection.
>>
>
> I'll set up a script to capture the netstat and other info prior to
> reloading should this issue re-occur.
>
> As for SSL, yes, we do a fair bit of SSL ( about 30% of total request
> count ) and HAProxy does the TLS termination and then hands off via TCP
> proxy.
>
> Best,
> -=Mark S.
>
