On Mon, 03 Apr 2017 12:45:57 -0400, Dave Cottlehuber <d...@skunkwerks.at> wrote:

On Mon, 13 Mar 2017, at 13:31, David King wrote:
Hi All

Apologies for the delay in response, I've been out of the country for the
last week.

Mark, my gut feeling is that this is network related in some way, so I thought we
could compare the networking setup of our systems.

You mentioned you see the hang across geo locations, so I assume there
isn't layer 2 connectivity between all of the hosts? Is there any back-end
connectivity between the haproxy hosts?

Following up on this, some interesting points but nothing useful.

- Mark & I see the hang at almost exactly the same time on the same day:
2017-02-27T14:36Z give or take a minute either way

- I see the hang in 3 different regions using 2 different hosting
providers on both clustered and non-clustered services, but all on
FreeBSD 11.0R amd64. There is some dependency between these systems but
nothing unusual (logging backends, reverse proxied services etc).

- our servers don't have a specific workload that would allow them all
to run out of some internal resource at the same time, as their reboot
and patch cycles are reasonably different - typically a few days elapse
between first patches and last reboots unless it's deemed high risk

- our networking setup is not complex but typical FreeBSD:
    - LACP bonded Gbit igb(4) NICs
    - CARP failover for both ipv4 & ipv6 addresses
    - either direct to haproxy for http & TLS traffic, or via spiped to
    decrypt intra-server traffic
    - haproxy directs traffic into jailed services
- our overall load and throughput is low but consistent
- pf firewall
- rsyslog for logging, along with riemann and graphite for metrics
- all our db traffic (couchdb, kyoto tycoon) and rabbitmq go via haproxy
- haproxy 1.6.10 + libressl at the time

As I'm not one for conspiracy theories or weird coincidences, somebody
port scanning the internet with an Unexpectedly Evil Packet Combo seems
the most plausible explanation.  I cannot find an alternative that would
fit the scenario of 3 different organisations with geographically
distributed equipment and unconnected services reporting an unusual
interruption on the same day and almost the same time.

Since then, I've moved to FreeBSD 11.0p8, haproxy 1.7.3 and latest
libressl and seen no recurrence, just like the last 8+ months or so
since first deploying haproxy on FreeBSD instead of Debian & nginx.

If the issue recurs I plan to run a small cyclic traffic capture with
tcpdump and wait for it to happen again, see
https://superuser.com/questions/286062/practical-tcpdump-examples
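
For anyone wanting to set up the same kind of cyclic capture, here is a
rough sketch of how the tcpdump command line could be assembled. It relies
on tcpdump's -C (rotate after N million bytes) and -W (cap the number of
files, then overwrite the oldest) options. The interface name lagg0, the
output path and the port filter are placeholders I made up for
illustration, not values from our setup.

```python
import shlex


def build_ring_capture_cmd(interface, out_path, file_mb=100, file_count=10,
                           snaplen=128, bpf_filter=""):
    """Return a tcpdump argv that writes a fixed-size ring of capture files."""
    if file_mb < 1 or file_count < 1:
        raise ValueError("file_mb and file_count must be positive")
    cmd = [
        "tcpdump",
        "-n",                   # do not resolve addresses to names
        "-i", interface,        # capture interface (hypothetical name below)
        "-s", str(snaplen),     # bytes kept per packet, headers are usually enough
        "-C", str(file_mb),     # start a new file after about file_mb million bytes
        "-W", str(file_count),  # keep at most file_count files, then reuse the oldest
        "-w", out_path,         # base name of the capture files
    ]
    if bpf_filter:
        # The filter expression is passed as trailing arguments.
        cmd.extend(shlex.split(bpf_filter))
    return cmd


if __name__ == "__main__":
    # Placeholder interface, path and filter; adjust for the real host.
    argv = build_ring_capture_cmd("lagg0", "/var/tmp/haproxy-ring.pcap",
                                  bpf_filter="tcp port 443")
    print(shlex.join(argv))
```

With the defaults above the ring tops out at roughly 1 GB on disk, so it
can be left running until the hang shows up again and then stopped by hand.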

Let me know if I can help or clarify further.

A+
Dave

Hi Dave,

Thanks for keeping this thread going. As for the initial report with all servers hanging, I too run NTP (actually OpenNTPd), and these only speak to in-house stratum-2 servers.

As a follow-up to my initial report, I upgraded to 1.7.3 shortly thereafter.

I've had one recurrence of this "hang", but this time it did not affect all of my servers; it affected only 2 (the busier ones). If the theory about some timing event (leap second, counter wrapping, etc.) is correct, perhaps it only affects processes actually accepting or handling a connection in a particular state at the time.
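
One way to sanity-check the counter-wrapping idea is to see how far a given wall-clock time is from a wrap point of a millisecond counter. The sketch below assumes a hypothetical 32-bit counter holding Unix epoch milliseconds, with 2^31 as the signed wrap point and 2^32 as the unsigned one. Whether haproxy or the kernel actually keeps such a counter is not established in this thread; the function names are mine.

```python
from datetime import datetime, timezone

WRAP = 2 ** 32  # an unsigned 32-bit millisecond counter wraps about every 49.7 days


def ms_counter_at(when):
    """Value of a hypothetical 32-bit counter of Unix epoch milliseconds."""
    # Integer arithmetic only, to avoid float rounding on large values.
    epoch_ms = int(when.timestamp()) * 1000 + when.microsecond // 1000
    return epoch_ms % WRAP


def ms_to_nearest_boundary(when):
    """Distance in ms to the nearest unsigned (0, 2^32) or signed (2^31) wrap point."""
    value = ms_counter_at(when)
    return min(abs(value - b) for b in (0, WRAP // 2, WRAP))


if __name__ == "__main__":
    # Time of the first hang as reported earlier in the thread.
    hang = datetime(2017, 2, 27, 14, 36, tzinfo=timezone.utc)
    print("counter value:", ms_counter_at(hang))
    print("ms to nearest wrap point:", ms_to_nearest_boundary(hang))
```

A small distance would be consistent with a wrap-related event hitting every host at the same wall-clock moment regardless of uptime; a large one would argue against this particular counter.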

I have not yet upgraded beyond 1.7.3.

Best,
-=Mark
