Hi All

Just to confirm Willy's theory: we had the hang at exactly the time specified this morning.
Sadly due to a bank holiday yesterday in the UK, we didn't set up the truss and monitoring before the hang occurred.

Was the hang seen by everyone?

Thanks
Dave

On 6 April 2017 at 14:56, Mark S <mark.staudin...@nyi.net> wrote:

> On Mon, 03 Apr 2017 12:45:57 -0400, Dave Cottlehuber <d...@skunkwerks.at>
> wrote:
>
> On Mon, 13 Mar 2017, at 13:31, David King wrote:
>>
>>> Hi All
>>>
>>> Apologies for the delay in response, i've been out of the country for the
>>> last week
>>>
>>> Mark, my gut feeling is that is network related in someway, so thought we
>>> could compare the networking setup of our systems
>>>
>>> You mentioned you see the hang across geo locations, so i assume there
>>> isn't layer 2 connectivity between all of the hosts? is there any back
>>> end connectivity between the haproxy hosts?
>>>
>>
>> Following up on this, some interesting points but nothing useful.
>>
>> - Mark & I see the hang at almost exactly the same time on the same day:
>> 2017-02-27T14:36Z give or take a minute either way
>>
>> - I see the hang in 3 different regions using 2 different hosting
>> providers on both clustered and non-clustered services, but all on
>> FreeBSD 11.0R amd64. There is some dependency between these systems but
>> nothing unusual (logging backends, reverse proxied services etc).
>>
>> - our servers don't have a specific workload that would allow them all
>> to run out of some internal resource at the same time, as their reboot
>> and patch cycles are reasonably different - typically a few days elapse
>> between first patches and last reboots unless its deemed high risk
>>
>> - our networking setup is not complex but typical FreeBSD:
>> - LACP bonded Gbit igb(4) NICs
>> - CARP failover for both ipv4 & ipv6 addresses
>> - either direct to haproxy for http & TLS traffic, or via spiped to
>> decrypt intra-server traffic
>> - haproxy directs traffic into jailed services
>> - our overall load and throughput is low but consistent
>> - pf firewall
>> - rsyslog for logging, along with riemann and graphite for metrics
>> - all our db traffic (couchdb, kyoto tycoon) and rabbitmq go via haproxy
>> - haproxy 1.6.10 + libressl at the time
>>
>> As I'm not one for conspiracy theories or weird coincidences, somebody
>> port scanning the internet with an Unexpectedly Evil Packet Combo seems
>> the most plausible explanation. I cannot find an alternative that would
>> fit the scenario of 3 different organisations with geographically
>> distributed equipment and unconnected services reporting an unusual
>> interruption on the same day and almost the same time.
>>
>> Since then, I've moved to FreeBSD 11.0p8, haproxy 1.7.3 and latest
>> libressl and seen no recurrence, just like the last 8+ months or so
>> since first deploying haproxy on FreeBSD instead of debian & nginx.
>>
>> If the issue recurs I plan to run a small cyclic traffic capture with
>> tcpdump and wait for a re-repeat, see
>> https://superuser.com/questions/286062/practical-tcpdump-examples
>>
>> Let me know if I can help or clarify further.
>>
>> A+
>> Dave
>>
>
> Hi Dave,
>
> Thanks for keeping this thread going. As for the initial report with all
> servers hanging, I too run NTP (actually OpenNTPd), and these only speak to
> in-house stratum-2 servers.
>
> As a follow-up to my initial report, I upgraded to 1.7.3 shortly
> thereafter.
>
> I've had one re-occurrence of this "hang" but this time, it did not affect
> all of my servers, instead, it affected only 2 (the busier ones). If the
> theory about some timing event ( leap second, counter wrapping, etc.) is
> correct, perhaps it only affects processes actually accepting or handling a
> connection in a particular state at the time.
>
> I have not yet upgraded beyond 1.7.3.
>
> Best,
> -=Mark
>
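
For anyone wanting to have truss and monitoring in place ahead of time, here is a rough sketch of how the "counter wrapping" idea Mark mentions could be turned into a date to watch. It assumes, purely for illustration, an unsigned 32-bit counter of milliseconds that wrapped at the 2017-02-27T14:36Z hang; the thread does not say which counter, if any, is involved, and the function name next_wrap is made up for this example.

```python
from datetime import datetime, timedelta, timezone


def next_wrap(observed: datetime, bits: int = 32) -> datetime:
    """Return when a millisecond counter of the given width wraps again.

    Assumption (not confirmed in this thread): the counter counts
    milliseconds, is `bits` wide, and wrapped at `observed`.
    """
    period = timedelta(milliseconds=2 ** bits)  # 2**32 ms is about 49.7 days
    return observed + period


if __name__ == "__main__":
    # Hang time reported earlier in the thread, give or take a minute.
    first_hang = datetime(2017, 2, 27, 14, 36, tzinfo=timezone.utc)
    print("candidate next wrap:", next_wrap(first_hang).isoformat())
```

If a different counter width or tick unit turns out to be the relevant one, only the period calculation needs to change.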