Hi All

Just to confirm Willy's theory: we had the hang at exactly the time
specified this morning.

Sadly, due to a bank holiday in the UK yesterday, we didn't set up truss
and monitoring before the hang occurred.

Was the hang seen by everyone?

Thanks

Dave

On 6 April 2017 at 14:56, Mark S <mark.staudin...@nyi.net> wrote:

> On Mon, 03 Apr 2017 12:45:57 -0400, Dave Cottlehuber <d...@skunkwerks.at>
> wrote:
>
> On Mon, 13 Mar 2017, at 13:31, David King wrote:
>>
>>> Hi All
>>>
>>> Apologies for the delay in response; I've been out of the country for the
>>> last week.
>>>
>>> Mark, my gut feeling is that it is network-related in some way, so I
>>> thought we could compare the networking setup of our systems.
>>>
>>> You mentioned you see the hang across geo locations, so I assume there
>>> isn't layer-2 connectivity between all of the hosts? Is there any
>>> back-end
>>> connectivity between the haproxy hosts?
>>>
>>
>> Following up on this, some interesting points but nothing useful.
>>
>> - Mark & I see the hang at almost exactly the same time on the same day:
>> 2017-02-27T14:36Z give or take a minute either way
>>
>> - I see the hang in 3 different regions using 2 different hosting
>> providers on both clustered and non-clustered services, but all on
>> FreeBSD 11.0R amd64. There is some dependency between these systems but
>> nothing unusual (logging backends, reverse proxied services etc).
>>
>> - our servers don't have a specific workload that would allow them all
>> to run out of some internal resource at the same time, as their reboot
>> and patch cycles are reasonably different - typically a few days elapse
>> between first patches and last reboots unless it's deemed high risk
>>
>> - our networking setup is not complex but typical FreeBSD:
>>     - LACP bonded Gbit igb(4) NICs
>>     - CARP failover for both ipv4 & ipv6 addresses
>>     - either direct to haproxy for http & TLS traffic, or via spiped to
>>     decrypt intra-server traffic
>>     - haproxy directs traffic into jailed services
>> - our overall load and throughput is low but consistent
>> - pf firewall
>> - rsyslog for logging, along with riemann and graphite for metrics
>> - all our db traffic (couchdb, kyoto tycoon) and rabbitmq go via haproxy
>> - haproxy 1.6.10 + libressl at the time
>>
>> As I'm not one for conspiracy theories or weird coincidences, somebody
>> port scanning the internet with an Unexpectedly Evil Packet Combo seems
>> the most plausible explanation.  I cannot find an alternative that would
>> fit the scenario of 3 different organisations with geographically
>> distributed equipment and unconnected services reporting an unusual
>> interruption on the same day and almost the same time.
>>
>> Since then, I've moved to FreeBSD 11.0p8, haproxy 1.7.3 and the latest
>> libressl, and have seen no recurrence, just like the 8+ months or so
>> since first deploying haproxy on FreeBSD instead of Debian & nginx.
>>
>> If the issue recurs, I plan to run a small cyclic traffic capture with
>> tcpdump and wait for it to happen again; see
>> https://superuser.com/questions/286062/practical-tcpdump-examples
>>
>> Let me know if I can help or clarify further.
>>
>> A+
>> Dave
>>
>
> Hi Dave,
>
> Thanks for keeping this thread going.  As for the initial report with all
> servers hanging, I too run NTP (actually OpenNTPd), and these only speak to
> in-house stratum-2 servers.
>
> As a follow-up to my initial report, I upgraded to 1.7.3 shortly
> thereafter.
>
> I've had one recurrence of this "hang", but this time it did not affect
> all of my servers; instead, it affected only two (the busier ones).  If the
> theory about some timing event (leap second, counter wrapping, etc.) is
> correct, perhaps it only affects processes actually accepting or handling a
> connection in a particular state at the time.
>
> I have not yet upgraded beyond 1.7.3.
>
> Best,
> -=Mark
>
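
For reference, the cyclic capture Dave plans can be sketched with tcpdump's ring-buffer options. This is only a sketch: the interface name (igb0), output path, and port filter below are assumptions, not details from the thread.

```shell
# Ring-buffer capture: rotate at roughly 100 MB per file (-C 100) and keep
# 10 files (-W 10), overwriting the oldest, so about 1 GB of recent traffic
# stays on disk. -s 0 captures full packets; -n disables name resolution.
tcpdump -i igb0 -s 0 -n -C 100 -W 10 \
    -w /var/tmp/haproxy-ring.pcap \
    'port 80 or port 443'
```

tcpdump numbers the rotated files haproxy-ring.pcap0 through haproxy-ring.pcap9; after a hang, the most recently modified file should hold the traffic leading up to it.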
