I always sympathize with a major outage report, but on the bright side, it was very satisfying to hear the ZooKeeper cluster had sustained uptime for 3 years. That agrees with my own user experience. It's often the most stable component of a distributed infrastructure (as it needs to be).
As for potential improvements, I was wondering if it would make sense to introduce something like Hadoop's JvmPauseMonitor [1]. This is a background thread that attempts to detect GC churn and logs warnings about it. It has been very helpful in diagnosing NameNode misconfigurations that lead to GC churn. It wouldn't have prevented the problem for the Elastic Cloud team, but at least it would have made the root cause more visible. A warning about GC churn could have been shown in the main ZooKeeper log, instead of having to look in a separate GC log or infer it from other sources like JMX.

[1] https://s.apache.org/4sdx

--Chris Nauroth

On 5/8/16, 7:37 PM, "Patrick Hunt" <[email protected]> wrote:

>Interesting root cause and mitigations discussion.
>
>https://www.elastic.co/blog/elastic-cloud-outage-april-2016
>
>Patrick
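
For illustration, the pause-detection idea behind a monitor of this kind can be sketched roughly as follows. This is a minimal sketch, not Hadoop's actual implementation and not an existing ZooKeeper class: the class name PauseMonitorSketch, the helper names, and the threshold values are all hypothetical. The thread sleeps for a fixed interval and checks how much longer than that interval it actually took to wake up; a large overshoot suggests the whole JVM was paused, for example by a long GC.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a JVM pause detector; names and thresholds are illustrative.
final class PauseMonitorSketch implements Runnable {
    static final long SLEEP_MS = 500;            // assumed polling interval
    static final long WARN_THRESHOLD_MS = 1_000; // assumed threshold, tune per deployment

    private volatile boolean running = true;

    // How much longer than the requested sleep the thread was actually away.
    static long extraPauseMs(long startNanos, long endNanos, long sleepMs) {
        return TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos) - sleepMs;
    }

    // Cumulative GC time across all collectors, as reported by the JVM.
    static long totalGcTimeMs() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    @Override
    public void run() {
        while (running) {
            long gcBefore = totalGcTimeMs();
            long start = System.nanoTime();
            try {
                Thread.sleep(SLEEP_MS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            long extra = extraPauseMs(start, System.nanoTime(), SLEEP_MS);
            if (extra > WARN_THRESHOLD_MS) {
                long gcDelta = totalGcTimeMs() - gcBefore;
                // A real server would use its normal logger so this lands in the main log.
                System.err.println("WARN: detected pause of ~" + extra
                        + " ms (GC time in that window: " + gcDelta + " ms)");
            }
        }
    }

    void stop() {
        running = false;
    }
}
```

Such a monitor would typically be started as a daemon thread at server startup, e.g. `Thread t = new Thread(new PauseMonitorSketch(), "pause-monitor"); t.setDaemon(true); t.start();`.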
