Re: Interesting elastic/ZK post

Chris Nauroth Mon, 09 May 2016 10:12:47 -0700

I always sympathize with a major outage report, but on the bright side, it
was very satisfying to hear the ZooKeeper cluster had sustained uptime for
3 years.  That agrees with my own user experience.  It's often the most
stable component of a distributed infrastructure (as it needs to be).

As for potential improvements, I was wondering if it would make sense
to introduce something like Hadoop's JvmPauseMonitor [1].  This is a
background thread that attempts to detect GC churn and log warnings about
it.  This has been very helpful in diagnosing NameNode misconfigurations
that lead to GC churn.

This wouldn't have prevented the problem for the Elastic Cloud team, but at
least it would have made the root cause more visible.  A warning about GC
churn could have been shown in the main ZooKeeper log instead of a
separate GC log or inferring it from other sources like JMX.
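
To make the idea concrete, here is a minimal sketch of the general
technique: a daemon thread sleeps for a short fixed interval and measures
how much longer than requested the sleep actually took. A large overshoot
suggests the whole JVM (or host) was stalled, for example by a long GC.
This is not Hadoop's code and not an existing ZooKeeper API; the class
name PauseMonitorSketch, the method names, and the 500 ms / 1 s / 10 s
values are all assumptions chosen for illustration.

```java
import java.util.concurrent.TimeUnit;

/** Illustrative sketch only; names and thresholds are hypothetical. */
public class PauseMonitorSketch implements Runnable {
    // Assumed values for illustration, not taken from any real config.
    static final long SLEEP_MS = 500;
    static final long INFO_THRESHOLD_MS = 1_000;
    static final long WARN_THRESHOLD_MS = 10_000;

    private volatile boolean running = true;

    /** Milliseconds by which the sleep overshot the requested time (never negative). */
    static long extraSleepMillis(long elapsedNanos, long requestedSleepMs) {
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(elapsedNanos);
        return Math.max(0L, elapsedMs - requestedSleepMs);
    }

    /** Maps an overshoot to a log level; "NONE" means nothing is logged. */
    static String classify(long extraMs) {
        if (extraMs > WARN_THRESHOLD_MS) return "WARN";
        if (extraMs > INFO_THRESHOLD_MS) return "INFO";
        return "NONE";
    }

    @Override
    public void run() {
        while (running) {
            long start = System.nanoTime();
            try {
                Thread.sleep(SLEEP_MS);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and exit the monitor loop.
                Thread.currentThread().interrupt();
                return;
            }
            long extraMs = extraSleepMillis(System.nanoTime() - start, SLEEP_MS);
            String level = classify(extraMs);
            if (!level.equals("NONE")) {
                // A real server would write this to its main log instead of stderr.
                System.err.println(level + ": detected pause in JVM or host of about "
                        + extraMs + " ms");
            }
        }
    }

    public void stop() {
        running = false;
    }

    public static void main(String[] args) throws InterruptedException {
        PauseMonitorSketch monitor = new PauseMonitorSketch();
        Thread t = new Thread(monitor, "pause-monitor-sketch");
        t.setDaemon(true);
        t.start();
        Thread.sleep(1_200);   // let the monitor run a couple of iterations
        monitor.stop();
        t.interrupt();
        t.join();
    }
}
```

Because the check only measures sleep overshoot, it cannot tell a GC pause
from any other stall (swapping, a frozen VM), so a fuller version would
probably also report GC counts and times from the JVM's management beans
alongside the warning.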

[1] https://s.apache.org/4sdx

--Chris Nauroth

On 5/8/16, 7:37 PM, "Patrick Hunt" <[email protected]> wrote:

>Interesting root cause and mitigations discussion.
>
>https://www.elastic.co/blog/elastic-cloud-outage-april-2016
>
>Patrick
