Makes sense to me to add it. Could someone create a ZK JIRA? It sounds like a great starter project for someone interested in getting started with ZK. 3.5+ adds Jetty support for accessing metrics, which sounds like it would dovetail nicely.

Patrick

On Mon, May 9, 2016 at 10:12 AM, Chris Nauroth <[email protected]> wrote:

> I always sympathize with a major outage report, but on the bright side, it
> was very satisfying to hear the ZooKeeper cluster had sustained uptime for
> 3 years. That agrees with my own user experience. It's often the most
> stable component of a distributed infrastructure (as it needs to be).
>
> As far as potential improvements, I was wondering if it would make sense
> to introduce something like Hadoop's JvmPauseMonitor [1]. This is a
> background thread that attempts to detect GC churn and log warnings about
> it. This has been very helpful in diagnosing NameNode misconfigurations
> that lead to GC churn.
>
> This wouldn't have prevented a problem for the Elastic Cloud team, but at
> least it would have made the root cause more visible. A warning about GC
> churn could have been shown in the main ZooKeeper log instead of a
> separate GC log or inferring it from other sources like JMX.
>
> [1] https://s.apache.org/4sdx
>
> --Chris Nauroth
>
>
> On 5/8/16, 7:37 PM, "Patrick Hunt" <[email protected]> wrote:
>
> >Interesting root cause and mitigations discussion.
> >
> >https://www.elastic.co/blog/elastic-cloud-outage-april-2016
> >
> >Patrick
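
For anyone picking this up, the background-thread idea described in the quoted message could be sketched roughly as follows. This is only a minimal illustration of the general sleep-overrun technique, not Hadoop's JvmPauseMonitor code and not an existing ZooKeeper API. The class name PauseMonitorSketch, its methods, and the interval and threshold parameters are all hypothetical. The thread sleeps for a fixed interval and treats any large overrun as a sign that the JVM was stalled, which may be GC or some other pause; the overrun alone does not say which.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongConsumer;

/** Hypothetical sketch of a pause monitor; not Hadoop's or ZooKeeper's code. */
final class PauseMonitorSketch implements Runnable {
    private final long sleepMillis;          // how long each probe sleeps (assumed value chosen by caller)
    private final long warnThresholdMillis;  // overrun beyond which we report a pause
    private final LongConsumer onPause;      // callback receiving the overrun in milliseconds
    private volatile boolean running = true;

    PauseMonitorSketch(long sleepMillis, long warnThresholdMillis, LongConsumer onPause) {
        this.sleepMillis = sleepMillis;
        this.warnThresholdMillis = warnThresholdMillis;
        this.onPause = onPause;
    }

    /** Time spent beyond the requested sleep, in milliseconds, never negative. */
    static long extraPauseMillis(long startNanos, long endNanos, long sleepMillis) {
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
        return Math.max(0L, elapsedMillis - sleepMillis);
    }

    @Override
    public void run() {
        while (running) {
            long start = System.nanoTime();
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and exit the loop.
                Thread.currentThread().interrupt();
                return;
            }
            long extra = extraPauseMillis(start, System.nanoTime(), sleepMillis);
            if (extra > warnThresholdMillis) {
                onPause.accept(extra);
            }
        }
    }

    /** Asks the loop to exit after the current sleep finishes. */
    void stop() {
        running = false;
    }

    /** Starts the monitor on a daemon thread so it never blocks JVM shutdown. */
    static Thread startDaemon(PauseMonitorSketch monitor) {
        Thread t = new Thread(monitor, "pause-monitor-sketch");
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```

In a server, the callback would presumably write to the main log, for example `new PauseMonitorSketch(500, 1000, ms -> LOG.warn("Detected pause of approximately {} ms", ms))` with `LOG` being whatever logger the server already uses; the 500 ms and 1000 ms values here are placeholders, not recommendations.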
