Sounds like a promising step forward.  I’d certainly like to know when the blog 
posts are up. 

 

Kenneth Brotman

 

From: Matt Stump [mailto:mrevilgn...@gmail.com] 
Sent: Friday, February 22, 2019 7:56 AM
To: user
Subject: Re: Looking for feedback on automated root-cause system

 

For some reason responses to the thread didn't hit my work email; I didn't see 
them until I checked from my personal account. 

 

The way the system works is that we install a collector that pulls a bunch 
of metrics from each node and ships them up to our NOC every minute. We've got 
a set of stream processors that take this data and do a bunch of things with 
it. Some dumb ones check for common misconfigurations, bugs, etc., and also 
populate dashboards and a couple of minimal graphs. The more intelligent agents 
look at the metrics and start generating calculated/scaled metrics and events. 
If one of these crosses a threshold, we kick off the ML that classifies the 
root cause from the stored data and points you to the correct knowledge base 
article with remediation steps. Because we've got the cluster history, we can 
identify an SLA breach in about a minute. The goal is to get you from zero to 
resolution as quickly as possible. 
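
To make that flow concrete, here is a minimal sketch of how a per-minute 
pipeline like the one described above could be wired together. All of the 
names, metrics, and thresholds here (KB_ARTICLES, pending_per_read, the 0.5 
cutoff, model.classify) are illustrative assumptions, not the actual collector 
or NOC code.

from dataclasses import dataclass

# Hypothetical knowledge-base mapping from a root-cause label to its article.
KB_ARTICLES = {
    "blocked_flush_writers": "https://kb.example.com/blocked-flush-writers",
    "wide_partition_reads": "https://kb.example.com/wide-partitions",
}

@dataclass
class Finding:
    root_cause: str
    kb_article: str
    confidence: float

def static_checks(metrics):
    """The 'dumb' processors: flag common misconfigurations outright."""
    hits = []
    if metrics.get("swap_used_bytes", 0) > 0:
        hits.append("swap enabled")
    if metrics.get("heap_max_mb", 8192) < 4096:
        hits.append("undersized heap")
    return hits

def derived(metrics):
    """Calculated/scaled metrics the more intelligent agents generate."""
    reads = max(metrics.get("reads_per_s", 0), 1)
    return {**metrics, "pending_per_read": metrics.get("pending_tasks", 0) / reads}

def handle_minute(metrics, history, model):
    """Process one minute of metrics shipped by a node's collector."""
    hits = static_checks(metrics)              # cheap checks run every minute
    scaled = derived(metrics)
    history.append(scaled)                     # cluster history backs the classification
    finding = None
    if scaled["pending_per_read"] > 0.5:       # an event crossed a threshold
        label, conf = model.classify(history)  # assumed ML interface: returns (label, confidence)
        finding = Finding(label, KB_ARTICLES.get(label, "unknown"), conf)
    return hits, finding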

 

We're looking for feedback on the existing system: do these events make sense? 
Do I need to beef up a knowledge base article? Did it classify correctly? Is 
there some big bug that everyone is running into that needs to be publicized? 
We're also looking for where to go next: which models are going to make your 
life easier?

 

The system works for C*, Elastic and Kafka. We'll be doing some blog posts 
explaining in more detail how it works and some of the interesting things we've 
found. For example: everything everyone thought they knew about Cassandra thread 
pool tuning is wrong, nobody really knows how to tune Kafka for large messages, 
and there are major issues with the Kubernetes charts that people are using.

 

On Tue, Feb 19, 2019 at 4:40 PM Kenneth Brotman <kenbrot...@yahoo.com.invalid> 
wrote:

Any information you can share on the inputs it needs/uses would be helpful.

 

Kenneth Brotman

 

From: daemeon reiydelle [mailto:daeme...@gmail.com] 
Sent: Tuesday, February 19, 2019 4:27 PM
To: user
Subject: Re: Looking for feedback on automated root-cause system

 

Welcome to the world of testing predictive analytics. I will pass this on to my 
folks at Accenture; I know of a couple of C* clients we run and am wondering 
what you had in mind.

 

 

Daemeon C.M. Reiydelle

email: daeme...@gmail.com

San Francisco 1.415.501.0198/London 44 020 8144 9872/Skype daemeon.c.mreiydelle

 

 

On Tue, Feb 19, 2019 at 3:35 PM Matthew Stump <mst...@vorstella.com> wrote:

Howdy,

I’ve been engaged in the Cassandra user community for a long time, almost 8 
years, and have worked on hundreds of Cassandra deployments. One of the things 
I’ve noticed in myself and a lot of my peers who have done consulting, support, 
or worked on really big deployments is that we get burnt out. We fight a lot of 
the same fires over and over again, and don’t get to work on new or interesting 
stuff. Also, what we do is really hard to transfer to other people because it’s 
based on experience. 

Over the past year my team and I have been working to close that gap, creating 
an assistant that’s able to scale some of this knowledge. We’ve got it to the 
point where it’s able to classify known root causes for an outage or an SLA 
breach in Cassandra with an accuracy greater than 90%. It can accurately 
diagnose bugs, data-modeling issues, or misuse of certain features, and when it 
does, it gives you specific remediation steps with links to knowledge base 
articles. 
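
As a side note on what a figure like “greater than 90%” typically means, a 
rough, hypothetical sketch of measuring it against labeled historical incidents 
might look like the following. The classifier interface and incident format are 
assumptions for illustration, not the actual evaluation code.

def accuracy(classifier, labeled_incidents):
    """Fraction of labeled incidents classified correctly.

    `labeled_incidents` is an iterable of (metric_history, true_label) pairs,
    and `classifier.classify(history)` is an assumed interface returning
    (predicted_label, confidence). Labels might be categories like
    "known_bug", "data_modeling_issue", or "feature_misuse".
    """
    correct, total = 0, 0
    for history, true_label in labeled_incidents:
        predicted, _confidence = classifier.classify(history)
        correct += int(predicted == true_label)
        total += 1
    return correct / total if total else 0.0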

 

We think we’ve seeded our database with enough root causes that it’ll catch the 
vast majority of issues, but there is always the possibility that we’ll run into 
something previously unknown, like CASSANDRA-11170 (one of the issues our system 
found in the wild).

We’re looking for feedback and would like to know if anyone is interested in 
giving the product a trial. The process would be a collaboration, where we both 
get to learn from each other and improve how we’re doing things.

Thanks,
Matt Stump
