This is a very good point; I have been thinking about this for years. Node failures should be easy to monitor with OS-level services, but latency spikes are a totally different matter.
Measuring latency anomalies correctly is a very, very hard job. Just consider the skew introduced by programming mistakes, or by the hostile environments JVMs run in (clocks, OSes, VMs, ...). If anomalies are detected wrongly, alerts are either missing or false, and all of the effort leads only to annoyance or frustration.

Lately I read about Gil Tene's LatencyUtils

https://github.com/LatencyUtils/LatencyUtils
https://groups.google.com/forum/#!topic/mechanical-sympathy/oZSv5QnpAYs

which I find a promising tool for measuring anomalies in histograms. Some of this might be possible to implement as an ES plugin, but I haven't tried LatencyUtils yet, and how it could be connected to ES metrics is still an open question for me.

Jörg

On Thu, Mar 6, 2014 at 7:24 PM, T Vinod Gupta wrote:

> is there a plugin or api support for monitoring ES key metrics and
> alerting the dev ops about situations when some node in a cluster fails or
> there is a spike in latency due to whatever reason?
>
> what are the best practices here and what do people usually do?
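
As a rough illustration of the histogram-style check discussed in the reply, here is a minimal Python sketch that compares the 99th percentile of a recent window of latency samples against a longer baseline. It is not LatencyUtils and not an ES API: the class `LatencySpikeDetector`, the function `percentile`, and the parameters `baseline_size`, `recent_size`, and `factor` are all hypothetical names chosen for this example, and how the samples would be obtained from ES is left open.

```python
import math
from collections import deque


def percentile(samples, q):
    """Nearest-rank percentile of a non-empty sequence, with q in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


class LatencySpikeDetector:
    """Hypothetical helper: reports a spike when the recent p99 latency
    exceeds `factor` times the baseline p99 latency."""

    def __init__(self, baseline_size=1000, recent_size=100, factor=3.0):
        self.baseline = deque(maxlen=baseline_size)
        self.recent = deque(maxlen=recent_size)
        self.factor = factor

    def record(self, latency_ms):
        # When the recent window is full, its oldest sample moves
        # into the baseline before the new sample is appended.
        if len(self.recent) == self.recent.maxlen:
            self.baseline.append(self.recent[0])
        self.recent.append(latency_ms)

    def is_spike(self):
        # Refuse to judge until both windows hold enough samples.
        if len(self.recent) < self.recent.maxlen:
            return False
        if len(self.baseline) < self.recent.maxlen:
            return False
        recent_p99 = percentile(self.recent, 99)
        baseline_p99 = percentile(self.baseline, 99)
        return recent_p99 > self.factor * baseline_p99
```

A caller would invoke `record()` once per measured request and check `is_spike()` before raising an alert. The sketch has obvious limits: spike samples eventually move into the baseline and inflate it, and a stall during which no samples are recorded at all (a long GC pause, for example) goes unnoticed. That is exactly the kind of measurement skew that makes naive checks like this one prone to missing or false alerts.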
