Hi guys, I have been using Ganglia-like software to monitor and troubleshoot a cluster of about 100 machines. But as the cluster has grown, I have found it increasingly difficult to identify abnormal metrics, abnormal machines, or the bottleneck in the current system.
So far, we have considered adding some features for RRD viewing, such as getting the top N, sorting machines by a metric, or grouping metrics to see their distribution. We have no prior experience with Chukwa, and I am wondering whether there are any templates for metrics processing in Chukwa (such as sorting, histograms, or machine/rack group distributions). If you have a better idea for viewing these metrics, would you mind sharing it?
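To make the kind of processing I mean concrete, here is a rough sketch in plain Python. It does not use any Chukwa or Ganglia API; the host names, rack names, and load values are made up for illustration, and the function names (top_n, histogram, mean_by_rack) are my own placeholders.

```python
import heapq
from collections import defaultdict

# Hypothetical samples: (host, rack, cpu_load). Real values would come
# from whatever store holds the metrics (RRD files, Chukwa output, etc.).
samples = [
    ("node01", "rack-a", 0.42),
    ("node02", "rack-a", 0.91),
    ("node03", "rack-a", 0.15),
    ("node04", "rack-b", 0.88),
    ("node05", "rack-b", 0.30),
]

def top_n(samples, n):
    """Return the n samples with the highest metric value, largest first."""
    return heapq.nlargest(n, samples, key=lambda s: s[2])

def histogram(samples, bucket_width):
    """Count samples per bucket; bucket k covers [k*width, (k+1)*width)."""
    counts = defaultdict(int)
    for _host, _rack, value in samples:
        counts[int(value // bucket_width)] += 1
    return dict(sorted(counts.items()))

def mean_by_rack(samples):
    """Average the metric over all hosts in each rack."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _host, rack, value in samples:
        totals[rack] += value
        counts[rack] += 1
    return {rack: totals[rack] / counts[rack] for rack in totals}

if __name__ == "__main__":
    print(top_n(samples, 2))
    print(histogram(samples, 0.25))
    print(mean_by_rack(samples))
```

Something along these lines, but running over the full metric set and feeding a view, is what I am hoping a template could provide.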
