On Fri, 13 Jan 2017 at 22:15, Chris Jones <[email protected]> wrote:
> General question/survey:
>
> Those that have larger clusters, how are you doing alerting/monitoring?
> Meaning, do you trigger off of 'HEALTH_WARN', etc? Not really talking about
> collectd related but more on initial alerts of an issue or potential issue?
> What threshold do you use basically? Just trying to get a pulse of what
> others are doing.
>
> Thanks in advance.
>
> --
> Best Regards,
> Chris Jones
> Bloomberg

Hi,

We monitor for low IOPS. The threshold differs between our clusters. For
example, if we see only 3000 IOPS, something is wrong.

Another good check is the S3 API. We try to read an object through the S3
API every 30 seconds.

We also have many checks such as more than 10% of OSDs down, inactive PGs,
degraded cluster capacity, and similar. Some of these checks are not
critical, and for those we only get emails.

One more important thing is disk latency monitoring. We've had huge
slowdowns on our cluster when the journal SSDs wore out. It's quite hard to
understand what's going on, because all OSDs are up and running, but the
cluster is not performing at all.

Network errors on interfaces can also be important. We had some issues when
a physical cable was malfunctioning and the cluster had many blocked
requests.
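
To make the threshold checks above concrete, here is a minimal sketch of how they could be evaluated. It is only an illustration under assumptions: the `ClusterSample` fields, the `evaluate` function and the check names are hypothetical, and the code does not show how the numbers are collected from the cluster. The 3000 IOPS and 10% values are the examples from this mail and should be tuned per cluster.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterSample:
    """One monitoring sample. Field names are hypothetical, for illustration."""
    client_iops: float   # current client IOPS on the cluster
    osds_total: int      # number of OSDs in the cluster
    osds_up: int         # number of OSDs currently up
    pgs_inactive: int    # number of placement groups that are not active


def evaluate(sample: ClusterSample,
             min_iops: float = 3000.0,
             max_osd_down_fraction: float = 0.10) -> List[str]:
    """Return the names of the checks that fire for this sample."""
    fired = []

    # Low IOPS: at or below the per-cluster floor means something is wrong.
    if sample.client_iops <= min_iops:
        fired.append("low_iops")

    # More than the allowed fraction of OSDs down.
    if sample.osds_total > 0:
        down = sample.osds_total - sample.osds_up
        if down / sample.osds_total > max_osd_down_fraction:
            fired.append("osds_down")

    # Any inactive placement group.
    if sample.pgs_inactive > 0:
        fired.append("pgs_inactive")

    return fired


if __name__ == "__main__":
    sample = ClusterSample(client_iops=2500, osds_total=100,
                           osds_up=85, pgs_inactive=0)
    print(evaluate(sample))
```

Which of the fired checks should page someone and which should only send an email is a per-site decision, so the sketch leaves severity out.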
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
