On Fri, 13 Jan 2017 at 22:15, Chris Jones <[email protected]> wrote:

> General question/survey:
>
> Those that have larger clusters, how are you doing alerting/monitoring?
> Meaning, do you trigger off of 'HEALTH_WARN', etc? Not really talking about
> collectd related but more on initial alerts of an issue or potential issue?
> What threshold do you use basically? Just trying to get a pulse of what
> others are doing.
>
> Thanks in advance.
>
> --
> Best Regards,
> Chris Jones
> Bloomberg
>
> Hi,
>
> We monitor for 'low IOPS'. The threshold differs from cluster to cluster.
> For example, if we see only 3000 IOPS, something is going wrong.
>
> Another good check is the S3 API. We try to read an object through the S3
> API every 30 seconds.
>
> We also have many checks, such as more than 10% of OSDs down, inactive PGs,
> degraded cluster capacity, and similar. Some of these checks are not
> critical, so for those we only get emails.
>
> One more important thing is disk latency monitoring. We've had huge
> slowdowns on our cluster when the journal SSDs wore out. It's quite
> hard to understand what's going on, because all OSDs are up and running
> but the cluster is not performing at all.
>
> Network errors on interfaces can also be important. We had some issues when
> a physical cable was malfunctioning and the cluster had many blocked requests.
>
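
The threshold checks described above (more than 10% of OSDs down, inactive
PGs, low IOPS) could be sketched roughly as follows. This is only an
illustration: ClusterSnapshot and evaluate are hypothetical names, the
snapshot is assumed to be filled in by whatever collector you already run,
and the default 3000 IOPS floor is just the example figure from the message,
which would need tuning per cluster.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterSnapshot:
    """Hypothetical, pre-parsed cluster metrics (not a Ceph API type)."""
    osds_total: int
    osds_up: int
    pgs_inactive: int
    client_iops: float


def evaluate(snapshot: ClusterSnapshot,
             min_iops: float = 3000.0,
             max_osd_down_ratio: float = 0.10) -> List[str]:
    """Return the names of the checks that fire for this snapshot."""
    alerts: List[str] = []

    # Fire when more than max_osd_down_ratio of all OSDs are down.
    if snapshot.osds_total > 0:
        down = snapshot.osds_total - snapshot.osds_up
        if down / snapshot.osds_total > max_osd_down_ratio:
            alerts.append("osds_down")

    # Any inactive placement group is reported.
    if snapshot.pgs_inactive > 0:
        alerts.append("pg_inactive")

    # Client IOPS at or below the per-cluster floor is treated as suspicious.
    if snapshot.client_iops <= min_iops:
        alerts.append("low_iops")

    return alerts
```

A caller could route the returned names by severity, for example paging on
"osds_down" while only emailing for the less critical ones.
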
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
