We don't currently monitor that, but my todo list has an item to add a critical alert for requests that have been blocked for longer than 500 seconds. You can see how long they've been blocked from `ceph health detail`. Our cluster doesn't need to be super fast at any given point, but it does need to be progressing.
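A minimal sketch of such a check. The "N ops are blocked > X sec on osd.Y" line format is assumed from Jewel-era `ceph health detail` output, and the canned sample text and 500-second threshold below are illustrative only; adjust both for your release and your tolerance:

```shell
#!/bin/sh
# Hedged sketch: report "critical" when any request has been blocked
# longer than a threshold, by parsing `ceph health detail` output.
threshold=500

# In production, replace this canned sample with: detail=$(ceph health detail)
detail='HEALTH_WARN 30 requests are blocked > 32 sec; 1 osds have slow requests
3 ops are blocked > 524.288 sec on osd.12
27 ops are blocked > 65.536 sec on osd.12'

# Pull the number after ">" from each "ops are blocked > X sec" line
# and keep the largest.
worst=$(printf '%s\n' "$detail" \
    | awk '/ops are blocked > /{ for (i = 1; i <= NF; i++) if ($i == ">") print $(i + 1) }' \
    | sort -rn | head -n1)

# awk handles the floating-point comparison that [ -ge ] cannot.
if awk -v w="${worst:-0}" -v t="$threshold" 'BEGIN { exit !(w >= t) }'; then
    status=critical
else
    status=ok
fi
echo "$status"
```

With the sample above, the longest block is 524.288 seconds, which crosses the 500-second line and reports critical.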
________________________________
David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation <https://storagecraft.com>
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943
________________________________
If you are not the intended recipient of this message or received it erroneously, please notify the sender and delete it, together with any attachments, and be advised that any dissemination or copying of this message is prohibited.
________________________________

From: Chris Jones [[email protected]]
Sent: Friday, January 13, 2017 1:31 PM
To: David Turner
Cc: [email protected]
Subject: Re: [ceph-users] Ceph Monitoring

Thanks. What about 'NN ops > 32 sec' (blocked ops) type alerts? Does anyone monitor for those, and if so, what criteria do you use? Thanks again!

On Fri, Jan 13, 2017 at 3:28 PM, David Turner <[email protected]> wrote:

We don't use many critical alerts (the kind that will have our NOC wake up an engineer), but the main one we do have is a check that tells us if there are 2 or more hosts with osds that are down. We have clusters with 60 servers in them, so having a single osd die and backfill off isn't something to wake up for in the middle of the night, but having osds down on 2 servers is 1 osd away from data loss. A quick reference for how to do this check in bash is below.
hosts_with_down_osds=$(ceph osd tree | grep 'host\|down' | grep -B1 down | grep host | wc -l)
if [ "$hosts_with_down_osds" -ge 2 ]; then
    echo critical
elif [ "$hosts_with_down_osds" -eq 1 ]; then
    echo warning
elif [ "$hosts_with_down_osds" -eq 0 ]; then
    echo ok
else
    echo unknown
fi

________________________________

From: ceph-users [[email protected]] on behalf of Chris Jones [[email protected]]
Sent: Friday, January 13, 2017 1:15 PM
To: [email protected]
Subject: [ceph-users] Ceph Monitoring

General question/survey: Those of you with larger clusters, how are you doing alerting/monitoring? Meaning, do you trigger off of 'HEALTH_WARN', etc.? I'm not really asking about collectd-style metrics, but about initial alerts of an issue or potential issue. Basically, what threshold do you use? Just trying to get a pulse of what others are doing. Thanks in advance.

--
Best Regards,
Chris Jones
Bloomberg

--
Best Regards,
Chris Jones
[email protected]
(p) 770.655.0770
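For anyone wiring David's host-down check into Nagios/NRPE, the ok/warning/critical strings map naturally onto the standard plugin exit codes (0/1/2). A sketch of that mapping, run against canned `ceph osd tree` output — the sample tree below is made up for illustration:

```shell
#!/bin/sh
# Sketch: the same host-counting pipeline as the check above, fed canned
# `ceph osd tree` output (illustrative only), mapped to Nagios exit codes.
osd_tree='ID WEIGHT TYPE NAME     UP/DOWN
-2 2.0    host node1
 0 1.0        osd.0  up
 1 1.0        osd.1  down
-3 2.0    host node2
 2 1.0        osd.2  down
 3 1.0        osd.3  up'

# In production, feed the pipeline from: ceph osd tree | grep 'host\|down' | ...
hosts_with_down_osds=$(printf '%s\n' "$osd_tree" \
    | grep 'host\|down' | grep -B1 down | grep host | wc -l)
hosts_with_down_osds=$((hosts_with_down_osds))  # strip any wc padding

case "$hosts_with_down_osds" in
    0) status=ok;       code=0 ;;
    1) status=warning;  code=1 ;;
    *) status=critical; code=2 ;;
esac
echo "$status (hosts with down osds: $hosts_with_down_osds)"
# A real NRPE plugin would finish with: exit "$code"
```

The sample tree has down osds on both node1 and node2, so the check counts 2 hosts and reports critical, matching the "1 osd away from data loss" rationale above.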
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
