We don't currently monitor that, but my todo list has an item to alert critically 
on requests that have been blocked for longer than 500 seconds.  You can see how 
long they've been blocked from `ceph health detail`.  Our cluster doesn't need 
to be super fast at any given point, but it does need to be progressing.
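
A rough sketch of what that check could look like, parsing the blocked-request 
ages out of `ceph health detail` text.  The sample line format below is an 
assumption based on typical output (it varies by Ceph release), and the 
`max_blocked_secs` helper name is made up for illustration; the 500-second 
threshold is the one mentioned above.

```shell
# Critical threshold from the todo item above, in seconds.
threshold=500

# Extract the largest "blocked ... > N sec" age from health text on stdin.
max_blocked_secs() {
  awk '/blocked/ {
    for (i = 1; i < NF; i++)
      if ($(i+1) == "sec" && $i + 0 > max)
        max = $i + 0
  } END { print max + 0 }'
}

# In production this would read: worst=$(ceph health detail | max_blocked_secs)
sample='HEALTH_WARN 2 requests are blocked > 32 sec
osd.12 has blocked requests > 524.288 sec
osd.7 has blocked requests > 131.072 sec'

worst=$(printf '%s\n' "$sample" | max_blocked_secs)
# awk handles the float comparison, which [ ] cannot do.
if awk -v w="$worst" -v t="$threshold" 'BEGIN { exit !(w > t) }'; then
  echo "critical: requests blocked ${worst}s (> ${threshold}s)"
else
  echo ok
fi
```

With the sample input above, the worst age is 524.288s, which crosses the 
500-second threshold and reports critical.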

________________________________

[cid:[email protected]]<https://storagecraft.com>       David 
Turner | Cloud Operations Engineer | StorageCraft Technology 
Corporation<https://storagecraft.com>
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943

________________________________

If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.

________________________________

________________________________
From: Chris Jones [[email protected]]
Sent: Friday, January 13, 2017 1:31 PM
To: David Turner
Cc: [email protected]
Subject: Re: [ceph-users] Ceph Monitoring

Thanks.

What about 'NN ops > 32 sec' (blocked ops) type alerts? Does anyone monitor for 
those, and if so, what criteria do you use?

Thanks again!

On Fri, Jan 13, 2017 at 3:28 PM, David Turner 
<[email protected]<mailto:[email protected]>> wrote:

We don't use many critical alerts (the kind that will have our NOC wake up an 
engineer), but the main one that we do have is a check that tells us if there 
are 2 or more hosts with osds that are down.  We have clusters with 60 servers 
in them, so having an osd die and backfilling off of it isn't something to wake 
up for in the middle of the night, but having osds down on 2 servers is 1 osd 
away from data loss.  A quick reference for how to do this check in bash is below.

# Count the hosts in `ceph osd tree` that have at least one osd marked down.
hosts_with_down_osds=$(ceph osd tree | grep 'host\|down' | grep -B1 down | grep -c host)
if [ "$hosts_with_down_osds" -ge 2 ]
then
    echo critical
elif [ "$hosts_with_down_osds" -eq 1 ]
then
    echo warning
elif [ "$hosts_with_down_osds" -eq 0 ]
then
    echo ok
else
    echo unknown
fi
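
Since this is the check that pages the NOC, it could also be wrapped as a 
monitoring-system plugin.  The thread doesn't name the monitoring system, so 
the Nagios-style exit codes below (0/1/2/3 = OK/WARNING/CRITICAL/UNKNOWN) and 
the `status_for_down_hosts` helper are assumptions for illustration; the count 
would come from the same `ceph osd tree` pipeline as above.

```shell
# Hypothetical Nagios-style wrapper: same thresholds as the check above, but
# with conventional plugin exit codes instead of just printing a status word.
status_for_down_hosts() {
  case "$1" in
    0)      echo "OK: no hosts with down osds";        return 0 ;;
    1)      echo "WARNING: 1 host with down osds";     return 1 ;;
    [0-9]*) echo "CRITICAL: $1 hosts with down osds";  return 2 ;;
    *)      echo "UNKNOWN: could not count hosts";     return 3 ;;
  esac
}

# In production: status_for_down_hosts "$hosts_with_down_osds"; exit $?
status_for_down_hosts 2
```

The unknown branch still catches the case where the `ceph` command fails and 
the count isn't a number.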


________________________________
From: ceph-users 
[[email protected]<mailto:[email protected]>] 
on behalf of Chris Jones [[email protected]<mailto:[email protected]>]
Sent: Friday, January 13, 2017 1:15 PM
To: [email protected]<mailto:[email protected]>
Subject: [ceph-users] Ceph Monitoring

General question/survey:

Those of you that have larger clusters, how are you doing alerting/monitoring? 
Meaning, do you trigger off of 'HEALTH_WARN', etc.? I'm not really talking 
about collectd-related metrics, but more about initial alerts of an issue or 
potential issue. Basically, what thresholds do you use? Just trying to get a 
pulse of what others are doing.

Thanks in advance.

--
Best Regards,
Chris Jones
Bloomberg

--
Best Regards,
Chris Jones

[email protected]<mailto:[email protected]>
(p) 770.655.0770

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
