Denny,

I should have mentioned this as well. Any Ceph cluster-wide checks I am doing with Icinga are only applied to my 3 mon/mgr nodes. They would definitely be annoying if they were on all OSD nodes. Having the checks on all of the mons allows me to not lose monitoring ability should one go down.

The ceph mgr dashboard is only enabled on the mgr daemons. I'm not familiar with the mimic dashboard yet, but it is much more advanced than luminous' dashboard and may have some alerting abilities built in.

With your PCI DSS restrictions a VM monitoring node may work well. I'd set up the VM with ceph-common, the conf and a restricted keyring, then have icinga2 run an NRPE check on it that calls check_ceph, ceph -s, or whatever.
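The NRPE check on such a VM could be a small wrapper along these lines — a rough sketch, not a finished plugin; the client name `icinga`, the keyring path, and the JSON field names are assumptions based on a Luminous-era `ceph -s --format json` output:

```python
#!/usr/bin/env python3
# Sketch of an NRPE-style Ceph health check for a monitoring VM.
# Assumptions: ceph-common is installed, ceph.conf is present, and a
# restricted client.icinga keyring lives at the path below.
import json
import subprocess

# Standard Nagios/Icinga plugin exit codes
EXIT = {"HEALTH_OK": 0, "HEALTH_WARN": 1, "HEALTH_ERR": 2}

def ceph_health(raw_json):
    """Map the health status from `ceph -s --format json` to a plugin result."""
    status = json.loads(raw_json)
    health = status["health"]["status"]   # e.g. "HEALTH_WARN" on Luminous+
    return EXIT.get(health, 3), health    # 3 = UNKNOWN for anything unexpected

def run_check():
    """Query the cluster with the restricted keyring and print a plugin line."""
    out = subprocess.check_output(
        ["ceph", "-s", "--format", "json", "--id", "icinga",
         "--keyring", "/etc/ceph/ceph.client.icinga.keyring"])
    code, health = ceph_health(out)
    print("CEPH " + health)
    return code

# The NRPE command definition would then invoke run_check() and exit
# with its return value as the plugin status.
```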

Kevin

On 06/19/2018 04:13 PM, Denny Fuchs wrote:
hi,

Am 19.06.2018 um 17:17 schrieb Kevin Hrpcek <kevin.hrp...@ssec.wisc.edu>:

# ceph auth get client.icinga
exported keyring for client.icinga
[client.icinga]
     key = <nope>
     caps mgr = "allow r"
     caps mon = "allow r"
that's the point: it's OK to check that all processes are up and running, plus maybe some checks for the disks. But imagine you check the "health" state: the state is the same on all OSD nodes, because ... it's a cluster. So if you run "ceph osd set noout" on one node, you get a warning for every OSD node (check_ceph_health). The same goes for every check that monitors a cluster-wide setting, like df or lost OSDs (70 in of 72 ...). Most checks also have performance data (which can be disabled), which is saved in a database.
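One way around the duplicated cluster-wide alarms is to assign those checks only to the mon nodes, as Kevin does. A hypothetical Icinga2 apply rule (the `ceph_role` host variable is an assumption, not something from this thread):

```
apply Service "ceph-health" {
  check_command = "check_ceph_health"
  // only on the mon/mgr nodes, so "noout" raises one warning,
  // not one per OSD host
  assign where host.vars.ceph_role == "mon"
}
```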
The same goes for Telegraf(*): every node transmits the same data (because the cluster data is the same on all nodes).

I've also taken a look at the Ceph mgr dashboard (for a few minutes), which I would have to enable on all(?) OSD nodes, and then build a construct to reach the dashboard on the active mgr.

I don't believe that I'm the first person to think about a dedicated VM that is used only for monitoring tools (Icinga / Zabbix / Nagios / Dashboard / ceph -s) and to get the overall status (and performance data) from it. The only thing I need to keep on the OSD nodes directly is the OSD (I/O) disk and network monitoring, but thanks to InfluxDB ... I can put them on one dashboard :-)

@Kevin nice work :-) Because of PCI DSS, the Icinga2 master can't reach the Ceph cluster directly, so we have a satellite / agent construct to get the checks executed.

cu denny

ps. One bad thing: Telegraf can't read the /var/run/ceph/*sock files because of the permissions set after the OSD services start (https://github.com/influxdata/telegraf/issues/1657). This was fixed, but I haven't checked whether the patch is also included in the Proxmox Ceph packages.
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
