Denny,
I should have mentioned this as well. Any Ceph cluster-wide checks I am
doing with Icinga are applied only to my 3 mon/mgr nodes. They would
definitely be annoying if they were on all OSD nodes. Having the checks on
all of the mons means I don't lose monitoring ability should one go down.
The ceph mgr dashboard is only enabled on the mgr daemons. I'm not
familiar with the mimic dashboard yet, but it is much more advanced than
luminous' dashboard and may have some alerting abilities built in.
With your PCI DSS restrictions a VM monitoring node may work well. I'd
set up the VM with ceph-common, the conf and a restricted keyring then
have icinga2 run an NRPE check on it that calls check_ceph, ceph -s,
or whatever suits.
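A rough sketch of that VM setup (hostnames, the client name, and the plugin path are examples, not from a real deployment; adjust to your distro):

```shell
# On the monitoring VM: install the Ceph client bits
apt-get install -y ceph-common        # or: yum install ceph-common

# Copy ceph.conf and a restricted keyring from a mon node
# (mon1 and the client name "icinga" are placeholders)
scp mon1:/etc/ceph/ceph.conf /etc/ceph/ceph.conf
scp mon1:/etc/ceph/ceph.client.icinga.keyring /etc/ceph/

# Verify the restricted key can read cluster status
ceph -s --id icinga

# Then point an NRPE command at a check plugin, e.g. in
# /etc/nagios/nrpe.d/ceph.cfg (example path):
#   command[check_ceph_health]=/usr/lib/nagios/plugins/check_ceph_health --id icinga
```

With this, the Icinga2 master only ever talks to the VM over NRPE; the VM is the single place holding cluster credentials.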
Kevin
On 06/19/2018 04:13 PM, Denny Fuchs wrote:
hi,
On 19.06.2018 at 17:17, Kevin Hrpcek <kevin.hrp...@ssec.wisc.edu> wrote:
# ceph auth get client.icinga
exported keyring for client.icinga
[client.icinga]
key = <nope>
caps mgr = "allow r"
caps mon = "allow r"
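For reference, a read-only keyring with exactly those caps can be created (or fetched, if it already exists) in one step; the client name and output path here are just examples:

```shell
# Create a client key limited to read-only mon/mgr access
ceph auth get-or-create client.icinga mon 'allow r' mgr 'allow r' \
    -o /etc/ceph/ceph.client.icinga.keyring
```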
That's the point: it's fine to check whether all processes are up and running, and maybe add
some checks for the disks. But imagine checking the "health" state: the state is the same on
all OSD nodes, because ... it's a cluster. So if you run "ceph osd set noout" on one node, you
get a warning for every OSD node (check_ceph_health). The same goes for every check that
monitors a cluster-wide setting, like df or lost OSDs (70 in out of 72 ...). Most checks also
produce performance data (which can be disabled), which is saved in a database.
The same for Telegraf(*): every node transmits the same data (because the
cluster data is the same on all nodes).
I also took a quick look at the Ceph mgr dashboard, which I would have to
enable on all(?) OSD nodes, and then build some construct to reach the
dashboard on the currently active mgr.
I don't believe I'm the first person thinking about a dedicated VM used only
for monitoring tools (Icinga / Zabbix / Nagios / Dashboard / ceph -s), getting
the overall status (and performance data) from there. The only things I need
to keep directly on the OSD nodes are the OSD (I/O) disk and network checks,
but thanks to InfluxDB ... I can put them all on one dashboard :-)
@Kevin nice work :-) Because of PCI DSS, the Icinga2 master can't reach
Ceph directly, so we have a satellite / agent construct to get the checks executed.
cu denny
ps. One bad thing: Telegraf can't read the /var/run/ceph/*sock files, because
of the permissions set after the OSD services start
(https://github.com/influxdata/telegraf/issues/1657). This was fixed upstream, but I
haven't checked whether the patch is also included in the Proxmox Ceph packages.
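Until a patched Telegraf is in place, one possible workaround (assuming Telegraf runs as a "telegraf" user and the admin sockets are owned by the "ceph" group, which may differ on your system) is:

```shell
# Let the telegraf user read the admin sockets via the ceph group
usermod -aG ceph telegraf

# Admin sockets are recreated when the daemons start,
# so the perms have to be re-applied after each (re)start
chmod g+rw /var/run/ceph/*.asok
systemctl restart telegraf
```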
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com