Clean Nautilus install I set up last week

 * 5 Proxmox nodes
     o All on latest updates via the no-subscription channel
 * 18 OSDs
 * 3 Managers
 * 3 Monitors
 * Cluster health good
 * In a protracted rebalance phase
 * All managed via Proxmox

I thought I would enable telemetry for Ceph as per this article:

https://docs.ceph.com/docs/master/mgr/telemetry/


 * Enabled the module (command line)
 * ceph telemetry on
 * Tested getting the status
 * Set the contact and description
   ceph config set mgr mgr/telemetry/contact 'John Doe <john....@example.com>'
   ceph config set mgr mgr/telemetry/description 'My first Ceph cluster'
   ceph config set mgr mgr/telemetry/channel_ident true
 * Tried sending it
   ceph telemetry send
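
For reference, the whole sequence condensed into one place looks roughly like this (the module-enable and status commands are my reconstruction of the steps above; the rest is what I ran):

   ceph mgr module enable telemetry
   ceph telemetry on
   ceph telemetry status
   ceph config set mgr mgr/telemetry/contact 'John Doe <john....@example.com>'
   ceph config set mgr mgr/telemetry/description 'My first Ceph cluster'
   ceph config set mgr mgr/telemetry/channel_ident true
   ceph telemetry send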

I *think* this is when the managers died, but it could have been earlier. Around then all Ceph I/O stopped and I discovered all three managers had crashed and would not restart. I was shitting myself because this was remote and the router is a pfSense VM :) Fortunately it kept going even without its disk responding.

systemctl start ceph-mgr@vni.service
Job for ceph-mgr@vni.service failed because the control process exited with error code. See "systemctl status ceph-mgr@vni.service" and "journalctl -xe" for details.

From journalctl -xe

   -- The unit ceph-mgr@vni.service has entered the 'failed' state with result 'exit-code'.
   Jun 18 21:02:25 vni systemd[1]: Failed to start Ceph cluster manager daemon.
   -- Subject: A start job for unit ceph-mgr@vni.service has failed
   -- Defined-By: systemd
   -- Support: https://www.debian.org/support
   --
   -- A start job for unit ceph-mgr@vni.service has finished with a failure.
   --
   -- The job identifier is 91690 and the job result is failed.


From systemctl status ceph-mgr@vni.service

ceph-mgr@vni.service - Ceph cluster manager daemon
   Loaded: loaded (/lib/systemd/system/ceph-mgr@.service; enabled; vendor preset: enabled)
  Drop-In: /lib/systemd/system/ceph-mgr@.service.d
           └─ceph-after-pve-cluster.conf
   Active: failed (Result: exit-code) since Thu 2020-06-18 20:53:52 AEST; 8min ago
  Process: 415566 ExecStart=/usr/bin/ceph-mgr -f --cluster ${CLUSTER} --id vni --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
 Main PID: 415566 (code=exited, status=1/FAILURE)

Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Service RestartSec=10s expired, scheduling restart.
Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Scheduled restart job, restart counter is at 4.
Jun 18 20:53:52 vni systemd[1]: Stopped Ceph cluster manager daemon.
Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Start request repeated too quickly.
Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Failed with result 'exit-code'.
Jun 18 20:53:52 vni systemd[1]: Failed to start Ceph cluster manager daemon.
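
In hindsight, systemd only tells you that the daemon exited; if someone needs the actual traceback from the mgr, the place to look is probably the daemon's own log rather than the unit status. Something like the following (default log path, swap in your own mgr id):

   journalctl -u ceph-mgr@vni.service -n 100
   less /var/log/ceph/ceph-mgr.vni.log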

I created a new manager service on an unused node and fortunately that worked. I deleted/recreated the old managers and they started working. It was a sweaty few minutes :)
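
For anyone wanting the CLI version of that recovery, Proxmox can create and destroy managers with pveceph; roughly like this (syntax from memory on PVE 6, and <node> is just a placeholder, so double-check before running):

   pveceph mgr create            # run on the spare node to get a working mgr
   pveceph mgr destroy <node>    # then remove each dead manager
   pveceph mgr create            # and recreate it on its original node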


Everything resumed without a hiccup after that, which impressed me. Not game to try to reproduce it, though.
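
If anyone is nervous after reading this, the telemetry module can be switched back off again with the standard Ceph commands (I haven't needed to on this cluster):

   ceph telemetry off
   ceph mgr module disable telemetry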



--
Lindsay

