Clean Nautilus install I set up last week

 * 5 Proxmox nodes
     o All on latest updates via the no-subscription channel
 * 18 OSDs
 * 3 Managers
 * 3 Monitors
 * Cluster health good
 * In a protracted rebalance phase
 * All managed via Proxmox

I thought I would enable telemetry for Ceph as per this article:

https://docs.ceph.com/docs/master/mgr/telemetry/


 * Enabled the module (command line)
 * ceph telemetry on
 * Tested getting the status
 * Set the contact and description
   ceph config set mgr mgr/telemetry/contact 'John Doe <john....@example.com>'
   ceph config set mgr mgr/telemetry/description 'My first Ceph cluster'
   ceph config set mgr mgr/telemetry/channel_ident true
 * Tried sending it
   ceph telemetry send
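
For reference, the whole sequence condensed into one place looks roughly like this (the module-enable and status commands are my reconstruction of the steps above; the rest is what I ran):

   ceph mgr module enable telemetry
   ceph telemetry on
   ceph telemetry status
   ceph config set mgr mgr/telemetry/contact 'John Doe <john....@example.com>'
   ceph config set mgr mgr/telemetry/description 'My first Ceph cluster'
   ceph config set mgr mgr/telemetry/channel_ident true
   ceph telemetry send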

I *think* this is when the managers died, but it could have been earlier. Around then all Ceph I/O stopped and I discovered all three managers had crashed and would not restart. I was shitting myself because this was remote and the router is a pfSense VM :) Fortunately it kept going even without its disk responding.

systemctl start ceph-mgr@vni.service
Job for ceph-mgr@vni.service failed because the control process exited with error code. See "systemctl status ceph-mgr@vni.service" and "journalctl -xe" for details.

From journalctl -xe

   -- The unit ceph-mgr@vni.service has entered the 'failed' state with result 'exit-code'.
   Jun 18 21:02:25 vni systemd[1]: Failed to start Ceph cluster manager daemon.
   -- Subject: A start job for unit ceph-mgr@vni.service has failed
   -- Defined-By: systemd
   -- Support: https://www.debian.org/support
   --
   -- A start job for unit ceph-mgr@vni.service has finished with a failure.
   --
   -- The job identifier is 91690 and the job result is failed.


From systemctl status ceph-mgr@vni.service

ceph-mgr@vni.service - Ceph cluster manager daemon
   Loaded: loaded (/lib/systemd/system/ceph-mgr@.service; enabled; vendor preset: enabled)
  Drop-In: /lib/systemd/system/ceph-mgr@.service.d
           └─ceph-after-pve-cluster.conf
   Active: failed (Result: exit-code) since Thu 2020-06-18 20:53:52 AEST; 8min ago
  Process: 415566 ExecStart=/usr/bin/ceph-mgr -f --cluster ${CLUSTER} --id vni --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
 Main PID: 415566 (code=exited, status=1/FAILURE)

Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Service RestartSec=10s expired, scheduling restart.
Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Scheduled restart job, restart counter is at 4.
Jun 18 20:53:52 vni systemd[1]: Stopped Ceph cluster manager daemon.
Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Start request repeated too quickly.
Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Failed with result 'exit-code'.
Jun 18 20:53:52 vni systemd[1]: Failed to start Ceph cluster manager daemon.
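
In hindsight, systemd only tells you that the daemon exited; if someone needs the actual traceback from the mgr, the place to look is probably the daemon's own log rather than the unit status. Something like the following (default log path, swap in your own mgr id):

   journalctl -u ceph-mgr@vni.service -n 100
   less /var/log/ceph/ceph-mgr.vni.log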

I created a new manager service on an unused node and fortunately that worked. I deleted/recreated the old managers and they started working. It was a sweaty few minutes :)
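
For anyone wanting the CLI version of that recovery, Proxmox can create and destroy managers with pveceph; roughly like this (syntax from memory on PVE 6, and <node> is just a placeholder, so double-check before running):

   pveceph mgr create            # run on the spare node to get a working mgr
   pveceph mgr destroy <node>    # then remove each dead manager
   pveceph mgr create            # and recreate it on its original node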


Everything resumed without a hiccup after that, which impressed me. Not game to try to reproduce it, though.
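
If anyone is nervous after reading this, the telemetry module can be switched back off again with the standard Ceph commands (I haven't needed to on this cluster):

   ceph telemetry off
   ceph mgr module disable telemetry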



--
Lindsay

