Nice save. And thanks for the detailed info.
On Thursday, June 18, 2020, Lindsay Mathieson <lindsay.mathie...@gmail.com> wrote:
> Clean Nautilus install I set up last week
>
> * 5 Proxmox nodes
>   o All on latest updates via the no-subscription channel
> * 18 OSDs
> * 3 Managers
> * 3 Monitors
> * Cluster health good
> * In a protracted rebalance phase
> * All managed via Proxmox
>
> I thought I would enable telemetry for Ceph as per this article:
> https://docs.ceph.com/docs/master/mgr/telemetry/
>
> * Enabled the module (command line)
>   ceph telemetry on
> * Tested getting the status
> * Set the contact and description
>   ceph config set mgr mgr/telemetry/contact 'John Doe <john....@example.com>'
>   ceph config set mgr mgr/telemetry/description 'My first Ceph cluster'
>   ceph config set mgr mgr/telemetry/channel_ident true
> * Tried sending it
>   ceph telemetry send
>
> I *think* this is when the managers died, but it could have been earlier. Around then all Ceph IO stopped and I discovered all three managers had crashed and would not restart. I was shitting myself because this was remote and the router is a pfSense VM :) Fortunately it kept going without its disk responding.
>
> systemctl start ceph-mgr@vni.service
> Job for ceph-mgr@vni.service failed because the control process exited with error code.
> See "systemctl status ceph-mgr@vni.service" and "journalctl -xe" for details.
>
> From journalctl -xe:
>
> -- The unit ceph-mgr@vni.service has entered the 'failed' state with result 'exit-code'.
> Jun 18 21:02:25 vni systemd[1]: Failed to start Ceph cluster manager daemon.
> -- Subject: A start job for unit ceph-mgr@vni.service has failed
> -- Defined-By: systemd
> -- Support: https://www.debian.org/support
> --
> -- A start job for unit ceph-mgr@vni.service has finished with a failure.
> --
> -- The job identifier is 91690 and the job result is failed.
>
> From systemctl status ceph-mgr@vni.service:
>
> ceph-mgr@vni.service - Ceph cluster manager daemon
>    Loaded: loaded (/lib/systemd/system/ceph-mgr@.service; enabled; vendor preset: enabled)
>   Drop-In: /lib/systemd/system/ceph-mgr@.service.d
>            └─ceph-after-pve-cluster.conf
>    Active: failed (Result: exit-code) since Thu 2020-06-18 20:53:52 AEST; 8min ago
>   Process: 415566 ExecStart=/usr/bin/ceph-mgr -f --cluster ${CLUSTER} --id vni --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
>  Main PID: 415566 (code=exited, status=1/FAILURE)
>
> Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Service RestartSec=10s expired, scheduling restart.
> Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Scheduled restart job, restart counter is at 4.
> Jun 18 20:53:52 vni systemd[1]: Stopped Ceph cluster manager daemon.
> Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Start request repeated too quickly.
> Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Failed with result 'exit-code'.
> Jun 18 20:53:52 vni systemd[1]: Failed to start Ceph cluster manager daemon.
>
> I created a new manager service on an unused node and fortunately that worked. I deleted/recreated the old managers and they started working. It was a sweaty few minutes :)
>
> Everything resumed without a hiccup after that, impressed. Not game to try and reproduce it though.
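In case anyone else hits the same wall: the "Start request repeated too quickly" lines in your status output mean systemd hit its restart limit and gave up, so the mgr won't come back until that failed state is cleared. I haven't tried to reproduce the crash, so the following is only what I'd reach for first (assuming a Nautilus / PVE 6 setup like yours, and your mgr id "vni") before recreating the managers:

  # clear systemd's restart-limit state, then retry the mgr
  systemctl reset-failed ceph-mgr@vni.service
  systemctl start ceph-mgr@vni.service

  # if the mgrs still die on start, turn the suspect module off first;
  # module enable/disable goes through the monitors, so it should work
  # even with no active mgr
  ceph mgr module disable telemetry

  # once a mgr is running again, see whether the crash module caught a
  # backtrace worth attaching to a bug report
  ceph crash ls
  ceph crash info <crash-id>    # <crash-id> taken from the ls output

For the rebuild itself, "pveceph mgr destroy <id>" followed by "pveceph mgr create" (or the same via the GUI) should amount to what you did by hand, if memory serves.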
> --
> Lindsay

_______________________________________________
pve-user mailing list
pve-user@pve.proxmox.com
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user