unsubscribe

On Fri, 19 Jun 2020 at 13:00, <pve-user-requ...@pve.proxmox.com> wrote:
Today's Topics:

   1. Enabling telemetry broke all my ceph managers (Lindsay Mathieson)
   2. Re: Enabling telemetry broke all my ceph managers (Brian :)

----------------------------------------------------------------------

Message: 1
Date: Thu, 18 Jun 2020 21:30:38 +1000
From: Lindsay Mathieson <lindsay.mathie...@gmail.com>
To: PVE User List <pve-user@pve.proxmox.com>
Subject: [PVE-User] Enabling telemetry broke all my ceph managers
Message-ID: <a6481a31-5d59-c13c-dea2-5367842c2...@gmail.com>
Content-Type: text/plain; charset=utf-8; format=flowed

Clean Nautilus install I set up last week:

 * 5 Proxmox nodes
   o All on latest updates via the no-subscription channel
 * 18 OSDs
 * 3 Managers
 * 3 Monitors
 * Cluster health good
 * In a protracted rebalance phase
 * All managed via Proxmox

I thought I would enable telemetry for Ceph as per this article:

https://docs.ceph.com/docs/master/mgr/telemetry/

 * Enabled the module (command line)
       ceph telemetry on
 * Tested getting the status
 * Set the contact and description
       ceph config set mgr mgr/telemetry/contact 'John Doe <john....@example.com>'
       ceph config set mgr mgr/telemetry/description 'My first Ceph cluster'
       ceph config set mgr mgr/telemetry/channel_ident true
 * Tried sending it
       ceph telemetry send

The full sequence is collected below for reference.
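A minimal sketch of that sequence, assuming the Nautilus-era syntax from the
doc page above (the module-enable line and the status/show preview commands
come from that page; the contact and description values are placeholders, not
what I actually set):

  # make sure the mgr telemetry module is loaded, then switch it on
  ceph mgr module enable telemetry
  ceph telemetry on

  # confirm the module reports itself as enabled
  ceph telemetry status

  # optional identification details (placeholder values)
  ceph config set mgr mgr/telemetry/contact 'Your Name <you@example.com>'
  ceph config set mgr mgr/telemetry/description 'My first Ceph cluster'
  ceph config set mgr mgr/telemetry/channel_ident true

  # preview the report that would be submitted, then send it
  ceph telemetry show
  ceph telemetry send

For what it's worth, 'ceph telemetry off' disables the reporting again.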
I *think* this is when the managers died, but it could have been earlier. But
around then all Ceph IO stopped and I discovered all three managers had
crashed and would not restart. I was shitting myself because this was remote
and the router is a pfSense VM :) Fortunately it kept going without its disk
responding.

  systemctl start ceph-mgr@vni.service
  Job for ceph-mgr@vni.service failed because the control process exited with error code.
  See "systemctl status ceph-mgr@vni.service" and "journalctl -xe" for details.

From journalctl -xe:

  -- The unit ceph-mgr@vni.service has entered the 'failed' state with result 'exit-code'.
  Jun 18 21:02:25 vni systemd[1]: Failed to start Ceph cluster manager daemon.
  -- Subject: A start job for unit ceph-mgr@vni.service has failed
  -- Defined-By: systemd
  -- Support: https://www.debian.org/support
  --
  -- A start job for unit ceph-mgr@vni.service has finished with a failure.
  --
  -- The job identifier is 91690 and the job result is failed.

From systemctl status ceph-mgr@vni.service:

  ceph-mgr@vni.service - Ceph cluster manager daemon
     Loaded: loaded (/lib/systemd/system/ceph-mgr@.service; enabled; vendor preset: enabled)
    Drop-In: /lib/systemd/system/ceph-mgr@.service.d
             └─ceph-after-pve-cluster.conf
     Active: failed (Result: exit-code) since Thu 2020-06-18 20:53:52 AEST; 8min ago
    Process: 415566 ExecStart=/usr/bin/ceph-mgr -f --cluster ${CLUSTER} --id vni --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
   Main PID: 415566 (code=exited, status=1/FAILURE)

  Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Service RestartSec=10s expired, scheduling restart.
  Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Scheduled restart job, restart counter is at 4.
  Jun 18 20:53:52 vni systemd[1]: Stopped Ceph cluster manager daemon.
  Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Start request repeated too quickly.
  Jun 18 20:53:52 vni systemd[1]: ceph-mgr@vni.service: Failed with result 'exit-code'.
  Jun 18 20:53:52 vni systemd[1]: Failed to start Ceph cluster manager daemon.

I created a new manager service on an unused node and fortunately that
worked. I then deleted and recreated the old managers and they started
working again (roughly the Proxmox-side commands sketched below). It was a
sweaty few minutes :)
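Since everything here is managed through Proxmox, the fix boiled down to
something like the following pveceph calls (a sketch from memory, not
re-tested; PVE 6.x syntax assumed, and 'vni' is simply the node from the logs
above):

  # first bring up a manager on a node that does not have one yet
  # (run this on that spare node)
  pveceph mgr create

  # then, on each node with a crashed manager (vni here), remove and
  # recreate it, checking cluster state in between
  pveceph mgr destroy vni
  pveceph mgr create
  ceph -s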
Everything resumed without a hiccup after that, impressed. Not game to try
and reproduce it though.

--
Lindsay


------------------------------

Message: 2
Date: Thu, 18 Jun 2020 23:06:40 +0100
From: "Brian :" <bri...@iptel.co>
To: PVE User List <pve-user@pve.proxmox.com>
Subject: Re: [PVE-User] Enabling telemetry broke all my ceph managers
Message-ID: <CAGPQfi_xwebe=meekodholn1s30bkx9cddiedjvlfvvqzh7...@mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"

Nice save. And thanks for the detailed info.

------------------------------

End of pve-user Digest, Vol 147, Issue 10
*****************************************

--
Best regards,
Токовенко Алексей Алексеевич

_______________________________________________
pve-user mailing list
pve-user@pve.proxmox.com
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user