[ceph-users] Re: MGR restart loop

Frank Schilder Tue, 17 Nov 2020 12:06:10 -0800

Addition: This happens only when I stop mon.ceph-01; I can stop any other MON 
daemon without problems. I checked network connectivity, and all hosts can see 
all other hosts.


I already increased mon_mgr_beacon_grace to a huge value due to another bug a 
long time ago:

    global advanced mon_mgr_beacon_grace 86400

This restart cycle seems to have a different cause. The log contains these 
lines just before the MGR drops out:

Nov 17 16:10:10 ceph-03 journal: 2020-11-17 16:10:10.179 7f7c544ea700  1 mgr send_beacon active
Nov 17 16:10:10 ceph-03 journal: 2020-11-17 16:10:10.193 7f7c544ea700  0 log_channel(cluster) log [DBG] : pgmap v4: 3215 pgs: 3208 active+clean, 7 active+clean+scrubbing+deep; 689 TiB data, 877 TiB used, 1.1 PiB / 1.9 PiB avail
Nov 17 16:10:10 ceph-02 journal: debug 2020-11-17 16:10:10.270 7f7bc2363700  0 log_channel(cluster) log [INF] : Manager daemon ceph-03 is unresponsive.  No standby daemons available.
Nov 17 16:10:10 ceph-02 journal: debug 2020-11-17 16:10:10.270 7f7bc2363700  0 log_channel(cluster) log [WRN] : Health check failed: no active mgr (MGR_DOWN)
Nov 17 16:10:10 ceph-02 journal: debug 2020-11-17 16:10:10.313 7f7bbbb56700  0 log_channel(cluster) log [DBG] : mgrmap e1330: no daemons active
Nov 17 16:10:10 ceph-03 journal: 2020-11-17 16:10:10.340 7f7c57cf1700 -1 mgr handle_mgr_map I was active but no longer am
Nov 17 16:10:10 ceph-03 journal: 2020-11-17 16:10:10.340 7f7c57cf1700  1 mgr respawn  e: '/usr/bin/ceph-mgr'

The beacon has been sent. Why does it not arrive at the MONs? There is very 
little load right now.
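
To put a number on how soon after the beacon the MGR is declared unresponsive, 
the daemon timestamps in the excerpt can be compared. Below is a minimal Python 
sketch of that comparison, not part of any Ceph tooling. The helper names 
(stamp_of, beacon_to_failure_ms) are made up for illustration, the sample lines 
are copied from the log above, and the result is only meaningful if the clocks 
on ceph-02 and ceph-03 are in sync.

```python
import re
from datetime import datetime, timedelta

# Three lines copied from the log excerpt in this mail (unwrapped).
LOG = """\
Nov 17 16:10:10 ceph-03 journal: 2020-11-17 16:10:10.179 7f7c544ea700  1 mgr send_beacon active
Nov 17 16:10:10 ceph-02 journal: debug 2020-11-17 16:10:10.270 7f7bc2363700  0 log_channel(cluster) log [INF] : Manager daemon ceph-03 is unresponsive.  No standby daemons available.
Nov 17 16:10:10 ceph-03 journal: 2020-11-17 16:10:10.340 7f7c57cf1700 -1 mgr handle_mgr_map I was active but no longer am
"""

# Daemon-side timestamp with millisecond precision, e.g. 2020-11-17 16:10:10.179
STAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}")


def stamp_of(lines, needle):
    """Return the timestamp of the first line containing `needle`, or None."""
    for line in lines:
        if needle in line:
            match = STAMP.search(line)
            if match:
                return datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S.%f")
    return None


def beacon_to_failure_ms(text):
    """Milliseconds between the MGR's beacon and the MON's 'unresponsive' line.

    Hypothetical helper: it compares timestamps written by two different
    hosts, so any clock skew between them ends up in the result.
    """
    lines = text.splitlines()
    sent = stamp_of(lines, "send_beacon active")
    failed = stamp_of(lines, "is unresponsive")
    if sent is None or failed is None:
        return None
    return (failed - sent) // timedelta(milliseconds=1)


if __name__ == "__main__":
    print(beacon_to_failure_ms(LOG))
```

If the gap turns out to be far below the configured mon_mgr_beacon_grace, that 
would be consistent with the MGR not being dropped by a plain grace-period 
timeout, but this alone does not prove it.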

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <[email protected]>
Sent: 17 November 2020 16:25:36
To: [email protected]
Subject: [ceph-users] MGR restart loop

Dear cephers,

I have a problem with MGR daemons, ceph version mimic-13.2.8. I'm trying to do 
maintenance on our MON/MGR servers and am through with 2 out of 3. I have MON 
and MGR collocated on a host, 3 hosts in total. So far, the procedure was to 
stop the daemons on the server and do the maintenance. Now I'm stuck at the 
last server, because MGR fail-over does not work. The remaining MGR instances 
go into a restart loop.

In an attempt to mitigate this, I stopped all but 1 MGR, leaving it on a node 
that is done with maintenance. Everything was fine. However, as soon as I stop 
the last MON I need to do maintenance on, the last remaining MGR goes into a 
restart loop all by itself. As far as I can see, the MGR does not actually 
restart, it just gets thrown out of the cluster. Here is a ceph status before 
stopping mon.ceph-01:

[root@ceph-01 ~]# ceph status
  cluster:
    id:     xxx
    health: HEALTH_WARN
            1 pools nearfull

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-03(active)
    mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
    osd: 302 osds: 281 up, 281 in

  data:
    pools:   11 pools, 3215 pgs
    objects: 334.1 M objects, 689 TiB
    usage:   877 TiB used, 1.1 PiB / 1.9 PiB avail
    pgs:     3208 active+clean
             7    active+clean+scrubbing+deep

As soon as I stop mon.ceph-01, all hell breaks loose. Note that mgr.ceph-03 is 
collocated with mon.ceph-03 and we have quorum between mon.ceph-02 and 
mon.ceph-03. Here are ceph status snapshots taken after shutting down 
mon.ceph-01:

[root@ceph-01 ~]# ceph status
  cluster:
    id:     xxx
    health: HEALTH_WARN
            1 pools nearfull
            1/3 mons down, quorum ceph-02,ceph-03

  services:
    mon: 3 daemons, quorum ceph-02,ceph-03, out of quorum: ceph-01
    mgr: ceph-03(active)
    mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
    osd: 302 osds: 281 up, 281 in

  data:
    pools:   11 pools, 3215 pgs
    objects: 334.1 M objects, 689 TiB
    usage:   877 TiB used, 1.1 PiB / 1.9 PiB avail
    pgs:     3207 active+clean
             8    active+clean+scrubbing+deep

[root@ceph-01 ~]# ceph status
  cluster:
    id:     xxx
    health: HEALTH_WARN
            no active mgr
            1 pools nearfull
            1/3 mons down, quorum ceph-02,ceph-03

  services:
    mon: 3 daemons, quorum ceph-02,ceph-03, out of quorum: ceph-01
    mgr: no daemons active
    mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
    osd: 302 osds: 281 up, 281 in

  data:
    pools:   11 pools, 3215 pgs
    objects: 334.1 M objects, 689 TiB
    usage:   877 TiB used, 1.1 PiB / 1.9 PiB avail
    pgs:     3207 active+clean
             8    active+clean+scrubbing+deep

[root@ceph-01 ~]# ceph status
  cluster:
    id:     xxx
    health: HEALTH_WARN
            1 pools nearfull
            1/3 mons down, quorum ceph-02,ceph-03

  services:
    mon: 3 daemons, quorum ceph-02,ceph-03, out of quorum: ceph-01
    mgr: ceph-03(active, starting)
    mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
    osd: 302 osds: 281 up, 281 in

  data:
    pools:   11 pools, 3215 pgs
    objects: 334.1 M objects, 689 TiB
    usage:   877 TiB used, 1.1 PiB / 1.9 PiB avail
    pgs:     3207 active+clean
             8    active+clean+scrubbing+deep

It is cycling through these 3 states and I couldn't find a reason why. The node 
ceph-01 is not special in any way.
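
The cycle can be made easier to follow by reducing each status snapshot to the 
state on its "mgr:" line. The following Python sketch is only an illustration 
under the assumption that the line looks exactly like in the three snapshots 
above; the function names (mgr_state, transitions) and the state labels are 
hypothetical and not part of Ceph.

```python
import re

# Matches the "mgr:" line of plain-text `ceph status` output as shown above.
MGR_LINE = re.compile(r"^\s*mgr:\s*(.+?)\s*$", re.MULTILINE)


def mgr_state(status_text):
    """Classify one snapshot as 'active', 'starting', 'down' or 'unknown'."""
    match = MGR_LINE.search(status_text)
    if match is None:
        return "unknown"
    value = match.group(1)
    if value.startswith("no daemons active"):
        return "down"
    if "(active, starting)" in value:
        return "starting"
    if "(active)" in value:
        return "active"
    return "unknown"


def transitions(snapshots):
    """Return the sequence of MGR states with consecutive repeats collapsed."""
    states = []
    for snapshot in snapshots:
        state = mgr_state(snapshot)
        if not states or states[-1] != state:
            states.append(state)
    return states


if __name__ == "__main__":
    # The mgr lines of the three snapshots taken after stopping mon.ceph-01.
    samples = [
        "  services:\n    mgr: ceph-03(active)\n",
        "  services:\n    mgr: no daemons active\n",
        "  services:\n    mgr: ceph-03(active, starting)\n",
    ]
    print(transitions(samples))
```

Fed with the output of repeated ceph status calls, together with the time of 
each call, this would show how long each state lasts per cycle.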

Any hint would be greatly appreciated.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]