Dear cephers,
I have a problem with MGR daemons, ceph version mimic-13.2.8. I'm trying to do
maintenance on our MON/MGR servers and am through with 2 out of 3. I have MON
and MGR collocated on each host, 3 hosts in total. So far, the procedure has
been to stop the daemons on the server and do the maintenance (see the sketch
below). Now I'm stuck at the last server, because MGR fail-over does not work:
the remaining MGR instances go into a restart loop.
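For reference, this is roughly what the per-host maintenance step looks like; a
sketch only, assuming the standard systemd unit names (ceph-0X stands for the
host being serviced):

[root@ceph-0X ~]# systemctl stop ceph-mgr@ceph-0X    # stop the collocated MGR
[root@ceph-0X ~]# systemctl stop ceph-mon@ceph-0X    # then the MON
... do the actual maintenance ...
[root@ceph-0X ~]# systemctl start ceph-mon@ceph-0X
[root@ceph-0X ~]# systemctl start ceph-mgr@ceph-0X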
In an attempt to mitigate this, I stopped all MGRs except one, on a node that
is already done with maintenance. Everything was fine. However, as soon as I
stop the last MON I still need to do maintenance on, the last remaining MGR
goes into a restart loop all by itself. As far as I can see, the MGR does not
actually restart; it just gets thrown out of the cluster. Here is a ceph status
before stopping mon.ceph-01:
[root@ceph-01 ~]# ceph status
  cluster:
    id:     xxx
    health: HEALTH_WARN
            1 pools nearfull

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-03(active)
    mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
    osd: 302 osds: 281 up, 281 in

  data:
    pools:   11 pools, 3215 pgs
    objects: 334.1 M objects, 689 TiB
    usage:   877 TiB used, 1.1 PiB / 1.9 PiB avail
    pgs:     3208 active+clean
             7    active+clean+scrubbing+deep
As soon as I stop mon.ceph-01, all hell breaks loose. Note that mgr.ceph-03 is
collocated with mon.ceph-03 and we still have quorum between mon.ceph-02 and
mon.ceph-03. Here are ceph status snapshots taken after shutting down
mon.ceph-01:
[root@ceph-01 ~]# ceph status
  cluster:
    id:     xxx
    health: HEALTH_WARN
            1 pools nearfull
            1/3 mons down, quorum ceph-02,ceph-03

  services:
    mon: 3 daemons, quorum ceph-02,ceph-03, out of quorum: ceph-01
    mgr: ceph-03(active)
    mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
    osd: 302 osds: 281 up, 281 in

  data:
    pools:   11 pools, 3215 pgs
    objects: 334.1 M objects, 689 TiB
    usage:   877 TiB used, 1.1 PiB / 1.9 PiB avail
    pgs:     3207 active+clean
             8    active+clean+scrubbing+deep
[root@ceph-01 ~]# ceph status
  cluster:
    id:     xxx
    health: HEALTH_WARN
            no active mgr
            1 pools nearfull
            1/3 mons down, quorum ceph-02,ceph-03

  services:
    mon: 3 daemons, quorum ceph-02,ceph-03, out of quorum: ceph-01
    mgr: no daemons active
    mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
    osd: 302 osds: 281 up, 281 in

  data:
    pools:   11 pools, 3215 pgs
    objects: 334.1 M objects, 689 TiB
    usage:   877 TiB used, 1.1 PiB / 1.9 PiB avail
    pgs:     3207 active+clean
             8    active+clean+scrubbing+deep
[root@ceph-01 ~]# ceph status
  cluster:
    id:     xxx
    health: HEALTH_WARN
            1 pools nearfull
            1/3 mons down, quorum ceph-02,ceph-03

  services:
    mon: 3 daemons, quorum ceph-02,ceph-03, out of quorum: ceph-01
    mgr: ceph-03(active, starting)
    mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
    osd: 302 osds: 281 up, 281 in

  data:
    pools:   11 pools, 3215 pgs
    objects: 334.1 M objects, 689 TiB
    usage:   877 TiB used, 1.1 PiB / 1.9 PiB avail
    pgs:     3207 active+clean
             8    active+clean+scrubbing+deep
It keeps cycling through these 3 states and I can't find a reason why. The node
ceph-01 is not special in any way.
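In case more information is needed, this is roughly how I can capture the cycle
and the MGR's view of it; a sketch only, assuming the standard systemd unit
names on our hosts:

[root@ceph-01 ~]# watch -n 5 ceph status               # watch the 3 states cycle
[root@ceph-03 ~]# journalctl -u ceph-mgr@ceph-03 -f    # follow the active MGR's log
[root@ceph-02 ~]# ceph mgr dump | grep -E '"epoch"|"active_name"'   # mgr map churn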
Any hint would be greatly appreciated.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14