Re: [ceph-users] MGR Logs after Failure Testing
You may want to configure your standby MDS daemons as "standby-replay", so that the MDS taking over from a failed one needs less time to become active. To manage this, add something like the following to your ceph.conf:

---snip---
[mds.server1]
mds_standby_replay = true
mds_standby_for_rank = 0

[mds.server2]
mds_standby_replay = true
mds_standby_for_rank = 0

[mds.server3]
mds_standby_replay = true
mds_standby_for_rank = 0
---snip---

For your setup this would mean you have one active MDS, one standby-replay (which takes over almost immediately; depending on the load a very short interruption can still happen) and one standby ("cold standby" if you will). Currently both of your standby MDS daemons are "cold".
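A quick way to confirm the setting is picked up, once the standby MDS daemons have been restarted so they re-read ceph.conf, is to look at the MDS map. This is only a sketch, assuming the default systemd unit naming (ceph-mds@<hostname>):

systemctl restart ceph-mds@S700030
ceph fs status cifs
ceph fs dump | grep -i standby

One of the standbys should then be reported as standby-replay for rank 0 instead of a plain standby. Note that on newer releases the per-daemon settings above are replaced by a filesystem-level flag, e.g. "ceph fs set cifs allow_standby_replay true".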
Zitat von dhils...@performair.com:

> Your question about configuring the MDS failover struck me as a potential
> cause, since I don't remember doing that; however, it looks like S700029
> (10.0.200.111) took over from S700028 (10.0.200.110) as the active MDS.
Re: [ceph-users] MGR Logs after Failure Testing
Eugen;

All services are running, yes, though they didn't all start when I brought the host up (they were configured not to start, because the last thing I had done was physically relocate the entire cluster). All services are now running, and happy.

# ceph status
  cluster:
    id:     1a8a1693-fa54-4cb3-89d2-7951d4cee6a3
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum S700028,S700029,S700030 (age 20h)
    mgr: S700028(active, since 17h), standbys: S700029, S700030
    mds: cifs:1 {0=S700029=up:active} 2 up:standby
    osd: 6 osds: 6 up (since 21h), 6 in (since 21h)

  data:
    pools:   16 pools, 192 pgs
    objects: 449 objects, 761 MiB
    usage:   724 GiB used, 65 TiB / 66 TiB avail
    pgs:     192 active+clean

# ceph osd tree
ID CLASS WEIGHT   TYPE NAME        STATUS REWEIGHT PRI-AFF
-1       66.17697 root default
-5       22.05899     host S700029
 2   hdd 11.02950         osd.2        up      1.0     1.0
 3   hdd 11.02950         osd.3        up      1.0     1.0
-7       22.05899     host S700030
 4   hdd 11.02950         osd.4        up      1.0     1.0
 5   hdd 11.02950         osd.5        up      1.0     1.0
-3       22.05899     host s700028
 0   hdd 11.02950         osd.0        up      1.0     1.0
 1   hdd 11.02950         osd.1        up      1.0     1.0

Your question about configuring the MDS failover struck me as a potential cause, since I don't remember doing that; however, it looks like S700029 (10.0.200.111) took over from S700028 (10.0.200.110) as the active MDS.

Thank you,

Dominic L. Hilsbos, MBA
Director - Information Technology
Perform Air International Inc.
dhils...@performair.com
www.PerformAir.com
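For reference, the rank assignment behind the failover observation above can be read straight from the MDS map; a minimal check, assuming a Nautilus-era CLI:

ceph mds stat
ceph fs dump | grep -iE 'up:active|standby'

The first command should show rank 0 held by S700029 (matching the "ceph status" output), and the second should list the other two daemons as plain standbys rather than standby-replay.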
Re: [ceph-users] MGR Logs after Failure Testing
Hi,

some more information about the cluster status would be helpful, such as

ceph -s
ceph osd tree

service status of all MONs, MDSs, MGRs.

Are all services up? Did you configure the spare MDS as standby for rank 0 so that a failover can happen?

Regards,
Eugen
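For the service status part, something along these lines gives a quick overview on each node; just a sketch, assuming the default systemd unit naming with the daemon id equal to the hostname:

systemctl list-units 'ceph*' --state=failed
systemctl status ceph-mon@S700028 ceph-mgr@S700028 ceph-mds@S700028

Adjust the hostname per node; any unit in a failed state points at a daemon that did not come back after the power pull.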
[ceph-users] MGR Logs after Failure Testing
All;

I built a demonstration and testing cluster, just 3 hosts (10.0.200.110, 111, 112). Each host runs mon, mgr, osd, mds.

During the demonstration yesterday, I pulled the power on one of the hosts.

After bringing the host back up, I'm getting several error messages every second or so:

2019-06-26 16:01:56.424 7fcbe0af9700  0 ms_deliver_dispatch: unhandled message 0x55e80a728f00 mgrreport(mds.S700030 +0-0 packed 6) v7 from mds.? v2:10.0.200.112:6808/980053124
2019-06-26 16:01:56.425 7fcbf4cd1700  1 mgr finish mon failed to return metadata for mds.S700030: (2) No such file or directory
2019-06-26 16:01:56.429 7fcbe0af9700  0 ms_deliver_dispatch: unhandled message 0x55e809f8e600 mgrreport(mds.S700029 +110-0 packed 1366) v7 from mds.0 v2:10.0.200.111:6808/2726495738
2019-06-26 16:01:56.430 7fcbf4cd1700  1 mgr finish mon failed to return metadata for mds.S700029: (2) No such file or directory

Thoughts?

Thank you,

Dominic L. Hilsbos, MBA
Director - Information Technology
Perform Air International Inc.
dhils...@performair.com
www.PerformAir.com
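The metadata the mgr is complaining about can also be queried from the monitors directly, which helps narrow down whether it is genuinely missing or just not being relayed. This is a general troubleshooting step rather than a confirmed fix, with the daemon names taken from the log lines above:

ceph mds metadata S700030
ceph mds metadata S700029

If those return the same "No such file or directory", restarting the affected MDS daemons, or failing over the active mgr (e.g. "ceph mgr fail S700028", the active mgr in the status output elsewhere in the thread), so the daemons re-register is a common way to clear this kind of state after an unclean shutdown.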