It's mds_beacon_grace.  Set it on the monitor to control how quickly laggy
MDS daemons are replaced, and usually set it to the same value on the MDS
daemon as well, since the daemon uses it to hold off on certain tasks when
it hasn't seen a mon beacon recently.
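
For example, in ceph.conf (a minimal sketch; the value shown is only
illustrative, tune it for your environment):

    [global]
        # seconds without a beacon before the mons consider an MDS laggy
        # and start replacing it; the MDS reads the same value to decide
        # when to hold off on certain tasks
        mds beacon grace = 15

or, if you'd rather change it at runtime, something along the lines of:

    ceph tell mon.\* injectargs '--mds_beacon_grace=15'
    ceph tell mds.\* injectargs '--mds_beacon_grace=15'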

John
On Mon, Sep 3, 2018 at 9:26 AM William Lawton <william.law...@irdeto.com> wrote:
>
> Which configuration option determines the MDS timeout period?
>
>
>
> William Lawton
>
>
>
> From: Gregory Farnum <gfar...@redhat.com>
> Sent: Thursday, August 30, 2018 5:46 PM
> To: William Lawton <william.law...@irdeto.com>
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] MDS does not always failover to hot standby on 
> reboot
>
>
>
> Yes, this is a consequence of co-locating the MDS and monitors — if the MDS 
> reports to its co-located monitor and both fail, the monitor cluster has to 
> go through its own failure detection and then wait for a full MDS timeout 
> period after that before it marks the MDS down. :(
>
>
>
> We might conceivably be able to optimize for this, but there's no general 
> solution. If you need to co-locate, one thing that would make it better 
> without being a lot of work is trying to have the MDS connect to one of the 
> monitors on a different host. You can do that by restricting the list of 
> monitors you feed it in ceph.conf, although that's not guaranteed to 
> *prevent* it from connecting to its own monitor if there are failures or 
> reconnects after first startup.
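> 
> For illustration (the addresses below are made up; the idea is just to list 
> only the monitors running on other hosts), the ceph.conf on the MDS host 
> could look like:
> 
>     [global]
>         # deliberately leave out the monitor that runs on this same host
>         mon host = 10.0.0.2, 10.0.0.3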
>
> -Greg
>
> On Thu, Aug 30, 2018 at 8:38 AM William Lawton <william.law...@irdeto.com> 
> wrote:
>
> Hi.
>
>
>
> We have a 5 node Ceph cluster (refer to the ceph -s output at the bottom of 
> this email). During resiliency tests we have an occasional problem when we 
> reboot the active MDS instance and a MON instance together, i.e. 
> dub-sitv-ceph-02 and dub-sitv-ceph-04. We expect the MDS to fail over to the 
> standby instance dub-sitv-ceph-01, which is in standby-replay mode, and 80% 
> of the time it does so with no problems. However, 20% of the time it doesn’t, 
> and the MDS_ALL_DOWN health check is not cleared until 30 seconds later when 
> the rebooted dub-sitv-ceph-02 and dub-sitv-ceph-04 instances come back up.
>
>
>
> When the MDS successfully fails over to the standby, we see the following in 
> ceph.log:
>
>
>
> 2018-08-25 00:30:02.231811 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 50 : 
> cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
>
> 2018-08-25 00:30:02.237389 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 52 : 
> cluster [INF] Standby daemon mds.dub-sitv-ceph-01 assigned to filesystem 
> cephfs as rank 0
>
> 2018-08-25 00:30:02.237528 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 54 : 
> cluster [INF] Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is 
> offline)
>
>
>
> When the active MDS role does not fail over to the standby, the MDS_ALL_DOWN 
> check is not cleared until after the rebooted instances have come back up, 
> e.g.:
>
>
>
> 2018-08-25 03:30:02.936554 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 55 : 
> cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
>
> 2018-08-25 03:30:04.235703 mon.dub-sitv-ceph-05 mon.2 10.18.186.208:6789/0 
> 226 : cluster [INF] mon.dub-sitv-ceph-05 calling monitor election
>
> 2018-08-25 03:30:04.238672 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 56 : 
> cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
>
> 2018-08-25 03:30:09.242595 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 57 : 
> cluster [INF] mon.dub-sitv-ceph-03 is new leader, mons 
> dub-sitv-ceph-03,dub-sitv-ceph-05 in quorum (ranks 0,2)
>
> 2018-08-25 03:30:09.252804 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 62 : 
> cluster [WRN] Health check failed: 1/3 mons down, quorum 
> dub-sitv-ceph-03,dub-sitv-ceph-05 (MON_DOWN)
>
> 2018-08-25 03:30:09.258693 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 63 : 
> cluster [WRN] overall HEALTH_WARN 2 osds down; 2 hosts (2 osds) down; 1/3 
> mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05
>
> 2018-08-25 03:30:10.254162 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 64 : 
> cluster [WRN] Health check failed: Reduced data availability: 2 pgs inactive, 
> 115 pgs peering (PG_AVAILABILITY)
>
> 2018-08-25 03:30:12.429145 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 66 : 
> cluster [WRN] Health check failed: Degraded data redundancy: 712/2504 objects 
> degraded (28.435%), 86 pgs degraded (PG_DEGRADED)
>
> 2018-08-25 03:30:16.137408 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 67 : 
> cluster [WRN] Health check update: Reduced data availability: 1 pg inactive, 
> 69 pgs peering (PG_AVAILABILITY)
>
> 2018-08-25 03:30:17.193322 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 68 : 
> cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data 
> availability: 1 pg inactive, 69 pgs peering)
>
> 2018-08-25 03:30:18.432043 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 69 : 
> cluster [WRN] Health check update: Degraded data redundancy: 1286/2572 
> objects degraded (50.000%), 166 pgs degraded (PG_DEGRADED)
>
> 2018-08-25 03:30:26.139491 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 71 : 
> cluster [WRN] Health check update: Degraded data redundancy: 1292/2584 
> objects degraded (50.000%), 166 pgs degraded (PG_DEGRADED)
>
> 2018-08-25 03:30:31.355321 mon.dub-sitv-ceph-04 mon.1 10.18.53.155:6789/0 1 : 
> cluster [INF] mon.dub-sitv-ceph-04 calling monitor election
>
> 2018-08-25 03:30:31.371519 mon.dub-sitv-ceph-04 mon.1 10.18.53.155:6789/0 2 : 
> cluster [WRN] message from mon.0 was stamped 0.817433s in the future, clocks 
> not synchronized
>
> 2018-08-25 03:30:32.175677 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 72 : 
> cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
>
> 2018-08-25 03:30:32.175864 mon.dub-sitv-ceph-05 mon.2 10.18.186.208:6789/0 
> 227 : cluster [INF] mon.dub-sitv-ceph-05 calling monitor election
>
> 2018-08-25 03:30:32.180615 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 73 : 
> cluster [INF] mon.dub-sitv-ceph-03 is new leader, mons 
> dub-sitv-ceph-03,dub-sitv-ceph-04,dub-sitv-ceph-05 in quorum (ranks 0,1,2)
>
> 2018-08-25 03:30:32.189593 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 78 : 
> cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum 
> dub-sitv-ceph-03,dub-sitv-ceph-05)
>
> 2018-08-25 03:30:32.190820 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 79 : 
> cluster [WRN] mon.1 10.18.53.155:6789/0 clock skew 0.811318s > max 0.05s
>
> 2018-08-25 03:30:32.194280 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 80 : 
> cluster [WRN] overall HEALTH_WARN 2 osds down; 2 hosts (2 osds) down; 
> Degraded data redundancy: 1292/2584 objects degraded (50.000%), 166 pgs 
> degraded
>
> 2018-08-25 03:30:35.076121 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 83 : 
> cluster [INF] daemon mds.dub-sitv-ceph-02 restarted
>
> 2018-08-25 03:30:35.270222 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 85 : 
> cluster [WRN] Health check failed: 1 filesystem is degraded (FS_DEGRADED)
>
> 2018-08-25 03:30:35.270267 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 86 : 
> cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
>
> 2018-08-25 03:30:35.282139 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 88 : 
> cluster [INF] Standby daemon mds.dub-sitv-ceph-01 assigned to filesystem 
> cephfs as rank 0
>
> 2018-08-25 03:30:35.282268 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 89 : 
> cluster [INF] Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is 
> offline)
>
>
>
> In the MDS log we’ve noticed that when the issue occurs, at precisely the 
> time the active MDS/MON nodes are rebooted, the standby MDS instance 
> briefly stops logging replay_done (as standby). This is shown in the log 
> excerpt below, where there is a 9s gap in these entries.
>
>
>
> 2018-08-25 03:30:00.085 7f3ab9b00700  1 mds.0.0 replay_done (as standby)
>
> 2018-08-25 03:30:01.091 7f3ab9b00700  1 mds.0.0 replay_done (as standby)
>
> 2018-08-25 03:30:10.332 7f3ab9b00700  1 mds.0.0 replay_done (as standby)
>
> 2018-08-25 03:30:11.333 7f3abb303700  1 mds.0.0 replay_done (as standby)
>
>
>
> I’ve tried to reproduce the issue by repeatedly rebooting each MDS instance 
> in turn, 5 minutes apart, but so far haven’t been able to, so my assumption 
> is that rebooting the MDS and a MON instance at the same time is a 
> significant factor.
>
>
>
> Our mds_standby* configuration is set as follows:
>
>
>
>     "mon_force_standby_active": "true",
>
>     "mds_standby_for_fscid": "-1",
>
>     "mds_standby_for_name": "",
>
>     "mds_standby_for_rank": "0",
>
>     "mds_standby_replay": "true",
>
>
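> As a rough sketch only (the section name assumes the standby daemon's id is 
> dub-sitv-ceph-01, matching its host name), the equivalent ceph.conf stanza 
> would be something like:
> 
>     [mds.dub-sitv-ceph-01]
>         # follow rank 0 and continuously replay its journal
>         mds standby for rank = 0
>         mds standby replay = true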
>
> The cluster status is as follows:
>
>
>
> cluster:
>
>     id:     f774b9b2-d514-40d9-85ab-d0389724b6c0
>
>     health: HEALTH_OK
>
>
>
>   services:
>
>     mon: 3 daemons, quorum dub-sitv-ceph-03,dub-sitv-ceph-04,dub-sitv-ceph-05
>
>     mgr: dub-sitv-ceph-04(active), standbys: dub-sitv-ceph-03, 
> dub-sitv-ceph-05
>
>     mds: cephfs-1/1/1 up  {0=dub-sitv-ceph-02=up:active}, 1 up:standby-replay
>
>     osd: 4 osds: 4 up, 4 in
>
>
>
>   data:
>
>     pools:   2 pools, 200 pgs
>
>     objects: 554  objects, 980 MiB
>
>     usage:   7.9 GiB used, 1.9 TiB / 2.0 TiB avail
>
>     pgs:     200 active+clean
>
>
>
>   io:
>
>     client:   1.5 MiB/s rd, 810 KiB/s wr, 286 op/s rd, 218 op/s wr
>
>
>
> Hope someone can help!
>
> William Lawton
>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
