[ceph-users] MDS failover very slow the first time, but very fast at second time

Ch Wan Mon, 17 Dec 2018 19:49:21 -0800

Hi all, I have a Ceph cluster running Luminous 12.2.5.
In the cluster, we configured CephFS with two MDS servers:
ceph-mds-test04 is active and ceph-mds-test05 is standby.
Here is the MDS configuration:


> [mds]
> mds_cache_size = 1000000
> mds_cache_memory_limit = 42949672960
> mds_standby_replay = true
> mds_beacon_grace = 300
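
For reference, mds_cache_memory_limit is given in bytes, so the value above can be converted with a quick calculation. This is only arithmetic on the number in the config; the helper name bytes_to_gib is made up for illustration.

```python
def bytes_to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB (1 GiB = 2**30 bytes)."""
    return n_bytes / 2**30


# Value taken from the [mds] section above.
mds_cache_memory_limit = 42949672960

print(bytes_to_gib(mds_cache_memory_limit))  # 40.0
```

So the MDS cache is allowed to grow to 40 GiB in this setup.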



I created 100 million files and wanted to find out how long a failover
takes at that scale.
The first attempt to fail over to ceph-mds-test05 failed because of a
timeout:

> ceph mds fail ceph-mds-test04


Here is the log from the standby MDS:

> 2018-12-18 10:58:38.164369 7fd696a9d700  1 mds.0.6382 handle_mds_map i am now mds.0.6382
> 2018-12-18 10:58:38.164374 7fd696a9d700  1 mds.0.6382 handle_mds_map state change up:reconnect --> up:rejoin
> 2018-12-18 10:58:38.164394 7fd696a9d700  1 mds.0.6382 rejoin_start
> 2018-12-18 11:03:40.583521 7fd697a9f700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 300
> 2018-12-18 11:03:41.490589 7fd693a97700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 300
> 2018-12-18 11:03:45.490645 7fd693a97700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 300
> 2018-12-18 11:03:45.583601 7fd697a9f700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 300
> 2018-12-18 11:03:49.490687 7fd693a97700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 300
> 2018-12-18 11:03:50.583665 7fd697a9f700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 300
> 2018-12-18 11:03:53.490744 7fd693a97700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 300
> 2018-12-18 11:03:54.996326 7fd696a9d700  1 mds.0.6382 rejoin_joint_start
> 2018-12-18 11:03:55.001907 7fd694298700  1 heartbeat_map reset_timeout 'MDSRank' had timed out after 300
> 2018-12-18 11:03:55.002064 7fd696a9d700  0 mds.beacon.zw01-data-hadoop-ceph-test05 handle_mds_beacon no longer laggy
> 2018-12-18 11:03:56.767320 7fd69aa15700  0 -- 10.130.212.14:6800/543171573 >> 10.130.213.8:0/2865474356 conn(0x7fd6b2680000 :6800 s=STATE_OPEN pgs=13123 cs=1 l=0).fault server, going to standby
> 2018-12-18 11:03:56.798865 7fd696a9d700  1 mds.zw01-data-hadoop-ceph-test05 map removed me (mds.-1 gid:824096) from cluster due to lost contact; respawning
> 2018-12-18 11:03:56.798874 7fd696a9d700  1 mds.zw01-data-hadoop-ceph-test05 respawn
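
The gap between rejoin_start and rejoin_joint_start in this log can be compared with mds_beacon_grace directly. A minimal sketch, assuming the timestamp format shown in the log; the helper name seconds_between is hypothetical and not part of any Ceph tool.

```python
from datetime import datetime

# Timestamp layout used by the MDS log lines quoted above.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def seconds_between(start: str, end: str) -> float:
    """Return the elapsed seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# Timestamps copied from the first failover log.
rejoin_start = "2018-12-18 10:58:38.164394"
rejoin_joint_start = "2018-12-18 11:03:54.996326"
mds_beacon_grace = 300  # seconds, from the [mds] config

stall = seconds_between(rejoin_start, rejoin_joint_start)
print(f"{stall:.3f} s between rejoin_start and rejoin_joint_start")
print("exceeds mds_beacon_grace:", stall > mds_beacon_grace)
```

The MDS logged nothing but heartbeat timeouts for about 317 seconds between the two messages, which is longer than the 300 second grace and consistent with it being removed from the map.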


But when I triggered a failover a second time, it finished almost immediately:

> 2018-12-18 11:11:37.704956 7f58061aa700  1 mds.0.6394 reconnect_done
> 2018-12-18 11:11:38.402853 7f58061aa700  1 mds.0.6394 handle_mds_map i am now mds.0.6394
> 2018-12-18 11:11:38.402856 7f58061aa700  1 mds.0.6394 handle_mds_map state change up:reconnect --> up:rejoin
> 2018-12-18 11:11:38.402860 7f58061aa700  1 mds.0.6394 rejoin_start
> 2018-12-18 11:11:38.405550 7f58061aa700  1 mds.0.6394 rejoin_joint_start
> 2018-12-18 11:11:38.430299 7f58061aa700  1 mds.0.6394 rejoin_done
> 2018-12-18 11:11:39.486981 7f58061aa700  1 mds.0.6394 handle_mds_map i am now mds.0.6394
> 2018-12-18 11:11:39.486984 7f58061aa700  1 mds.0.6394 handle_mds_map state change up:rejoin --> up:active
> 2018-12-18 11:11:39.486990 7f58061aa700  1 mds.0.6394 recovery_done -- successful recovery!
> 2018-12-18 11:11:39.487131 7f58061aa700  1 mds.0.6394 active_start
> 2018-12-18 11:11:39.496333 7f58061aa700  1 mds.0.6394 cluster recovered.
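
To break the second failover down step by step, the recovery milestones can be pulled out of the log and timed pairwise. This is a sketch over a few lines copied from the log above (with wrapped lines joined); the regular expression and the function name step_durations are assumptions for illustration, not a Ceph utility.

```python
import re
from datetime import datetime

# Milestone lines copied from the second failover log.
LOG = """\
2018-12-18 11:11:37.704956 7f58061aa700  1 mds.0.6394 reconnect_done
2018-12-18 11:11:38.402860 7f58061aa700  1 mds.0.6394 rejoin_start
2018-12-18 11:11:38.405550 7f58061aa700  1 mds.0.6394 rejoin_joint_start
2018-12-18 11:11:38.430299 7f58061aa700  1 mds.0.6394 rejoin_done
2018-12-18 11:11:39.486990 7f58061aa700  1 mds.0.6394 recovery_done -- successful recovery!
"""

# timestamp, thread id, log level, "mds.<rank>.<incarnation>", event name
LINE = re.compile(
    r"^(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d+) \S+\s+\d+ mds\.\S+ (\w+)"
)


def step_durations(text):
    """Return (from_event, to_event, seconds) for consecutive log events."""
    events = []
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            when = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
            events.append((m.group(2), when))
    return [
        (a, b, (tb - ta).total_seconds())
        for (a, ta), (b, tb) in zip(events, events[1:])
    ]


for a, b, secs in step_durations(LOG):
    print(f"{a} -> {b}: {secs:.6f} s")
```

Here the rejoin_start to rejoin_joint_start step takes under 3 milliseconds, compared with more than five minutes in the first attempt.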


Could someone explain why the first failover was so slow and the second one so fast? Thanks a lot!

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
