Hi all,

I have a Ceph cluster running Luminous 12.2.5. In the cluster, we configured CephFS with two MDS servers: ceph-mds-test04 is active and ceph-mds-test05 is standby. Here is the MDS configuration:
> [mds]
> mds_cache_size = 1000000
> mds_cache_memory_limit = 42949672960
> mds_standby_replay = true
> mds_beacon_grace = 300

I created 100 million files and wanted to find out how long a failover takes at that scale. The first attempt to fail over to ceph-mds-test05 failed because of a timeout:

> ceph mds fail ceph-mds-test04

Here is the log:

> 2018-12-18 10:58:38.164369 7fd696a9d700 1 mds.0.6382 handle_mds_map i am now mds.0.6382
> 2018-12-18 10:58:38.164374 7fd696a9d700 1 mds.0.6382 handle_mds_map state change up:reconnect --> up:rejoin
> 2018-12-18 10:58:38.164394 7fd696a9d700 1 mds.0.6382 rejoin_start
> 2018-12-18 11:03:40.583521 7fd697a9f700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 300
> 2018-12-18 11:03:41.490589 7fd693a97700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 300
> 2018-12-18 11:03:45.490645 7fd693a97700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 300
> 2018-12-18 11:03:45.583601 7fd697a9f700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 300
> 2018-12-18 11:03:49.490687 7fd693a97700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 300
> 2018-12-18 11:03:50.583665 7fd697a9f700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 300
> 2018-12-18 11:03:53.490744 7fd693a97700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 300
> 2018-12-18 11:03:54.996326 7fd696a9d700 1 mds.0.6382 rejoin_joint_start
> 2018-12-18 11:03:55.001907 7fd694298700 1 heartbeat_map reset_timeout 'MDSRank' had timed out after 300
> 2018-12-18 11:03:55.002064 7fd696a9d700 0 mds.beacon.zw01-data-hadoop-ceph-test05 handle_mds_beacon no longer laggy
> 2018-12-18 11:03:56.767320 7fd69aa15700 0 -- 10.130.212.14:6800/543171573 >> 10.130.213.8:0/2865474356 conn(0x7fd6b2680000 :6800 s=STATE_OPEN pgs=13123 cs=1 l=0).fault server, going to standby
> 2018-12-18 11:03:56.798865 7fd696a9d700 1 mds.zw01-data-hadoop-ceph-test05 map removed me (mds.-1 gid:824096) from cluster due to lost contact; respawning
> 2018-12-18 11:03:56.798874 7fd696a9d700 1 mds.zw01-data-hadoop-ceph-test05 respawn

But when I triggered the failover a second time, it finished immediately:

> 2018-12-18 11:11:37.704956 7f58061aa700 1 mds.0.6394 reconnect_done
> 2018-12-18 11:11:38.402853 7f58061aa700 1 mds.0.6394 handle_mds_map i am now mds.0.6394
> 2018-12-18 11:11:38.402856 7f58061aa700 1 mds.0.6394 handle_mds_map state change up:reconnect --> up:rejoin
> 2018-12-18 11:11:38.402860 7f58061aa700 1 mds.0.6394 rejoin_start
> 2018-12-18 11:11:38.405550 7f58061aa700 1 mds.0.6394 rejoin_joint_start
> 2018-12-18 11:11:38.430299 7f58061aa700 1 mds.0.6394 rejoin_done
> 2018-12-18 11:11:39.486981 7f58061aa700 1 mds.0.6394 handle_mds_map i am now mds.0.6394
> 2018-12-18 11:11:39.486984 7f58061aa700 1 mds.0.6394 handle_mds_map state change up:rejoin --> up:active
> 2018-12-18 11:11:39.486990 7f58061aa700 1 mds.0.6394 recovery_done -- successful recovery!
> 2018-12-18 11:11:39.487131 7f58061aa700 1 mds.0.6394 active_start
> 2018-12-18 11:11:39.496333 7f58061aa700 1 mds.0.6394 cluster recovered.

Could someone explain the difference? Thanks a lot!
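
To put numbers on the two runs, the gap between rejoin_start and rejoin_joint_start can be computed from the timestamps in the logs above. This is only a rough sketch: the helper names parse_ts and phase_gap are made up for illustration, the four log lines are copied from the excerpts, and the 300 is simply the mds_beacon_grace value from my config, used here as a point of comparison.

```python
from datetime import datetime

def parse_ts(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.ffffff' timestamp of an MDS log line."""
    date, time = line.split()[:2]
    return datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S.%f")

def phase_gap(lines, start_marker, end_marker):
    """Seconds between the first line ending in start_marker and the first ending in end_marker."""
    start = end = None
    for line in lines:
        last_token = line.split()[-1]
        if start is None and last_token == start_marker:
            start = parse_ts(line)
        elif start is not None and last_token == end_marker:
            end = parse_ts(line)
            break
    if start is None or end is None:
        raise ValueError("markers not found in log lines")
    return (end - start).total_seconds()

# Lines copied from the two failover attempts quoted in this post.
first_attempt = [
    "2018-12-18 10:58:38.164394 7fd696a9d700 1 mds.0.6382 rejoin_start",
    "2018-12-18 11:03:54.996326 7fd696a9d700 1 mds.0.6382 rejoin_joint_start",
]
second_attempt = [
    "2018-12-18 11:11:38.402860 7f58061aa700 1 mds.0.6394 rejoin_start",
    "2018-12-18 11:11:38.405550 7f58061aa700 1 mds.0.6394 rejoin_joint_start",
]

BEACON_GRACE = 300  # seconds, the mds_beacon_grace value from the [mds] section

first_gap = phase_gap(first_attempt, "rejoin_start", "rejoin_joint_start")
second_gap = phase_gap(second_attempt, "rejoin_start", "rejoin_joint_start")

print(f"first attempt:  {first_gap:.3f} s (over grace: {first_gap > BEACON_GRACE})")
print(f"second attempt: {second_gap:.6f} s (over grace: {second_gap > BEACON_GRACE})")
```

On these excerpts the first attempt spends about 316.8 s between the two markers, which is longer than the 300 s grace value, while the second attempt spends about 2.7 ms.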
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com