Thanks for the tips, John. I'll increase the debug level as suggested.

On 25 Feb 2018 20:56, "John Spray" <[email protected]> wrote:
> On Sat, Feb 24, 2018 at 10:13 AM, David C <[email protected]> wrote:
> > Hi All
> >
> > I had an MDS go down on a 12.2.1 cluster, the standby took over but I don't
> > know what caused the issue. Scrubs are scheduled to start at 23:00 on this
> > cluster but this appears to have started a minute before.
> >
> > Can anyone help me with diagnosing this please. Here's the relevant bit from
> > the MDS log:
>
> The messages about the heartbeat map not being healthy are a sign that
> somewhere in the MDS a thread is getting stuck and not letting others
> get in there to do work. The daemon responds to that by stopping
> sending beacons to the monitors, who in turn blacklist the misbehaving
> MDS daemon.
>
> You'll have a better shot at working out what got jammed up if "debug
> mds" is set to something like 7, or if this is happening predictably
> at 22:59:30 you could even attach gdb to the running process and grab
> a backtrace of all threads.
>
> John
>
> > 2018-02-23 22:59:02.702284 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:02.702334 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:02.959726 7f26e461a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:06.702354 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:06.702366 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:07.959804 7f26e461a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:10.702421 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:10.702434 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:12.959876 7f26e461a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:14.702522 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:14.702535 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:17.959985 7f26e461a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:18.702645 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:18.702670 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:22.702742 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:22.702754 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:22.960063 7f26e461a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:26.702841 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:26.702854 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:27.960141 7f26e461a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:30.702903 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:30.702915 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:32.960228 7f26e461a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:34.703001 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:34.703014 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:37.960301 7f26e461a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:38.703063 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:38.703075 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:42.703147 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:42.703160 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:42.960414 7f26e461a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:46.703209 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:46.703222 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:47.960487 7f26e461a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:50.703305 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:50.703319 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:52.960569 7f26e461a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:54.703365 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:54.703377 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:57.960642 7f26e461a700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:58.703447 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:58.703461 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:59.717665 7f26e0e13700 1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:59.719194 7f26dd60c700 -1 mds.0.journaler.mdlog(rw) _finish_write_head got (108) Cannot send after transport endpoint shutdown
> > 2018-02-23 22:59:59.719215 7f26dd60c700 -1 mds.0.journaler.mdlog(rw) handle_write_error (108) Cannot send after transport endpoint shutdown
> > 2018-02-23 22:59:59.719223 7f26dd60c700 -1 mds.0.journaler.mdlog(rw) _finish_flush got (108) Cannot send after transport endpoint shutdown
> > 2018-02-23 22:59:59.719228 7f26dd60c700 -1 mds.0.journaler.mdlog(rw) handle_write_error (108) Cannot send after transport endpoint shutdown
> > 2018-02-23 22:59:59.719232 7f26dd60c700 -1 mds.0.journaler.mdlog(rw) handle_write_error: multiple write errors, handler already called
> > 2018-02-23 22:59:59.719240 7f26dd60c700 -1 MDSIOContextBase: blacklisted! Restarting...
> >
> > _______________________________________________
> > ceph-users mailing list
> > [email protected]
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
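
To get a feel for how long the MDS was stuck before it was blacklisted, the heartbeat messages in a log like the one quoted above can be tallied with a small script. This is only a rough sketch in Python and not something from the thread: the function name summarize, the SAMPLE lines, and the regex are illustrative, and the regex assumes each log entry sits on a single line in the "date time thread-id level message" layout shown above.

```python
import re
from datetime import datetime

# Assumed layout (taken from the quoted log, one entry per line):
#   <date> <time> <thread id> <level> <message>
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) ([0-9a-f]+) +(-?\d+) (.*)$"
)


def summarize(lines):
    """Count heartbeat timeouts and skipped beacons, and measure their time span."""
    timeouts = 0
    skipped = 0
    first = last = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a log entry in the assumed layout
        stamp = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        msg = m.group(4)
        if "heartbeat_map is_healthy" in msg and "had timed out" in msg:
            timeouts += 1
        elif "skipping beacon" in msg:
            skipped += 1
        else:
            continue  # some other message; ignore it for this tally
        first = stamp if first is None else min(first, stamp)
        last = stamp if last is None else max(last, stamp)
    span = (last - first).total_seconds() if first is not None else 0.0
    return {
        "timeouts": timeouts,
        "skipped_beacons": skipped,
        "first": first,
        "last": last,
        "span_seconds": span,
    }


# Illustrative sample: four entries copied from the log above, unwrapped.
SAMPLE = [
    "2018-02-23 22:59:02.702284 7f26e0612700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15",
    "2018-02-23 22:59:02.702334 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy",
    "2018-02-23 22:59:58.703461 7f26e0612700 1 mds.beacon.mdshostname _send skipping beacon, heartbeat map not healthy",
    "2018-02-23 22:59:59.719240 7f26dd60c700 -1 MDSIOContextBase: blacklisted! Restarting...",
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Feeding it the whole MDS log file (for example via open(path) as the lines argument) would show when the first timeout appeared, which may help line the stall up with whatever else was running on the cluster at the time.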
