Thanks for the tips, John. I'll increase the debug level as suggested.

On 25 Feb 2018 20:56, "John Spray" <[email protected]> wrote:

> On Sat, Feb 24, 2018 at 10:13 AM, David C <[email protected]> wrote:
> > Hi All
> >
> > I had an MDS go down on a 12.2.1 cluster. The standby took over, but I
> > don't know what caused the issue. Scrubs are scheduled to start at 23:00
> > on this cluster, but the problem appears to have started a minute before.
> >
> > Can anyone help me diagnose this, please? Here's the relevant bit from
> > the MDS log:
>
> The messages about the heartbeat map not being healthy are a sign that
> somewhere in the MDS a thread is getting stuck and not letting others
> get in there to do work.  The daemon responds to that by stopping
> sending beacons to the monitors, who in turn blacklist the misbehaving
> MDS daemon.
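
As a rough illustration of the chain described above (stuck thread, unhealthy heartbeat map, skipped beacons, blacklist), here is a toy discrete-time model. It is a sketch and not Ceph's actual code: `simulate` and its parameters are hypothetical names, the 4 second beacon interval is taken from the spacing of the beacon thread's messages in the log, the 15 second heartbeat grace from "timed out after 15", and the monitor-side grace of 15 seconds is purely an assumption.

```python
def simulate(stuck_at, duration=120, beacon_interval=4,
             heartbeat_grace=15, monitor_grace=15):
    """Toy model: one tick per second, all times in seconds.

    The worker thread refreshes its heartbeat every tick until `stuck_at`.
    The beacon thread wakes every `beacon_interval` ticks and only sends a
    beacon if the heartbeat is still within `heartbeat_grace`.
    The monitor blacklists the daemon once it has gone more than
    `monitor_grace` seconds without receiving a beacon.
    """
    last_heartbeat = 0
    last_beacon = 0
    first_skipped = None
    for now in range(duration):
        if now < stuck_at:
            last_heartbeat = now          # worker is still making progress
        if now % beacon_interval == 0:    # beacon thread wakes up
            if now - last_heartbeat <= heartbeat_grace:
                last_beacon = now         # healthy: beacon reaches the monitor
            elif first_skipped is None:
                first_skipped = now       # "skipping beacon, heartbeat map not healthy"
        if now - last_beacon > monitor_grace:
            return {"first_skipped_beacon": first_skipped, "blacklisted_at": now}
    return {"first_skipped_beacon": first_skipped, "blacklisted_at": None}


if __name__ == "__main__":
    # Worker thread gets stuck 10 seconds into the run.
    print(simulate(stuck_at=10))
```

The point of the model is only the ordering of events: the stall comes first, the skipped beacons follow after the heartbeat grace expires, and the blacklist follows after the monitors have gone without beacons for their own grace period.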
>
> You'll have a better shot at working out what got jammed up if "debug
> mds" is set to something like 7, or if this is happening predictably
> at 22:59:30 you could even attach gdb to the running process and grab
> a backtrace of all threads.
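
To check whether the stall really does start at a predictable time each night, something like the following could be run over the MDS log. This is a minimal sketch under stated assumptions: it expects one log entry per line as in the log file itself (the entries in this email are wrapped across two lines), it treats the "15" in "had timed out after 15" as a grace period in seconds, and `estimate_stall_start` is a hypothetical helper name. The estimate is only as good as the excerpt: if earlier timeout messages exist, the stall began earlier.

```python
import re
from datetime import datetime, timedelta

# Matches entries such as:
# 2018-02-23 22:59:02.702284 7f26e0612700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) "
    r"(?P<thread>[0-9a-f]+)\s+-?\d+ heartbeat_map is_healthy "
    r"'(?P<worker>[^']+)' had timed out after (?P<grace>\d+)"
)


def estimate_stall_start(lines):
    """Return (worker, first_report, estimated_stall_start), or None.

    first_report is the earliest 'had timed out' timestamp found;
    estimated_stall_start is that timestamp minus the reported grace,
    i.e. roughly the latest moment the worker could have last checked in.
    """
    earliest = None
    for line in lines:
        m = TIMEOUT_RE.match(line.strip())
        if not m:
            continue
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
        if earliest is None or ts < earliest[1]:
            grace = timedelta(seconds=int(m.group("grace")))
            earliest = (m.group("worker"), ts, ts - grace)
    return earliest


if __name__ == "__main__":
    sample = [
        "2018-02-23 22:59:02.702284 7f26e0612700  1 heartbeat_map is_healthy "
        "'MDSRank' had timed out after 15",
        "2018-02-23 22:59:02.702334 7f26e0612700  1 mds.beacon.mdshostname _send "
        "skipping beacon, heartbeat map not healthy",
    ]
    print(estimate_stall_start(sample))
```

Running it against the logs from several nights would show whether the estimated start lines up with the same wall-clock time, which is when attaching gdb would be most useful.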
>
> John
>
> > 2018-02-23 22:59:02.702284 7f26e0612700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:02.702334 7f26e0612700  1 mds.beacon.mdshostname _send
> > skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:02.959726 7f26e461a700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:06.702354 7f26e0612700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:06.702366 7f26e0612700  1 mds.beacon.mdshostname _send
> > skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:07.959804 7f26e461a700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:10.702421 7f26e0612700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:10.702434 7f26e0612700  1 mds.beacon.mdshostname _send
> > skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:12.959876 7f26e461a700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:14.702522 7f26e0612700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:14.702535 7f26e0612700  1 mds.beacon.mdshostname _send
> > skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:17.959985 7f26e461a700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:18.702645 7f26e0612700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:18.702670 7f26e0612700  1 mds.beacon.mdshostname _send
> > skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:22.702742 7f26e0612700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:22.702754 7f26e0612700  1 mds.beacon.mdshostname _send
> > skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:22.960063 7f26e461a700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:26.702841 7f26e0612700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:26.702854 7f26e0612700  1 mds.beacon.mdshostname _send
> > skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:27.960141 7f26e461a700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:30.702903 7f26e0612700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:30.702915 7f26e0612700  1 mds.beacon.mdshostname _send
> > skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:32.960228 7f26e461a700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:34.703001 7f26e0612700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:34.703014 7f26e0612700  1 mds.beacon.mdshostname _send
> > skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:37.960301 7f26e461a700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:38.703063 7f26e0612700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:38.703075 7f26e0612700  1 mds.beacon.mdshostname _send
> > skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:42.703147 7f26e0612700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:42.703160 7f26e0612700  1 mds.beacon.mdshostname _send
> > skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:42.960414 7f26e461a700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:46.703209 7f26e0612700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:46.703222 7f26e0612700  1 mds.beacon.mdshostname _send
> > skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:47.960487 7f26e461a700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:50.703305 7f26e0612700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:50.703319 7f26e0612700  1 mds.beacon.mdshostname _send
> > skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:52.960569 7f26e461a700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:54.703365 7f26e0612700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:54.703377 7f26e0612700  1 mds.beacon.mdshostname _send
> > skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:57.960642 7f26e461a700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:58.703447 7f26e0612700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:58.703461 7f26e0612700  1 mds.beacon.mdshostname _send
> > skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:59.717665 7f26e0e13700  1 heartbeat_map reset_timeout
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:59.719194 7f26dd60c700 -1 mds.0.journaler.mdlog(rw)
> > _finish_write_head got (108) Cannot send after transport endpoint shutdown
> > 2018-02-23 22:59:59.719215 7f26dd60c700 -1 mds.0.journaler.mdlog(rw)
> > handle_write_error (108) Cannot send after transport endpoint shutdown
> > 2018-02-23 22:59:59.719223 7f26dd60c700 -1 mds.0.journaler.mdlog(rw)
> > _finish_flush got (108) Cannot send after transport endpoint shutdown
> > 2018-02-23 22:59:59.719228 7f26dd60c700 -1 mds.0.journaler.mdlog(rw)
> > handle_write_error (108) Cannot send after transport endpoint shutdown
> > 2018-02-23 22:59:59.719232 7f26dd60c700 -1 mds.0.journaler.mdlog(rw)
> > handle_write_error: multiple write errors, handler already called
> > 2018-02-23 22:59:59.719240 7f26dd60c700 -1 MDSIOContextBase: blacklisted!
> > Restarting...
> >
> > _______________________________________________
> > ceph-users mailing list
> > [email protected]
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
