Quoting Patrick Donnelly ([email protected]):
> On Thu, Feb 28, 2019 at 12:49 PM Stefan Kooman <[email protected]> wrote:
> >
> > Dear list,
> >
> > After upgrading to 12.2.11 the MDSes are reporting slow metadata IOs
> > (MDS_SLOW_METADATA_IO). The metadata IOs would have been blocked for
> > more that 5 seconds. We have one active, and one active standby MDS. All
> > storage on SSD (Samsung PM863a / Intel DC4500). No other (OSD) slow ops
> > reported. The MDSes are underutilized, only a handful of active clients
> > and almost no load (fast hexacore CPU, 256 GB RAM, 20 Gb/s network). The
> > cluster is also far from busy.
> >
> > I've dumped ops in flight on the MDSes but all ops that are printed are
> > finished in a split second (duration: 0.000152), flag_point": "acquired
> > locks".
>
> I believe you're looking at the wrong "ops" dump. You want to check
> "objector_requests".
I've done that, but not much timing info in there, i.e.:
"ops": [
{
"tid": 416231,
"pg": "6.791ce4c5",
"osd": 31,
"object_id": "200.00024d0f",
"object_locator": "@6",
"target_object_id": "200.00024d0f",
"target_object_locator": "@6",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "79790s",
"attempts": 1,
"snapid": "head",
"snap_context": "0=[]",
"mtime": "2019-03-01 16:39:29.0.41847s",
"osd_ops": [
"write 798204~6249 [fadvise_dontneed]"
]
}
The MDS_SLOW_METADATA_IO warning has been added in commit
https://github.com/ceph/ceph/commit/0f735f40315448560fde049ed3ea019a7d30d868#diff-74784f821b1aae68f768965680914268
Tracker issue: http://tracker.ceph.com/issues/24879
PR: 23022
It uses the "osd_op_complaint_time" as the "complaint_time". We have set
the osd_op_complaint_time to 5 seconds.
The tracker issue hints that slow metadata would be caused by a slow OSD
(joural writes of MDS that take too long). However, we do not see any
slow mds ops (data), nor slow ops. We've changed the parameter
"mon_osd_warn_op_age" to 5 seconds as well so slow OSD ops would be
reported as well.
I do not think our OSDs are that slow, that often (~ at least every 15
minutes we would see that warning) ... so I really wonder what's going
on.
For now we've set the osd_op_complaint_time to "30" on the MDSes and do
not see any SLOW METADATA warnings anymore.
Anyone willing to set the osd_op_complaint_time to 5 seconds to see if
those messages will start popping up? ceph tell mds.* config set
osd_op_complaint_time 5 should do the trick ... you would need to be
running 12.2.11 and /or mimic 13.2.2.
Thanks,
Stefan
P.s. kudos to Wido den Hollander for helping us debug this.
--
| BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / [email protected]
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com