You've probably run into http://tracker.ceph.com/issues/16010 — do you have very large directories? (Or perhaps just a whole bunch of unlinked files which the MDS hasn't managed to trim yet?)
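If it is a stray/trim backlog or an oversized directory, something like the following might confirm it (a sketch only; substitute your real MDS daemon name for `cephmon1`, and note I'm not certain the counter names match 10.2.x exactly — check the `perf dump` output on your version):

```shell
# How many unlinked-but-untrimmed (stray) inodes the MDS is holding;
# a very large or non-decreasing number suggests the purge backlog.
ceph daemon mds.cephmon1 perf dump | grep -i stray

# Look for unusually large directory fragments in the metadata pool.
# Dirfrag objects store one omap key per dentry, so a huge omap key
# count on one object means a huge directory. The 100000 threshold
# here is an arbitrary illustration, not a tuned value.
rados -p cephfs_metadata ls | while read -r obj; do
  n=$(rados -p cephfs_metadata listomapkeys "$obj" 2>/dev/null | wc -l)
  [ "$n" -gt 100000 ] && echo "$obj: $n omap keys"
done
```

The second loop can take a long time on a large metadata pool; since your debug log points at pg 4.22e specifically, you could also limit the scan to objects in that pg.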
On Tue, Sep 19, 2017 at 11:51 AM Christian Salzmann-Jäckel <[email protected]> wrote:
> Hi,
>
> we run cephfs (10.2.9 on Debian jessie; 108 OSDs on 9 nodes) as scratch
> filesystem for a HPC cluster using IPoIB interconnect with kernel client
> (Debian backports kernel version 4.9.30).
>
> Our clients started blocking on file system access.
> Logs show 'mds0: Behind on trimming' and slow requests to one osd (osd.049).
> Replacing the disk of osd.049 didn't show any effect. Cluster health is ok.
>
> 'ceph daemon mds.cephmon1 dump_ops_in_flight' shows ops from client
> sessions which are no longer present according to
> 'ceph daemon mds.cephmon1 session ls'.
>
> We observe traffic of ~200 Mbps on the mds node and this OSD (osd.049).
> Stopping the mds process ends the traffic (of course).
> Stopping osd.049 shifts traffic to the next OSD (osd.095).
> ceph logs show 'slow requests' even after stopping almost all clients.
>
> Debug log on osd.049 shows zillions of lines for a single pg (4.22e) of the
> cephfs_metadata pool, which resides on OSDs [49, 95, 9]:
>
> 2017-09-19 12:20:08.535383 7fd6b98c3700 20 osd.49 pg_epoch: 240725
> pg[4.22e( v 240141'1432046 (239363'1429042,240141'1432046] local-les=240073
> n=4848 ec=451 les/c/f 240073/240073/0 239916/240072/240072) [49,95,9] r=0
> lpr=240072 crt=240129'1432044 lcod 240130'1432045 mlcod 240130'1432045
> active+clean] Found key .chunk_4761369_head
>
> Is there anything we can do to get the mds back into operation?
>
> ciao
> Christian
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com