You've probably run into http://tracker.ceph.com/issues/16010 — do you
have very large directories? (Or perhaps just a whole bunch of unlinked
files which the MDS hasn't managed to trim yet?)
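One way to check whether that's the case is to look at the MDS perf counters via the admin socket. This is only a sketch: the daemon name `mds.cephmon1` is taken from the commands you quoted below, and the exact counter names assume Jewel (10.2.x), where the stray counts live under the `mds_cache` section:

```shell
# Count of stray (unlinked but not yet purged) inodes — a large or
# growing num_strays suggests the MDS is struggling to purge.
ceph daemon mds.cephmon1 perf dump mds_cache | grep -i stray

# Journal trimming state: if num_segments stays far above
# mds_log_max_segments, the MDS is "behind on trimming".
ceph daemon mds.cephmon1 perf dump mds_log | grep -i segment
ceph daemon mds.cephmon1 config get mds_log_max_segments
```

If the stray count is huge, that matches the tracker issue above.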

On Tue, Sep 19, 2017 at 11:51 AM Christian Salzmann-Jäckel <
[email protected]> wrote:

> Hi,
>
> we run cephfs (10.2.9 on Debian jessie; 108 OSDs on 9 nodes) as the scratch
> filesystem for an HPC cluster, using IPoIB interconnect with the kernel client
> (Debian backports kernel version 4.9.30).
>
> Our clients started blocking on file system access.
> Logs show 'mds0: Behind on trimming' and slow requests to one OSD
> (osd.049).
> Replacing the disk of osd.049 didn't have any effect. Cluster health is OK.
>
> 'ceph daemon mds.cephmon1 dump_ops_in_flight' shows ops from client
> sessions which are no longer present according to 'ceph daemon mds.cephmon1
> session ls'.
>
> We observe traffic of ~200 Mbps on the mds node and this OSD (osd.049).
> Stopping the mds process ends the traffic (of course).
> Stopping osd.049 shifts traffic to the next OSD (osd.095).
> ceph logs show 'slow requests' even after stopping almost all clients.
>
> Debug logs on osd.049 show zillions of lines for a single PG (4.22e) of the
> cephfs_metadata pool, which resides on OSDs [49, 95, 9].
>
> 2017-09-19 12:20:08.535383 7fd6b98c3700 20 osd.49 pg_epoch: 240725
> pg[4.22e( v 240141'1432046 (239363'1429042,240141'1432046] local-les=240073
> n=4848 ec=451 les/c/f 240073/240073/0 239916/240072/240072) [49,95,9] r=0
> lpr=240072 crt=240129'1432044 lcod 240130'1432045 mlcod 240130'1432045
> active+clean] Found key .chunk_4761369_head
>
> Is there anything we can do to get the mds back into operation?
>
> ciao
> Christian
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>