Hi All

We had a faulty OSD that was flapping (going up and down) for a few hours
until Ceph marked it out. CephFS itself stayed accessible during this time;
however, for about 10 minutes all of the NFS processes (kernel NFSv3) on a
server exporting CephFS were hung, locking up all of the NFS clients. The
cluster was healthy before this OSD started failing. I'm trying to understand
whether this is expected behaviour, a bug, or something else. Any insights
would be appreciated.
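
In case it's useful, these are the commands I've been using to piece the
timeline together after the fact (osd.12 and the log path are placeholders
for our actual OSD id and cluster log location):

ceph health detail    # slow request / blocked ops summary while it's happening
ceph osd perf         # per-OSD commit/apply latency, to spot a struggling disk
zgrep -E 'osd\.12|slow request' /var/log/ceph/ceph.log*   # flap / mark-out timeline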

MDS: active/passive (one active, one standby)
Ceph: Jewel 10.2.2
CephFS client: kernel client, kernel 3.10.0-514.6.1.el7.x86_64
CephFS mount options: (rw,relatime,name=admin,secret=<hidden>,acl)
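
For completeness, the NFS server mounts CephFS with the kernel client,
roughly like this (the monitor address and mount point are placeholders):

mount -t ceph 192.168.0.1:6789:/ /export/cephfs -o name=admin,secret=<hidden>,acl

/export/cephfs is then exported to the clients via the kernel NFS server (NFSv3).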

I can see some slow requests in the MDS log during the time the NFS
processes were hung, some for setattr calls:

2017-06-15 04:29:37.081175 7f889401f700  0 log_channel(cluster) log [WRN] :
slow request 60.974528 seconds old, received at 2017-06-15 04:28:36.106598:
client_request(client.2622511:116375892 setattr size=0 #100025b3554
2017-06-15 04:28:36.104928) currently acquired locks

and some for getattr:

2017-06-15 04:29:42.081224 7f889401f700  0 log_channel(cluster) log [WRN] :
slow request 32.225883 seconds old, received at 2017-06-15 04:29:09.855302:
client_request(client.2622511:116380541 getattr pAsLsXsFs #100025b4d37
2017-06-15 04:29:09.853772) currently failed to rdlock, waiting

And a "client not responding to mclientcaps revoke" warning:

2017-06-15 04:31:12.084561 7f889401f700  0 log_channel(cluster) log [WRN] :
client.2344872 isn't responding to mclientcaps(revoke), ino 100025b4d37
pending pAsxLsXsxFcb issued pAsxLsXsxFsxcrwb, sent 122.229172 seconds ago
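
I didn't manage to capture much while it was happening; next time I'll try to
grab the in-flight requests from both sides with something like this (mds1 is
a placeholder for the active MDS name, and the debugfs paths assume debugfs is
mounted at /sys/kernel/debug):

# on the active MDS host, via the admin socket
ceph daemon mds.mds1 dump_ops_in_flight   # the slow client_requests and what they're stuck on
ceph daemon mds.mds1 objecter_requests    # OSD ops the MDS itself has outstanding
ceph daemon mds.mds1 session ls           # client sessions, including the NFS server's

# on the NFS server (kernel CephFS client)
cat /sys/kernel/debug/ceph/*/mdsc   # MDS requests the client is waiting on
cat /sys/kernel/debug/ceph/*/osdc   # OSD reads/writes the client is waiting on
cat /sys/kernel/debug/ceph/*/caps   # summary of the caps the client currently holds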

These issues seemed to clear once the faulty OSD was marked out.
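
One thing I'm wondering about for next time is simply marking a flapping OSD
out by hand rather than waiting for the mons to do it, along the lines of
(osd.12 is again a placeholder):

ceph osd out 12              # take it out of data placement straight away
systemctl stop ceph-osd@12   # and/or stop the daemon so it can't keep flapping up and down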

In general I have noticed that the nfsd processes exporting CephFS spend a
lot of time in 'D' state with WCHAN showing 'lock_page', compared with an
NFS server exporting a local file system. NFS performance also hasn't been
great for small reads/writes, particularly writes with the default sync
export option, so I've had to export with async for the time being. I
haven't had a chance to troubleshoot this in any depth yet; I'm just
mentioning it in case it's relevant.
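
For reference, the relevant /etc/exports line currently looks like this
(the client spec and fsid value are placeholders for what we actually use):

/export/cephfs  *(rw,async,no_subtree_check,fsid=101)

and this is how I've been checking the nfsd threads and reloading exports:

ps -eo pid,stat,wchan:30,comm | grep -w nfsd   # 'D' state with wchan 'lock_page' is what I keep seeing
exportfs -ra                                   # reapply exports after switching sync/async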

Thanks,
David