Re: [ceph-users] cephfs/ceph-fuse: mds0: Client XXX:XXX failing to respond to capability release

Dennis Kramer (DT) Wed, 14 Sep 2016 05:56:53 -0700

Hi Burkhard,

Thank you for your reply, see inline:


On Wed, 14 Sep 2016, Burkhard Linke wrote:

Hi,


On 09/14/2016 12:43 PM, Dennis Kramer (DT) wrote:
Hi Goncalo,
Thank you. Yes, I have seen that thread, but I have no near-full OSDs and my mds cache size is pretty high.
You can use the daemon socket on the mds server to get an overview of the current cache state:
ceph daemon mds.XXX perf dump
The message itself indicates that the mds is in fact trying to convince clients to release capabilities, probably because it is running out of cache.

My cache is set to mds_cache_size = 15000000, but you are right, it seems the complete cache is used. That shouldn't be a real problem if the clients can release the caps in time. Correct me if I'm wrong, but the cache_size is pretty high compared to the default (100k). I will raise the mds_cache_size a bit and see if it helps.
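
To put a number on "the complete cache is used", the perf dump output can be compared against the configured limit. The following is only a minimal sketch: it assumes the JSON has an "mds" section with "inodes" and "inode_max" counters (as I recall from Jewel-era output; check your own dump), and the sample values are made up for illustration.

```python
import json

def cache_utilization(perf_dump_json):
    """Return (inodes, inode_max, ratio) from MDS perf dump JSON.

    Assumes counters named "inodes" and "inode_max" under the "mds"
    section; verify these names against your own perf dump output.
    """
    mds = json.loads(perf_dump_json)["mds"]
    inodes = mds["inodes"]
    inode_max = mds["inode_max"]
    return inodes, inode_max, inodes / inode_max

# Made-up sample, shaped like the relevant part of a perf dump.
sample = '{"mds": {"inodes": 14250000, "inode_max": 15000000}}'
inodes, inode_max, ratio = cache_utilization(sample)
print(f"{inodes}/{inode_max} inodes cached ({ratio:.0%})")
```

In practice the JSON string would come from the output of "ceph daemon mds.XXX perf dump" on the mds host.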

The 'session ls' command on the daemon socket lists all current ceph clients and the number of capabilities for each client. Depending on your workload / applications you might be surprised how many capabilities are assigned to individual nodes...
From the client's point of view, the error means that there's either a bug in the client, or an application is keeping a large number of files open (e.g. do you run mlocate on the clients?)
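
A rough way to spot the heaviest capability holders is to sort the 'session ls' output by cap count. This is a sketch under assumptions: it expects a JSON list of sessions with "id", "num_caps" and a "client_metadata" object containing "hostname" (field names as I remember them from Jewel; missing fields are tolerated), and the sample sessions are invented.

```python
import json

def top_cap_holders(session_ls_json, limit=5):
    """Return up to `limit` (id, hostname, num_caps) tuples, most caps first.

    Field names are assumptions about the 'session ls' JSON layout;
    missing fields fall back to None, "?" or 0.
    """
    sessions = json.loads(session_ls_json)
    ranked = sorted(sessions, key=lambda s: s.get("num_caps", 0), reverse=True)
    return [
        (s.get("id"),
         s.get("client_metadata", {}).get("hostname", "?"),
         s.get("num_caps", 0))
        for s in ranked[:limit]
    ]

# Invented sample sessions, for illustration only.
sample = json.dumps([
    {"id": 101, "num_caps": 1200, "client_metadata": {"hostname": "nodeA"}},
    {"id": 102, "num_caps": 950000, "client_metadata": {"hostname": "nodeB"}},
    {"id": 103, "num_caps": 40},
])
for client_id, host, caps in top_cap_holders(sample):
    print(f"client {client_id} on {host}: {caps} caps")
```

The real input would be the output of "ceph daemon mds.XXX session ls".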

I haven't had this issue when I was on hammer, and the number of clients hasn't changed. I have "ceph fuse.ceph fuse.ceph-fuse" in my PRUNEFS for updatedb, so it probably isn't mlocate which would cause this issue.

The only real difference is my upgrade to Jewel.

If you use the kernel based client, re-mounting won't help, since the internal state is kept the same (afaik). In case of the ceph-fuse client, the ugly way to get rid of the mount point is a lazy / forced umount and killing the ceph-fuse process if necessary. Processes with open file handles will complain afterwards.
Before using rude ways to terminate the client session, I would propose to look for rogue applications on the involved host. We had a number of problems with multithreaded applications and concurrent file access in the past (both with ceph-fuse from hammer and kernel based clients). lsof or other tools might help locating the application.
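
Where lsof is not at hand, a similar overview can be approximated by walking /proc on a Linux client. This is a minimal sketch, not a replacement for lsof: it only looks at open file descriptors (not working directories or memory maps), it needs root to see other users' processes, and the mount point "/mnt/cephfs" is a hypothetical path.

```python
import os

def processes_with_open_files(mount_point, proc_root="/proc"):
    """Map pid -> list of open file paths located under mount_point."""
    if not os.path.isdir(proc_root):
        return {}  # no procfs available (non-Linux system)
    # Trailing separator so "/mnt/cephfs" does not match "/mnt/cephfs2".
    prefix = os.path.join(os.path.realpath(mount_point), "")
    found = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or permission denied
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed in the meantime
            if target.startswith(prefix):
                found.setdefault(int(entry), []).append(target)
    return found

# Hypothetical mount point; adjust to the actual cephfs mount.
for pid, paths in sorted(processes_with_open_files("/mnt/cephfs").items()):
    print(f"pid {pid}: {len(paths)} open files")
```

A process holding an unusually large number of open files on the mount would be a first candidate to inspect.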

My cluster is back to HEALTH_OK; the involved host has been restarted by the user. But I will debug some more on the host when I see this issue again.

PS: For completeness, I stated that this issue was often seen in my current Jewel environment; I meant to say that it comes up sometimes (so not that often). But when I *do* have this issue, it blocks some I/O for clients as a consequence.

Regards,
Burkhard
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

