Thanks, that helps. Looks like the problem is that the MDS is not
automatically trimming its cache fast enough. Please try bumping
mds_cache_trim_threshold:

bin/ceph config set mds mds_cache_trim_threshold 512K
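
If you want to confirm the new value is active, you can query it back
(a quick sketch; the bin/ prefix assumes a vstart-style dev environment,
and <id> is a placeholder for your actual MDS daemon name):

bin/ceph config get mds mds_cache_trim_threshold
bin/ceph daemon mds.<id> config get mds_cache_trim_threshold

The first reads the centralized config database on the mons, the second
asks a running MDS directly over its admin socket.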

That did help, somewhat. I removed the aggressive recall settings I had set before and set only this option instead. The cache size seems quite stable now, although it is still growing in the long run (but at least not strictly monotonically).

However, my client processes are now basically in a constant I/O wait state and the CephFS is slow for everybody. After I restarted the copy job, I got around 4k reqs/s, which then dropped to 100 reqs/s with everybody waiting their turn. So yes, it does seem to help, but it increases latency by an order of magnitude.
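
(For reference, one way to watch that per-MDS request rate, not
necessarily the exact tooling used here, is the ACTIVITY column of the
status output:

ceph fs status
watch -n 1 'ceph fs status'

The active MDS shows a "Reqs: N /s" figure there.)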

As always, it would be great if these options were documented somewhere. Google has like five results, one of them being this thread. ;-)


Increase it further if it's not aggressive enough. Please let us know
if that helps.
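
For example (1M here is just an illustrative next step, not a tuned
recommendation), and you can watch how the cache reacts via the admin
socket, where <id> again stands for the actual MDS daemon name:

bin/ceph config set mds mds_cache_trim_threshold 1M
bin/ceph daemon mds.<id> cache status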

It shouldn't be necessary to do this, so I'll make a tracker ticket
once we confirm that's the issue.
