Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes
I've been copying happily for days now (not very fast, but the MDSs were stable), but eventually the MDSs started flapping again due to large cache sizes (they are being killed after 11M inodes). I could work around the problem by temporarily increasing the cache size to allow them to rejoin, but it tells me that my settings do not fully solve the problem yet (unless perhaps I increase the trim threshold even further).

On 06.08.19 19:52, Janek Bevendorff wrote:
>> Your parallel rsync job is only getting 150 creates per second? What
>> was the previous throughput?
>
> I am actually not quite sure what the exact throughput was or is or what
> I can expect. It varies so much. I am copying from a 23GB file list that
> is split into 3000 chunks, which are then processed by 16-24 parallel
> rsync processes. I have copied 27 of 64TB so far (according to df -h),
> and to my taste it's taking a lot longer than it should. The main
> problem here is not that I'm trying to copy 64TB (a drop in the bucket);
> the problem is that it's 64TB in tiny, small, and medium-sized files.
> This whole MDS mess and several pauses and restarts in between have
> completely distorted my sense of how far into the process I actually am
> or how fast I should expect it to go. Right now it's starting again from
> the beginning, so I expect it'll be another day or so until it starts
> moving some real data again.
>
>> The cache size looks correct here.
>
> Yeah. The cache appears to be constant-size now. I am still getting an
> occasional "client failing to respond to cache pressure", but it goes
> away as fast as it came.
>
>> Try pinning if possible in each parallel rsync job.
>
> I was considering that, but couldn't come up with a feasible pinning
> strategy. We have all those files of very different sizes spread very
> unevenly across a handful of top-level directories. I get the impression
> that I couldn't do much (or any) better than the automatic balancer.
>> Here are tracker tickets to resolve the issues you encountered:
>>
>> https://tracker.ceph.com/issues/41140
>> https://tracker.ceph.com/issues/41141

Thanks a lot!
___
Ceph-users mailing list -- ceph-us...@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
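For reference, the temporary cache-size bump described above (to let a flapping MDS rejoin without being killed) can be done at runtime via the config subsystem. A minimal sketch; the 32 GiB value is an assumption for illustration, and the ~19 GiB restore value is the limit mentioned earlier in this thread:

```shell
# Temporarily raise the MDS cache memory limit so a rejoining MDS
# is not killed for exceeding it (example value; size to your RAM).
ceph config set mds mds_cache_memory_limit 34359738368   # 32 GiB

# Watch the MDSs come back up and the cluster return to HEALTH_OK.
ceph fs status

# Once the MDSs have rejoined, restore the previous limit.
ceph config set mds mds_cache_memory_limit 19327352832   # ~19 GiB
```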
Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes
> Your parallel rsync job is only getting 150 creates per second? What
> was the previous throughput?

I am actually not quite sure what the exact throughput was or is or what I can expect. It varies so much. I am copying from a 23GB file list that is split into 3000 chunks, which are then processed by 16-24 parallel rsync processes. I have copied 27 of 64TB so far (according to df -h), and to my taste it's taking a lot longer than it should. The main problem here is not that I'm trying to copy 64TB (a drop in the bucket); the problem is that it's 64TB in tiny, small, and medium-sized files. This whole MDS mess and several pauses and restarts in between have completely distorted my sense of how far into the process I actually am or how fast I should expect it to go. Right now it's starting again from the beginning, so I expect it'll be another day or so until it starts moving some real data again.

> The cache size looks correct here.

Yeah. The cache appears to be constant-size now. I am still getting an occasional "client failing to respond to cache pressure", but it goes away as fast as it came.

> Try pinning if possible in each parallel rsync job.

I was considering that, but couldn't come up with a feasible pinning strategy. We have all those files of very different sizes spread very unevenly across a handful of top-level directories. I get the impression that I couldn't do much (or any) better than the automatic balancer.

> Here are tracker tickets to resolve the issues you encountered:
>
> https://tracker.ceph.com/issues/41140
> https://tracker.ceph.com/issues/41141

Thanks a lot!
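A chunked parallel-rsync setup like the one described can be sketched as follows. This is only an illustration under assumptions: the file list, chunk prefix, and destination host are hypothetical placeholders, and `--files-from` expects paths relative to the given source root:

```shell
# Split a large file list into 3000 roughly equal chunks
# (GNU split; "l/" splits on line boundaries).
split -n l/3000 filelist.txt chunk.

# Feed the chunks to rsync with bounded parallelism
# (here: 16 concurrent processes, as in the job described above).
ls chunk.* | xargs -P 16 -I{} \
    rsync -a --files-from={} /source/root/ remote:/mnt/cephfs/target/
```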
Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes
On Tue, Aug 6, 2019 at 7:57 AM Janek Bevendorff wrote:
>> 4k req/s is too fast for a create workload on one MDS. That must
>> include other operations like getattr.
>
> That is rsync going through millions of files checking which ones need
> updating. Right now there are not actually any create operations, since
> I restarted the copy job.

Your parallel rsync job is only getting 150 creates per second? What was the previous throughput?

>> I wouldn't expect such extreme latency issues. Please share:
>>
>> ceph config dump
>> ceph daemon mds.X cache status
>
> Config dump: https://pastebin.com/1jTrjzA9
>
> Cache status:
>
> {
>     "pool": {
>         "items": 127688932,
>         "bytes": 20401092561
>     }
> }
>
>> and the two perf dumps one second apart again please.
>
> Perf dump 1: https://pastebin.com/US3y6JEJ
> Perf dump 2: https://pastebin.com/Mm02puje

The cache size looks correct here.

>> Also, you said you removed the aggressive recall changes. I assume you
>> didn't reset them to the defaults, right? Just the first suggested
>> change (10k/1.0)?
>
> Either seems to work.
>
> I added two more MDSs to split the workload and got a steady 150 reqs/s
> after that. Then I noticed that I still had a max segments setting from
> one of my earlier attempts at fixing the cache runaway issue, and after
> removing that, I got 250-500 reqs/s, sometimes up to 1.5k (per MDS).

Okay, so you're getting a more normal throughput for parallel creates on a single MDS.

> However, to generate the dumps for you, I changed my max_mds setting
> back to 1 and reqs/s went down to 80. After re-adding the two active
> MDSs, I am back at higher numbers, although not quite as high as
> before. But I seem to remember that it took several minutes, if not
> more, until all MDSs received approximately equal load last time.

Try pinning if possible in each parallel rsync job.
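Directory pinning, as suggested above, is set per directory via an extended attribute on a mounted CephFS. A sketch assuming a mount at /mnt/cephfs; the directory names are hypothetical:

```shell
# Pin each rsync target tree to a specific active MDS rank so the
# parallel create workload bypasses the dynamic balancer.
setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/dir-a
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/dir-b
setfattr -n ceph.dir.pin -v 2 /mnt/cephfs/dir-c

# A value of -1 reverts a directory to automatic balancing.
setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/dir-a
```

Pins are inherited by subdirectories, so pinning only the handful of top-level directories mentioned above is enough to partition the workload.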
Here are tracker tickets to resolve the issues you encountered:

https://tracker.ceph.com/issues/41140
https://tracker.ceph.com/issues/41141

--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes
> 4k req/s is too fast for a create workload on one MDS. That must
> include other operations like getattr.

That is rsync going through millions of files checking which ones need updating. Right now there are not actually any create operations, since I restarted the copy job.

> I wouldn't expect such extreme latency issues. Please share:
>
> ceph config dump
> ceph daemon mds.X cache status

Config dump: https://pastebin.com/1jTrjzA9

Cache status:

{
    "pool": {
        "items": 127688932,
        "bytes": 20401092561
    }
}

> and the two perf dumps one second apart again please.

Perf dump 1: https://pastebin.com/US3y6JEJ
Perf dump 2: https://pastebin.com/Mm02puje

> Also, you said you removed the aggressive recall changes. I assume you
> didn't reset them to the defaults, right? Just the first suggested
> change (10k/1.0)?

Either seems to work.

I added two more MDSs to split the workload and got a steady 150 reqs/s after that. Then I noticed that I still had a max segments setting from one of my earlier attempts at fixing the cache runaway issue, and after removing that, I got 250-500 reqs/s, sometimes up to 1.5k (per MDS).

However, to generate the dumps for you, I changed my max_mds setting back to 1 and reqs/s went down to 80. After re-adding the two active MDSs, I am back at higher numbers, although not quite as high as before. But I seem to remember that it took several minutes, if not more, until all MDSs received approximately equal load last time.
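Adding and removing active MDS daemons, as described above, is controlled by the file system's max_mds setting; standbys are promoted or demoted to match. A sketch; the file system name "cephfs" is a placeholder for your own:

```shell
# Scale from one to three active MDS ranks (the "two more MDSs"
# scenario above); standby daemons are promoted automatically.
ceph fs set cephfs max_mds 3

# Scale back down to a single active rank.
ceph fs set cephfs max_mds 1
```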
Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes
On Tue, Aug 6, 2019 at 12:48 AM Janek Bevendorff wrote:
>> However, now my client processes are basically in constant I/O wait
>> state and the CephFS is slow for everybody. After I restarted the copy
>> job, I got around 4k reqs/s and then it went down to 100 reqs/s with
>> everybody waiting their turn. So yes, it does seem to help, but it
>> increases latency by an order of magnitude.

4k req/s is too fast for a create workload on one MDS. That must include other operations like getattr.

> Addition: I reduced the number to 256K and the cache size started
> inflating instantly (with about 140 reqs/s). So I reset it to 512K and
> the cache size started reducing slowly, though with fewer reqs/s.
>
> So I guess it is solving the problem, but only by trading it off against
> severe latency issues (an order of magnitude, as we saw).

I wouldn't expect such extreme latency issues. Please share:

ceph config dump
ceph daemon mds.X cache status

and the two perf dumps one second apart again please.

Also, you said you removed the aggressive recall changes. I assume you didn't reset them to the defaults, right? Just the first suggested change (10k/1.0)?

--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes
> However, now my client processes are basically in constant I/O wait
> state and the CephFS is slow for everybody. After I restarted the copy
> job, I got around 4k reqs/s and then it went down to 100 reqs/s with
> everybody waiting their turn. So yes, it does seem to help, but it
> increases latency by an order of magnitude.

Addition: I reduced the number to 256K and the cache size started inflating instantly (with about 140 reqs/s). So I reset it to 512K and the cache size started reducing slowly, though with fewer reqs/s.

So I guess it is solving the problem, but only by trading it off against severe latency issues (an order of magnitude, as we saw).
Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes
> Thanks that helps. Looks like the problem is that the MDS is not
> automatically trimming its cache fast enough. Please try bumping
> mds_cache_trim_threshold:
>
> bin/ceph config set mds mds_cache_trim_threshold 512K

That did help. Somewhat. I removed the aggressive recall settings I set before and only set this option instead. The cache size seems to be quite stable now, although still increasing in the long run (but at least not strictly monotonically).

However, now my client processes are basically in constant I/O wait state and the CephFS is slow for everybody. After I restarted the copy job, I got around 4k reqs/s and then it went down to 100 reqs/s with everybody waiting their turn. So yes, it does seem to help, but it increases latency by an order of magnitude.

As always, it would be great if these options were documented somewhere. Google has like five results, one of them being this thread. ;-)

> Increase it further if it's not aggressive enough. Please let us know if
> that helps. It shouldn't be necessary to do this so I'll make a tracker
> ticket once we confirm that's the issue.
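For anyone following along: the "bin/ceph" prefix in the quoted command is just a developer build-tree path; on a normal install it is plain "ceph". Setting and then verifying the trim threshold might look like this (a sketch; the mds.0 daemon name is a placeholder):

```shell
# Raise the number of cache items the MDS may trim per tick.
ceph config set mds mds_cache_trim_threshold 512K

# Confirm the value took effect (reported as a plain integer, 524288).
ceph config get mds.0 mds_cache_trim_threshold
```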
Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes
On Mon, Aug 5, 2019 at 12:21 AM Janek Bevendorff wrote:
> Hi,
>
>> You can also try increasing the aggressiveness of the MDS recall but
>> I'm surprised it's still a problem with the settings I gave you:
>>
>> ceph config set mds mds_recall_max_caps 15000
>> ceph config set mds mds_recall_max_decay_rate 0.75
>
> I finally had the chance to try the more aggressive recall settings, but
> they did not change anything. As soon as the client starts copying files
> again, the numbers go up and I get a health warning that the client is
> failing to respond to cache pressure.
>
> After this week of idle time, the dns/inos numbers (what does dns stand
> for anyway?) settled at around 8000k. That's basically the "idle"
> number it goes back to when the client stops copying files. Though, for
> some weird reason, this number gets (quite) a bit higher every time
> (last time it was around 960k). Of course, I wouldn't expect it to go
> back all the way to zero, because that would mean dropping the entire
> cache for no reason, but it's still quite high and stays the same after
> restarting the MDS and all clients, which doesn't make a lot of sense to
> me. After resuming the copy job, the number went up to 20M in just the
> time it took to write this email. There must be a bug somewhere.
>
>> Can you share two captures of `ceph daemon mds.X perf dump` about 1
>> second apart.
>
> I attached the requested perf dumps.

Thanks that helps. Looks like the problem is that the MDS is not automatically trimming its cache fast enough. Please try bumping mds_cache_trim_threshold:

bin/ceph config set mds mds_cache_trim_threshold 512K

Increase it further if it's not aggressive enough. Please let us know if that helps. It shouldn't be necessary to do this so I'll make a tracker ticket once we confirm that's the issue.

--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes
Hi,

> You can also try increasing the aggressiveness of the MDS recall but
> I'm surprised it's still a problem with the settings I gave you:
>
> ceph config set mds mds_recall_max_caps 15000
> ceph config set mds mds_recall_max_decay_rate 0.75

I finally had the chance to try the more aggressive recall settings, but they did not change anything. As soon as the client starts copying files again, the numbers go up and I get a health warning that the client is failing to respond to cache pressure.

After this week of idle time, the dns/inos numbers (what does dns stand for anyway?) settled at around 8000k. That's basically the "idle" number it goes back to when the client stops copying files. Though, for some weird reason, this number gets (quite) a bit higher every time (last time it was around 960k). Of course, I wouldn't expect it to go back all the way to zero, because that would mean dropping the entire cache for no reason, but it's still quite high and stays the same after restarting the MDS and all clients, which doesn't make a lot of sense to me. After resuming the copy job, the number went up to 20M in just the time it took to write this email. There must be a bug somewhere.

> Can you share two captures of `ceph daemon mds.X perf dump` about 1
> second apart.

I attached the requested perf dumps. Thanks!

Attachments: perf_dump_1.json, perf_dump_2.json
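Capturing the two perf dumps a second apart can be scripted on the MDS host. A sketch assuming access to the daemon's admin socket and that `jq` is installed; the daemon name mds.0 is a placeholder:

```shell
# Take two MDS perf dumps one second apart.
ceph daemon mds.0 perf dump > perf_dump_1.json
sleep 1
ceph daemon mds.0 perf dump > perf_dump_2.json

# Diff the sorted, pretty-printed JSON to see which counters moved
# during that second (requires bash for process substitution).
diff <(jq -S . perf_dump_1.json) <(jq -S . perf_dump_2.json)
```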
Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes
I am not sure if making caps recall more aggressive helps. It seems to be the client failing to respond to it (at least that's what the warnings say). But I will try your newly suggested settings as soon as I get the chance and will report back with the results.

On 25 Jul 2019 11:00 pm, Patrick Donnelly wrote:
> On Thu, Jul 25, 2019 at 12:49 PM Janek Bevendorff wrote:
>>> Based on that message, it would appear you still have an inode limit
>>> in place ("mds_cache_size"). Please unset that config option. Your
>>> mds_cache_memory_limit is apparently ~19GB.
>>
>> No, I do not have an inode limit set. Only the memory limit.
>>
>>> There is another limit mds_max_caps_per_client (default 1M) which the
>>> client is hitting. That's why the MDS is recalling caps from the
>>> client and not because any cache memory limit is hit. It is not
>>> recommended you increase this.
>>
>> Okay, this setting isn't documented either and I did not change it,
>> but it's also quite clear that it isn't working. My MDS hasn't crashed
>> yet (without the recall settings it would have), but ceph fs status is
>> reporting 14M inodes at this point and the number is slowly going up.
>
> Can you share two captures of `ceph daemon mds.X perf dump` about 1
> second apart.
>
> You can also try increasing the aggressiveness of the MDS recall but
> I'm surprised it's still a problem with the settings I gave you:
>
> ceph config set mds mds_recall_max_caps 15000
> ceph config set mds mds_recall_max_decay_rate 0.75
>
> --
> Patrick Donnelly, Ph.D.
> He / Him / His
> Senior Software Engineer
> Red Hat Sunnyvale, CA
> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes
On Thu, Jul 25, 2019 at 12:49 PM Janek Bevendorff wrote:
>> Based on that message, it would appear you still have an inode limit
>> in place ("mds_cache_size"). Please unset that config option. Your
>> mds_cache_memory_limit is apparently ~19GB.
>
> No, I do not have an inode limit set. Only the memory limit.
>
>> There is another limit mds_max_caps_per_client (default 1M) which the
>> client is hitting. That's why the MDS is recalling caps from the
>> client and not because any cache memory limit is hit. It is not
>> recommended you increase this.
>
> Okay, this setting isn't documented either and I did not change it,
> but it's also quite clear that it isn't working. My MDS hasn't crashed
> yet (without the recall settings it would have), but ceph fs status is
> reporting 14M inodes at this point and the number is slowly going up.

Can you share two captures of `ceph daemon mds.X perf dump` about 1 second apart.

You can also try increasing the aggressiveness of the MDS recall but I'm surprised it's still a problem with the settings I gave you:

ceph config set mds mds_recall_max_caps 15000
ceph config set mds mds_recall_max_decay_rate 0.75

--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes
> Based on that message, it would appear you still have an inode limit
> in place ("mds_cache_size"). Please unset that config option. Your
> mds_cache_memory_limit is apparently ~19GB.

No, I do not have an inode limit set. Only the memory limit.

> There is another limit mds_max_caps_per_client (default 1M) which the
> client is hitting. That's why the MDS is recalling caps from the
> client and not because any cache memory limit is hit. It is not
> recommended you increase this.

Okay, this setting isn't documented either and I did not change it, but it's also quite clear that it isn't working. My MDS hasn't crashed yet (without the recall settings it would have), but ceph fs status is reporting 14M inodes at this point and the number is slowly going up.
Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes
On Thu, Jul 25, 2019 at 3:08 AM Janek Bevendorff wrote:
> The rsync job has been copying quite happily for two hours now. The good
> news is that the cache size isn't increasing unboundedly with each
> request anymore. The bad news is that it still is increasing after all,
> though much more slowly. I am at 3M inodes now and it started off with
> 900k, settling at 1M initially. I had a peak just now of 3.7M, but it
> went back down to 3.2M shortly after that.
>
> According to the health status, the client has started failing to
> respond to cache pressure, so it's still not working as reliably as I
> would like it to. I am also getting this very peculiar message:
>
> MDS cache is too large (7GB/19GB); 52686 inodes in use by clients
>
> I guess the 53k inodes is the number that is actively in use right now
> (compared to the 3M for which the client generally holds caps). Is that
> so? Cache memory is still well within bounds, however. Perhaps the
> message is triggered by the recall settings and just a bit misleading?

Based on that message, it would appear you still have an inode limit in place ("mds_cache_size"). Please unset that config option. Your mds_cache_memory_limit is apparently ~19GB.

There is another limit mds_max_caps_per_client (default 1M) which the client is hitting. That's why the MDS is recalling caps from the client and not because any cache memory limit is hit. It is not recommended you increase this.

--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
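Checking for and clearing a leftover inode-count limit, as suggested above, might look like this (a sketch; `config rm` resets the option to its default rather than setting it to zero):

```shell
# List all explicitly set MDS options; a stray mds_cache_size set
# here would reintroduce the old inode-count limit.
ceph config dump | grep mds

# Unset the inode limit so only mds_cache_memory_limit applies.
ceph config rm mds mds_cache_size
```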
Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes
> It's possible the MDS is not being aggressive enough with asking the
> single (?) client to reduce its cache size. There were recent changes
> [1] to the MDS to improve this. However, the defaults may not be
> aggressive enough for your client's workload. Can you try:
>
> ceph config set mds mds_recall_max_caps 10000
> ceph config set mds mds_recall_max_decay_rate 1.0

Thank you. I was looking for config directives that do exactly this all week. Why are they not documented anywhere outside that blog post?

I added them as you described, and the MDS seems to have stabilized and stays just under 1M inos now. I will continue to monitor it and see if it keeps working in the long run.

Settings like these should be the default, IMHO. Clients should never be able to crash the server just by holding onto their capabilities. If a server decides to drop things from its cache, clients must deal with it. Everything else threatens the stability of the system (and may even prevent the MDS from ever starting again, as we saw).

> Also your other mailings made me think you may still be using the old
> inode limit for the cache size. Are you using the new
> mds_cache_memory_limit config option?

No, I am not. I tried it at some point to see if it made things better, but just like the memory cache limit, it seemed to have no effect whatsoever except for delaying the health warning.

> Finally, if this fixes your issue (please let us know!) and you decide
> to try multiple active MDS, you should definitely use pinning as the
> parallel create workload will greatly benefit from it.

I will try that, although our directory tree is quite imbalanced.
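Monitoring the inode counts while the copy runs, as described, can be done from any admin node. A sketch; the mds.0 daemon name is a placeholder:

```shell
# Refresh the per-MDS request rate and dns/inos counters every
# 5 seconds (the numbers discussed throughout this thread).
watch -n 5 ceph fs status

# Or query one daemon's cache memory usage directly via its
# admin socket.
ceph daemon mds.0 cache status
```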
Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes
+ other ceph-users

On Wed, Jul 24, 2019 at 10:26 AM Janek Bevendorff wrote:
>> what's the ceph.com mailing list? I wondered whether this list is dead,
>> but it's the list announced on the official ceph.com homepage, isn't it?
>
> There are two mailing lists announced on the website. If you go to
> https://ceph.com/resources/ you will find the
> subscribe/unsubscribe/archive links for the (much more active) ceph.com
> MLs. But if you click on "Mailing Lists & IRC page" you will get to a
> page where you can subscribe to this list, which is different. Very
> confusing.

It is confusing. This is supposed to be the new ML, but I don't think the migration has started yet.

>> What did you have the MDS cache size set to at the time?
>>
>> < and an inode count between
>
> I actually did not think I'd get a reply here. We are a bit further than
> this on the other mailing list. This is the thread:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-July/036095.html
>
> To sum it up: the ceph client prevents the MDS from freeing its cache,
> so inodes keep piling up until either the MDS becomes too slow (fixable
> by increasing the beacon grace time) or runs out of memory. The latter
> will happen eventually. In the end, my MDSs couldn't even rejoin because
> they hit the host's 128GB memory limit and crashed.

It's possible the MDS is not being aggressive enough with asking the single (?) client to reduce its cache size. There were recent changes [1] to the MDS to improve this. However, the defaults may not be aggressive enough for your client's workload. Can you try:

ceph config set mds mds_recall_max_caps 10000
ceph config set mds mds_recall_max_decay_rate 1.0

Also your other mailings made me think you may still be using the old inode limit for the cache size. Are you using the new mds_cache_memory_limit config option?

Finally, if this fixes your issue (please let us know!)
and you decide to try multiple active MDS, you should definitely use pinning as the parallel create workload will greatly benefit from it.

[1] https://ceph.com/community/nautilus-cephfs/

--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D