Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-08-12 Thread Janek Bevendorff
I've been copying happily for days now (not very fast, but the MDSs were 
stable), but eventually the MDSs started flapping again due to large 
cache sizes (they are being killed after 11M inodes). I could work around 
the problem by temporarily increasing the cache size to allow them to 
rejoin, but it tells me that my settings do not fully solve the 
problem yet (unless perhaps I increase the trim threshold even further).
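
For anyone following along: by "temporarily increasing the cache size" I mean
nothing fancier than bumping the MDS memory limit and lowering it again once
the ranks are active, roughly like this (the byte values are purely
illustrative, not my actual limits):

ceph config set mds mds_cache_memory_limit 34359738368   # ~32 GiB while the MDSs rejoin
ceph config set mds mds_cache_memory_limit 21474836480   # back down once they are active again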



On 06.08.19 19:52, Janek Bevendorff wrote:

Your parallel rsync job is only getting 150 creates per second? What
was the previous throughput?

I am actually not quite sure what the exact throughput was or is or what
I can expect. It varies so much. I am copying from a 23GB file list that
is split into 3000 chunks which are then processed by 16-24 parallel
rsync processes. I have copied 27 TB of the 64 TB so far (according to df -h)
and to my taste it's taking a lot longer than it should. The
main problem here is not that I'm trying to copy 64 TB (a drop in the
bucket); the problem is that it's 64 TB in tiny, small, and medium-sized
files.

This whole MDS mess and several pauses and restarts in between have
completely distorted my sense of how far in the process I actually am or
how fast I would expect it to go. Right now it's starting again from the
beginning, so I expect it'll be another day or so until it starts moving
some real data again.


The cache size looks correct here.

Yeah. Cache appears to be constant-size now. I am still getting
occasional "client failing to respond to cache pressure", but that goes
away as fast as it came.



Try pinning if possible in each parallel rsync job.

I was considering that, but couldn't come up with a feasible pinning
strategy. We have all those files of very different sizes spread very
unevenly across a handful of top-level directories. I get the impression
that I couldn't do much (or any) better than the automatic balancer.
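
For what it's worth, if I were to try it, pinning a directory to an MDS rank
is just an extended attribute, so a crude round-robin over those top-level
directories would presumably look something like this (paths and rank numbers
purely illustrative):

setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/top-dir-a
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/top-dir-b
setfattr -n ceph.dir.pin -v 2 /mnt/cephfs/top-dir-c
# -v -1 should hand a directory back to the automatic balancer

But given how unevenly the data is spread, I doubt such a static assignment
would beat the balancer.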



Here are tracker tickets to resolve the issues you encountered:

https://tracker.ceph.com/issues/41140
https://tracker.ceph.com/issues/41141

Thanks a lot!


Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-08-06 Thread Janek Bevendorff


> Your parallel rsync job is only getting 150 creates per second? What
> was the previous throughput?

I am actually not quite sure what the exact throughput was or is or what
I can expect. It varies so much. I am copying from a 23GB file list that
is split into 3000 chunks which are then processed by 16-24 parallel
rsync processes. I have copied 27 TB of the 64 TB so far (according to df -h)
and to my taste it's taking a lot longer than it should. The
main problem here is not that I'm trying to copy 64 TB (a drop in the
bucket); the problem is that it's 64 TB in tiny, small, and medium-sized
files.

This whole MDS mess and several pauses and restarts in between have
completely distorted my sense of how far in the process I actually am or
how fast I would expect it to go. Right now it's starting again from the
beginning, so I expect it'll be another day or so until it starts moving
some real data again.

> The cache size looks correct here.

Yeah. Cache appears to be constant-size now. I am still getting
occasional "client failing to respond to cache pressure", but that goes
away as fast as it came.


> Try pinning if possible in each parallel rsync job.

I was considering that, but couldn't come up with a feasible pinning
strategy. We have all those files of very different sizes spread very
unevenly across a handful of top-level directories. I get the impression
that I couldn't do much (or any) better than the automatic balancer.


> Here are tracker tickets to resolve the issues you encountered:
>
> https://tracker.ceph.com/issues/41140
> https://tracker.ceph.com/issues/41141

Thanks a lot!



Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-08-06 Thread Patrick Donnelly
On Tue, Aug 6, 2019 at 7:57 AM Janek Bevendorff
 wrote:
>
>
> > 4k req/s is too fast for a create workload on one MDS. That must
> > include other operations like getattr.
>
> That is rsync going through millions of files checking which ones need
> updating. Right now there are not actually any create operations, since
> I restarted the copy job.

Your parallel rsync job is only getting 150 creates per second? What
was the previous throughput?

> > I wouldn't expect such extreme latency issues. Please share:
> >
> > ceph config dump
> > ceph daemon mds.X cache status
>
> Config dump: https://pastebin.com/1jTrjzA9
>
> Cache status:
>
> {
>  "pool": {
>  "items": 127688932,
>  "bytes": 20401092561
>  }
> }
>
>
> > and the two perf dumps one second apart again please.
> Perf dump 1: https://pastebin.com/US3y6JEJ
> Perf dump 2: https://pastebin.com/Mm02puje

The cache size looks correct here.

> > Also, you said you removed the aggressive recall changes. I assume you
> > didn't reset them to the defaults, right? Just the first suggested
> > change (10k/1.0)?
>
> Either seems to work.
>
> I added two more MDSs to split the workload and got a steady 150 reqs/s
> after that. Then I noticed that I still had a max segments setting from
> one of my earlier attempts at fixing the cache runaway issue and after
> removing that, I got 250-500 reqs/s, sometimes up to 1.5k (per MDS).

Okay, so you're getting a more normal throughput for parallel creates
on a single MDS.

> However, to generate the dumps for you, I changed my max_mds setting
> back to 1 and reqs/s went down to 80. After re-adding the two active
> MDSs again, I am back at higher numbers, although not quite as much as
> before. But I seem to remember that it took several minutes, if not more,
> until all MDSs received approximately equal load the last time.

Try pinning if possible in each parallel rsync job.

Here are tracker tickets to resolve the issues you encountered:

https://tracker.ceph.com/issues/41140
https://tracker.ceph.com/issues/41141

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D


Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-08-06 Thread Janek Bevendorff



4k req/s is too fast for a create workload on one MDS. That must
include other operations like getattr.


That is rsync going through millions of files checking which ones need 
updating. Right now there are not actually any create operations, since 
I restarted the copy job.



I wouldn't expect such extreme latency issues. Please share:

ceph config dump
ceph daemon mds.X cache status


Config dump: https://pastebin.com/1jTrjzA9

Cache status:

{
    "pool": {
    "items": 127688932,
    "bytes": 20401092561
    }
}



and the two perf dumps one second apart again please.

Perf dump 1: https://pastebin.com/US3y6JEJ
Perf dump 2: https://pastebin.com/Mm02puje



Also, you said you removed the aggressive recall changes. I assume you
didn't reset them to the defaults, right? Just the first suggested
change (10k/1.0)?


Either seems to work.

I added two more MDSs to split the workload and got a steady 150 reqs/s 
after that. Then I noticed that I still had a max segments setting from 
one of my earlier attempts at fixing the cache runaway issue and after 
removing that, I got 250-500 reqs/s, sometimes up to 1.5k (per MDS).
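
(Side note: the "max segments" option in question is presumably
mds_log_max_segments. Spotting and clearing a leftover override of that kind
is just something like

ceph config dump | grep mds_log_max_segments
ceph config rm mds mds_log_max_segments

assuming it was set centrally rather than in a local ceph.conf.)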


However, to generate the dumps for you, I changed my max_mds setting 
back to 1 and reqs/s went down to 80. After re-adding the two active 
MDSs again, I am back at higher numbers, although not quite as much as 
before. But I seem to remember that it took several minutes, if not more, 
until all MDSs received approximately equal load the last time.
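
(The per-rank request rates are easy to keep an eye on with something along
the lines of

watch -n 5 "ceph fs status"

which should list each active rank with its Reqs/s and its dns/inos counters.)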




Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-08-06 Thread Patrick Donnelly
On Tue, Aug 6, 2019 at 12:48 AM Janek Bevendorff
 wrote:
> > However, now my client processes are basically in constant I/O wait
> > state and the CephFS is slow for everybody. After I restarted the copy
> > job, I got around 4k reqs/s and then it went down to 100 reqs/s with
> > everybody waiting their turn. So yes, it does seem to help, but it
> > increases latency by an order of magnitude.

4k req/s is too fast for a create workload on one MDS. That must
include other operations like getattr.

> Addition: I reduced the number to 256K and the cache size started
> inflating instantly (with about 140 reqs/s). So I reset it to 512K and
> the cache size started reducing slowly, though with fewer reqs/s.
>
> So I guess it is solving the problem, but only by trading it off against
> severe latency issues (an order of magnitude, as we saw).

I wouldn't expect such extreme latency issues. Please share:

ceph config dump
ceph daemon mds.X cache status

and the two perf dumps one second apart again please.

Also, you said you removed the aggressive recall changes. I assume you
didn't reset them to the defaults, right? Just the first suggested
change (10k/1.0)?

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D


Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-08-06 Thread Janek Bevendorff



However, now my client processes are basically in constant I/O wait 
state and the CephFS is slow for everybody. After I restarted the copy 
job, I got around 4k reqs/s and then it went down to 100 reqs/s with 
everybody waiting their turn. So yes, it does seem to help, but it 
increases latency by an order of magnitude.


Addition: I reduced the number to 256K and the cache size started 
inflating instantly (with about 140 reqs/s). So I reset it to 512K and 
the cache size started reducing slowly, though with fewer reqs/s.


So I guess it is solving the problem, but only by trading it off against 
severe latency issues (an order of magnitude, as we saw).





Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-08-06 Thread Janek Bevendorff




Thanks, that helps. Looks like the problem is that the MDS is not
automatically trimming its cache fast enough. Please try bumping
mds_cache_trim_threshold:

bin/ceph config set mds mds_cache_trim_threshold 512K


That did help. Somewhat. I removed the aggressive recall settings I set 
before and only set this option instead. The cache size seems to be 
quite stable now, although still increasing in the long run (but at 
least not strictly monotonically).


However, now my client processes are basically in constant I/O wait 
state and the CephFS is slow for everybody. After I restarted the copy 
job, I got around 4k reqs/s and then it went down to 100 reqs/s with 
everybody waiting their turn. So yes, it does seem to help, but it 
increases latency by an order of magnitude.


As always, it would be great if these options were documented somewhere. 
Google has like five results, one of them being this thread. ;-)




Increase it further if it's not aggressive enough. Please let us know
if that helps.

It shouldn't be necessary to do this so I'll make a tracker ticket
once we confirm that's the issue.



Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-08-05 Thread Patrick Donnelly
On Mon, Aug 5, 2019 at 12:21 AM Janek Bevendorff
 wrote:
>
> Hi,
>
> > You can also try increasing the aggressiveness of the MDS recall but
> > I'm surprised it's still a problem with the settings I gave you:
> >
> > ceph config set mds mds_recall_max_caps 15000
> > ceph config set mds mds_recall_max_decay_rate 0.75
>
> I finally had the chance to try the more aggressive recall settings, but
> they did not change anything. As soon as the client starts copying files
> again, the numbers go up and I get a health message that the client is
> failing to respond to cache pressure.
>
> After this week of idle time, the dns/inos numbers (what does dns stand
> for anyway?) settled at around 8000k. That's basically that "idle"
> number that it goes back to when the client stops copying files. Though,
> for some weird reason, this number gets (quite) a bit higher every time
> (last time it was around 960k). Of course, I wouldn't expect it to go
> back all the way to zero, because that would mean dropping the entire
> cache for no reason, but it's still quite high and the same after
> restarting the MDS and all clients, which doesn't make a lot of sense to
> me. After resuming the copy job, the number went up to 20M in just the
> time it takes to write this email. There must be a bug somewhere.
>
> > Can you share two captures of `ceph daemon mds.X perf dump` about 1
> > second apart.
>
> I attached the requested perf dumps.

Thanks, that helps. Looks like the problem is that the MDS is not
automatically trimming its cache fast enough. Please try bumping
mds_cache_trim_threshold:

bin/ceph config set mds mds_cache_trim_threshold 512K

Increase it further if it's not aggressive enough. Please let us know
if that helps.
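
To confirm the override took effect and to watch whether trimming keeps up,
something like this should do (with mds.X standing in for the active daemon):

ceph daemon mds.X config get mds_cache_trim_threshold
watch -n 1 "ceph daemon mds.X cache status"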

It shouldn't be necessary to do this so I'll make a tracker ticket
once we confirm that's the issue.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D


Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-08-05 Thread Janek Bevendorff

Hi,


You can also try increasing the aggressiveness of the MDS recall but
I'm surprised it's still a problem with the settings I gave you:

ceph config set mds mds_recall_max_caps 15000
ceph config set mds mds_recall_max_decay_rate 0.75


I finally had the chance to try the more aggressive recall settings, but 
they did not change anything. As soon as the client starts copying files 
again, the numbers go up and I get a health message that the client is 
failing to respond to cache pressure.


After this week of idle time, the dns/inos numbers (what does dns stand 
for anyway?) settled at around 8000k. That's basically that "idle" 
number that it goes back to when the client stops copying files. Though, 
for some weird reason, this number gets (quite) a bit higher every time 
(last time it was around 960k). Of course, I wouldn't expect it to go 
back all the way to zero, because that would mean dropping the entire 
cache for no reason, but it's still quite high and the same after 
restarting the MDS and all clients, which doesn't make a lot of sense to 
me. After resuming the copy job, the number went up to 20M in just the 
time it takes to write this email. There must be a bug somewhere.
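
(Partly answering my own side question: "dns" in the ceph fs status output
presumably stands for dentries. The same pair of counters can be read off the
MDS admin socket with something like

ceph daemon mds.X perf dump mds_mem

where mds.X is the active MDS and the dump reports "ino" and "dn" values.)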



Can you share two captures of `ceph daemon mds.X perf dump` about 1
second apart.


I attached the requested perf dumps.


Thanks!



perf_dump_1.json
Description: application/json


perf_dump_2.json
Description: application/json


Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-07-25 Thread Janek Bevendorff
I am not sure if making caps recall more aggressive helps. It seems to be the
client failing to respond to it (at least that's what the warnings say). But I
will try your newly suggested settings as soon as I get the chance and will
report back with the results.

On 25 Jul 2019 11:00 pm, Patrick Donnelly wrote:

On Thu, Jul 25, 2019 at 12:49 PM Janek Bevendorff wrote:
>
>
> > Based on that message, it would appear you still have an inode limit
> > in place ("mds_cache_size"). Please unset that config option. Your
> > mds_cache_memory_limit is apparently ~19GB.
>
> No, I do not have an inode limit set. Only the memory limit.
>
>
> > There is another limit mds_max_caps_per_client (default 1M) which the
> > client is hitting. That's why the MDS is recalling caps from the
> > client and not because any cache memory limit is hit. It is not
> > recommended that you increase this.
> Okay, this setting isn't documented either and I did not change it,
> but it's also quite clear that it isn't working. My MDS hasn't crashed
> yet (without the recall settings it would have), but ceph fs status is
> reporting 14M inodes at this point and the number is slowly going up.

Can you share two captures of `ceph daemon mds.X perf dump` about 1
second apart.

You can also try increasing the aggressiveness of the MDS recall but
I'm surprised it's still a problem with the settings I gave you:

ceph config set mds mds_recall_max_caps 15000
ceph config set mds mds_recall_max_decay_rate 0.75

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D



Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-07-25 Thread Patrick Donnelly
On Thu, Jul 25, 2019 at 12:49 PM Janek Bevendorff
 wrote:
>
>
> > Based on that message, it would appear you still have an inode limit
> > in place ("mds_cache_size"). Please unset that config option. Your
> > mds_cache_memory_limit is apparently ~19GB.
>
> No, I do not have an inode limit set. Only the memory limit.
>
>
> > There is another limit mds_max_caps_per_client (default 1M) which the
> > client is hitting. That's why the MDS is recalling caps from the
> > client and not because any cache memory limit is hit. It is not
> recommended that you increase this.
> Okay, this setting isn't documented either and I did not change it,
> but it's also quite clear that it isn't working. My MDS hasn't crashed
> yet (without the recall settings it would have), but ceph fs status is
> reporting 14M inodes at this point and the number is slowly going up.

Can you share two captures of `ceph daemon mds.X perf dump` about 1
second apart.

You can also try increasing the aggressiveness of the MDS recall but
I'm surprised it's still a problem with the settings I gave you:

ceph config set mds mds_recall_max_caps 15000
ceph config set mds mds_recall_max_decay_rate 0.75

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D


Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-07-25 Thread Janek Bevendorff


> Based on that message, it would appear you still have an inode limit
> in place ("mds_cache_size"). Please unset that config option. Your
> mds_cache_memory_limit is apparently ~19GB.

No, I do not have an inode limit set. Only the memory limit.


> There is another limit mds_max_caps_per_client (default 1M) which the
> client is hitting. That's why the MDS is recalling caps from the
> client and not because any cache memory limit is hit. It is not
> recommended that you increase this.
Okay, this setting isn't documented either and I did not change it,
but it's also quite clear that it isn't working. My MDS hasn't crashed
yet (without the recall settings it would have), but ceph fs status is
reporting 14M inodes at this point and the number is slowly going up.


Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-07-25 Thread Patrick Donnelly
On Thu, Jul 25, 2019 at 3:08 AM Janek Bevendorff
 wrote:
>
> The rsync job has been copying quite happily for two hours now. The good
> news is that the cache size isn't increasing unboundedly with each
> request anymore. The bad news is that it still is increasing after all,
> though much more slowly. I am at 3M inodes now and it started off with 900k,
> settling at 1M initially. I had a peak just now of 3.7M, but it went
> back down to 3.2M shortly after that.
>
> According to the health status, the client has started failing to
> respond to cache pressure, so it's still not working as reliably as I
> would like it to. I am also getting this very peculiar message:
>
> MDS cache is too large (7GB/19GB); 52686 inodes in use by clients
>
> I guess the 53k inodes is the number that is actively in use right now
> (compared to the 3M for which the client generally holds caps). Is that
> so? Cache memory is still well within bounds, however. Perhaps the
> message is triggered by the recall settings and just a bit misleading?

Based on that message, it would appear you still have an inode limit
in place ("mds_cache_size"). Please unset that config option. Your
mds_cache_memory_limit is apparently ~19GB.

There is another limit mds_max_caps_per_client (default 1M) which the
client is hitting. That's why the MDS is recalling caps from the
client and not because any cache memory limit is hit. It is not
recommended that you increase this.
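
If you want to see how close the client is to that limit, the session list on
the MDS admin socket should show a num_caps figure per session, e.g.:

ceph daemon mds.X session ls

(with mds.X being the active MDS).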

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D


Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-07-25 Thread Janek Bevendorff




It's possible the MDS is not being aggressive enough with asking the
single (?) client to reduce its cache size. There were recent changes
[1] to the MDS to improve this. However, the defaults may not be
aggressive enough for your client's workload. Can you try:

ceph config set mds mds_recall_max_caps 10000
ceph config set mds mds_recall_max_decay_rate 1.0


Thank you. I was looking for config directives that do exactly this all 
week. Why are they not documented anywhere outside that blog post?


I added them as you described and the MDS seems to have stabilized and 
stays just under 1M inos now. I will continue to monitor it and see if 
it is working in the long run. Settings like these should be the default 
IMHO. Clients should never be able to crash the server just by holding 
onto their capabilities. If a server decides to drop things from its 
cache, clients must deal with it. Everything else threatens the 
stability of the system (and may even prevent the MDS from ever starting 
again, as we saw).



Also your other mailings made me think you may still be using the old
inode limit for the cache size. Are you using the new
mds_cache_memory_limit config option?


No, I am not. I tried it at some point to see if it made things better, 
but just like the memory cache limit, it seemed to have no effect 
whatsoever except for delaying the health warning.




Finally, if this fixes your issue (please let us know!) and you decide
to try multiple active MDS, you should definitely use pinning as the
parallel create workload will greatly benefit from it.


I will try that, although my directory tree is quite imbalanced.




Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-07-24 Thread Patrick Donnelly
+ other ceph-users

On Wed, Jul 24, 2019 at 10:26 AM Janek Bevendorff
 wrote:
>
> > what's the ceph.com mailing list? I wondered whether this list is dead but 
> > it's the list announced on the official ceph.com homepage, isn't it?
> There are two mailing lists announced on the website. If you go to
> https://ceph.com/resources/ you will find the
> subscribe/unsubscribe/archive links for the (much more active) ceph.com
> MLs. But if you click on "Mailing Lists & IRC page" you will get to a
> page where you can subscribe to this list, which is different. Very
> confusing.

It is confusing. This is supposed to be the new ML but I don't think
the migration has started yet.

> > What did you have the MDS cache size set to at the time?
> >
> > < and an inode count between
>
> I actually did not think I'd get a reply here. We are a bit further than
> this on the other mailing list. This is the thread:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-July/036095.html
>
> To sum it up: the ceph client prevents the MDS from freeing its cache,
> so inodes keep piling up until either the MDS becomes too slow (fixable
> by increasing the beacon grace time) or runs out of memory. The latter
> will happen eventually. In the end, my MDSs couldn't even rejoin because
> they hit the host's 128GB memory limit and crashed.

It's possible the MDS is not being aggressive enough with asking the
single (?) client to reduce its cache size. There were recent changes
[1] to the MDS to improve this. However, the defaults may not be
aggressive enough for your client's workload. Can you try:

ceph config set mds mds_recall_max_caps 10000
ceph config set mds mds_recall_max_decay_rate 1.0

Also your other mailings made me think you may still be using the old
inode limit for the cache size. Are you using the new
mds_cache_memory_limit config option?

Finally, if this fixes your issue (please let us know!) and you decide
to try multiple active MDS, you should definitely use pinning as the
parallel create workload will greatly benefit from it.

[1] https://ceph.com/community/nautilus-cephfs/

--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D