On Thursday, July 04, 2019, at 5:03 AM, Matthew Ahrens wrote:
> Your use case makes sense to me.
Finally, you are the first! :)

> That said, I was able to create 100 pools on a system with 7GB RAM
I can as well. The difference seems to be that these pools have data on them.
They are not new either: they were created 2-3 years ago and have been heavily 
used since then.
That said, they were never overloaded; usage grew from 0 to the current ~60% 
(so never above 80% full).
The usage pattern is write-once files (they are never modified, just read or 
deleted).
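(Side note, in case anyone wants the same numbers for all pools at once: 
zpool list can print the fill level and fragmentation in one shot, e.g.

# zpool list -o name,size,allocated,free,capacity,fragmentation

the per-pool values for disk2 are in the full property dump below.)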

One of the pools:
# zpool get all disk2
NAME   PROPERTY                       VALUE                          SOURCE
disk2  size                           3.62T                          -
disk2  capacity                       60%                            -
disk2  altroot                        -                              default
disk2  health                         ONLINE                         -
disk2  guid                           16818933878072747776           default
disk2  version                        -                              default
disk2  bootfs                         -                              default
disk2  delegation                     on                             default
disk2  autoreplace                    off                            default
disk2  cachefile                      -                              default
disk2  failmode                       wait                           default
disk2  listsnapshots                  off                            default
disk2  autoexpand                     off                            default
disk2  dedupditto                     0                              default
disk2  dedupratio                     1.00x                          -
disk2  free                           1.42T                          -
disk2  allocated                      2.20T                          -
disk2  readonly                       off                            -
disk2  comment                        -                              default
disk2  expandsize                     -                              -
disk2  freeing                        0                              default
disk2  fragmentation                  72%                            -
disk2  leaked                         0                              default
disk2  bootsize                       -                              default
disk2  checkpoint                     -                              -
disk2  feature@async_destroy          enabled                        local
disk2  feature@empty_bpobj            enabled                        local
disk2  feature@lz4_compress           active                         local
disk2  feature@multi_vdev_crash_dump  enabled                        local
disk2  feature@spacemap_histogram     active                         local
disk2  feature@enabled_txg            active                         local
disk2  feature@hole_birth             active                         local
disk2  feature@extensible_dataset     active                         local
disk2  feature@embedded_data          active                         local
disk2  feature@bookmarks              enabled                        local
disk2  feature@filesystem_limits      enabled                        local
disk2  feature@large_blocks           active                         local
disk2  feature@sha512                 enabled                        local
disk2  feature@skein                  enabled                        local
disk2  feature@device_removal         enabled                        local
disk2  feature@obsolete_counts        enabled                        local
disk2  feature@zpool_checkpoint       enabled                        local

Oh, I'm not sure whether it's relevant: these filesystems were created with 
recordsize=1M (they initially held larger files), but that was set back to the 
default 128k a year ago (I'm aware this won't make the existing large blocks 
disappear, since the files were not rewritten).
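(For completeness, the recordsize change was the usual property set, nothing 
exotic, roughly:

# zfs get recordsize disk2
# zfs set recordsize=128k disk2

and since recordsize only affects newly written blocks, the old 1M blocks stay 
on disk until the files themselves are rewritten.)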

> Is it possible that the memory usage is proportional to number of 
> filesystems (or snapshots, or something else inside the filesystem) 
> rather than number of pools?
I don't think so, because I don't have any filesystems (well, except the 
default one that is created with the pool) or snapshots inside the pools.

> How many filesystems/snapshots do you have in each pool?
Exactly one filesystem (the pool's root dataset) and no snapshots.
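(Easy to double-check; counting datasets and snapshots per pool is just

# zfs list -H -t all -r disk2 | wc -l

which should return 1 for each of these pools: the root dataset itself.)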

> How's the memory usage if you have all the filesystems/snapshots in one big 
> pool (striped over all your disks)?
Well, that's what I can't tell, for several reasons:
1. I don't currently have a machine to which I could copy one machine's full 
data (I had spares, but had to put them into production to alleviate the 
problems caused by this effect).
2. Even if I had one, it would take a very long time to rewrite a hundred TiB 
of data.
BTW, I think it would work just fine. We have similarly sized (mirror and 
raidz) pools (in the range of 50-100 TiB) with the same kind of HW, exactly 
the same OS, and similar ZFS settings and use case, and we have never 
experienced anything like this. Well, maybe that's just because we didn't hit 
a limit there, I don't know.
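(For clarity, by "one big pool" I understand a single pool striped over all 
the disks instead of one pool per disk, i.e. something like

# zpool create bigpool da0 da1 da2 da3

with hypothetical device names, continuing for the rest of the disks, or the 
mirror/raidz equivalent.)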

> However, there are some aspects of limiting memory usage which are 
> per-pool, most prominently zfs_dirty_data_max, which is typically 4GB 
> per pool.
Yeah, I already tried to experiment with these tunables, but:
> You would hit this when under heavy write workloads (to each pool).  
> You'd probably want to decrease this, or find a way to implement it 
> globally rather than per pool.  An idle system wouldn't have any dirty 
> data, so this wouldn't come into play.
You're right, dirty_data_max would cause problems during operation, not 
during import...
I tried lowering (to 1/10th of the default value) everything that seemed to 
be memory related, and it did not significantly decrease the kmem usage after 
import.
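(The kind of settings I mean, with illustrative values only, not a 
recommendation; the exact tunable names depend on the FreeBSD/ZFS version:

# cap the ARC at 4 GiB:
vfs.zfs.arc_max="4294967296"
# roughly 1/10 of the 4 GiB per-pool dirty data default:
vfs.zfs.dirty_data_max="429496729"

set in /boot/loader.conf, or adjusted at runtime with sysctl(8) where the 
tunable is writable.)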

> The next step would be to figure out where the memory is being used.  
> I'm not familiar with all the tools on FreeBSD, but can you tell if the 
> usage is in the ARC vs the kernel heap?  If the latter, breaking it down 
> by kmem cache would help.
Please see my answer to Richard:
https://openzfs.topicbox.com/groups/developer/T10533b84f9e1cfc5-Mcbe0070f704b59907a38125d/using-many-zpools
I'm sure it's not the ARC, for these reasons (the checks I mean are sketched 
below):
1. On FreeBSD, top has a separate line for the ARC, and it remained low during 
the import (nothing really reads from the pools at that time, so that is to be 
expected).
2. I've limited the ARC size with the kernel tunable.
3. Limiting the ARC size decreased the ARC (and with it the kmem) usage almost 
instantaneously.
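The checks (FreeBSD-specific; exact sysctl names may differ between versions):

# sysctl kstat.zfs.misc.arcstats.size        <- current ARC size in bytes
# sysctl vm.kmem_map_size vm.kmem_map_free   <- kernel memory map usage
# vmstat -z   <- per-UMA-zone usage (the zio_*, arc_*, dnode_t caches)
# vmstat -m   <- per-malloc-type usage (ZFS mostly shows up under "solaris")

The ARC number stays small while the kmem numbers grow, which is why I'm 
confident it's not the ARC.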

BTW, I have new (and, I very much hope, valuable) details to share!

During the high-kmem-usage episodes, importing all of the pools took ages. I 
haven't measured it, but well over an hour!
During the import, the disk under the pool currently being imported worked 
hard (and zpool import -a itself seems to be sequential, because only one disk 
was busy at a time). I'm not quite sure whether it wrote a lot, but it 
certainly read a lot, and randomly.
With each import, the kmem usage grew by 1-1.5 GiB.
No scrubs were running.
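(If it happens again I'll capture this more precisely; the measurement I have 
in mind is simply importing the pools one by one and checking kernel memory 
before and after each one, roughly (pool name illustrative):

# sysctl vm.kmem_map_size
# time zpool import disk2
# sysctl vm.kmem_map_size

so both the per-pool import time and the per-pool kmem growth get recorded.)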

Now guess what happened today! All machines were rebooted (they had been 
dying for days and had been restarted several times), and this time they came 
up quickly, the pools were imported fast, and after mounting all of them only 
around 7-10 GiB of kmem was in use, which is totally acceptable!

Previously I thought the blocks written with recordsize=1M caused the problem 
(or at least contributed to it), so I started to rewrite the files with 128k. 
When I started the rewrite process (with the high-kmem problem still in 
effect), any machine could die in no more than a minute, because the ARC (and 
other stuff) grew and ate all the remaining kernel memory.
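(The rewrite itself is nothing clever, just copying each file over itself so 
its data gets written again with the current 128k recordsize; a rough sketch 
only, with an illustrative path and no handling of exotic file names:

find /disk2 -type f | while read -r f; do
    cp -p "$f" "$f.tmp" && mv "$f.tmp" "$f"
done
)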

Now the rewrite has been running for half an hour and everything is fine.

We experienced this last year as well; back then it also lasted a few days 
and then disappeared by itself. The machines then worked fine for around a 
year, and now the same problem has come back and disappeared (for now) again.

What does ZFS do on some imports that eats all the kmem (and keeps it after 
the import finishes), but not on others?
Does anybody know of anything at import time that could cause this?

Now I would also do a zpool export and import to see whether that makes a 
difference, but right now all the machines work fine.
What the...?!