Your use case makes sense to me.  There's nothing inherent in the design of
ZFS that would require 1GB memory per pool, but it's possible that the
implementation is faulty.  There has not been a lot of work on optimizing
the many-pools use case.  That said, I was able to create 100 pools on a
system with 7GB RAM (on ZoL master on Ubuntu, with ZFS root), and it
increased memory usage by about 0.5GB, for a total of around 4GB used
(including non-ZFS stuff).

script:
#!/bin/bash -x

free -m
arcstat

# create 100 pools, each backed by a 1GB sparse file
for (( i = 0; i < 100; i++ )); do
        truncate -s 1g /var/tmp/$i
        zpool create test$i /var/tmp/$i
done

free -m
arcstat
sleep 10

for (( i = 0; i < 100; i++ )); do
        zpool export test$i
done

output:

+ free -m
              total        used        free      shared  buff/cache   available
Mem:           7465        3928        3021         151         515        3143
+ arcstat
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
02:52:58     0     0      0     0    0     0    0     0    0   1.7G  3.6G

<create 100 pools>

+ free -m
              total        used        free      shared  buff/cache   available
Mem:           7465        4371        2577         151         517        2700
+ arcstat
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
02:53:30     3     0      0     0    0     0    0     0    0   1.9G  3.6G

Is it possible that the memory usage is proportional to number of
filesystems (or snapshots, or something else inside the filesystem) rather
than number of pools?  How many filesystems/snapshots do you have in each
pool?  How's the memory usage if you have all the filesystems/snapshots in
one big pool (striped over all your disks)?
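
For example, something like this (an untested sketch using the standard zpool/zfs
commands) should give per-pool counts:

for pool in $(zpool list -H -o name); do
        fs=$(zfs list -H -r -t filesystem -o name "$pool" | wc -l)
        snaps=$(zfs list -H -r -t snapshot -o name "$pool" | wc -l)
        echo "$pool: $fs filesystems, $snaps snapshots"
done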

The ARC and dbuf caches are global (for all pools), so they should be
(trying to) respond to memory pressure globally, regardless of the number
of pools.  However, there are some aspects of limiting memory usage which
are per-pool, most prominently zfs_dirty_data_max, which is typically 4GB
per pool.  You would hit this under heavy write workloads (to each
pool).  You'd probably want to decrease this, or find a way to implement it
globally rather than per pool.  An idle system wouldn't have any dirty
data, so this wouldn't come into play.
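
For reference, the tunable can be inspected and lowered at runtime; something
like this should work (an untested sketch, and the 1GiB value is just an example):

# Linux (ZoL):
cat /sys/module/zfs/parameters/zfs_dirty_data_max
echo $((1024*1024*1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max
# FreeBSD (the sysctl name should be vfs.zfs.dirty_data_max):
sysctl vfs.zfs.dirty_data_max
sysctl vfs.zfs.dirty_data_max=$((1024*1024*1024))

Note that with 44 pools even a modest per-pool value adds up, which is why a
global limit would be the better long-term fix.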

The next step would be to figure out where the memory is being used.  I'm
not familiar with all the tools on FreeBSD, but can you tell if the usage
is in the ARC vs the kernel heap?  If the latter, breaking it down by kmem
cache would help.
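
Off the top of my head, something like this should show the breakdown on
FreeBSD (a rough sketch; I haven't verified these commands there):

sysctl kstat.zfs.misc.arcstats.size   # current ARC size, in bytes
sysctl vm.stats.vm.v_wire_count       # total wired memory, in pages
vmstat -m                             # kernel malloc(9) usage by type
vmstat -z                             # UMA zone usage (~ SIZE * USED bytes per zone)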

--matt

On Tue, Jul 2, 2019 at 2:49 AM <[email protected]> wrote:

> Hi,
>
> I use zfs on FreeBSD, but I have some theoretical questions, which may be
> independent of the OS.
> zfs has a lot of useful features, which make it appealing for distributed
> storage as well.
> But building distributed storage on top of zfs makes local redundancy,
> well, redundant. :)
> If you use something over it which takes care of multi-host object
> redundancy, building a large, redundant zpool out of local disks is
> meaningless, while building a non-redundant pool locally is even worse (a
> failure of a single disk means you have to rebuild the whole machine over
> the network, which is unnecessary).
>
> So it seems very logical to use a zpool on each disk, but I'm not sure of
> the consequences.
> We use a system in production built on this scheme, but we've started to
> observe OOM-like situations under heavy writes, and even just mounting the
> pools (there are 2x-6x of them per machine) makes the machine eat all of
> the RAM (in FreeBSD's wired bucket, which means kernel memory; zfs's ARC
> is accounted there as well).
> On a given machine this means importing 44 zpools (just the import,
> nothing else running yet) raises the wired memory usage in top above 50
> GiB, while the ARC size remains low, at just a few GiB.
>
> The questions:
> - what are the general consequences of having one zpool per vdev?
> - are there any fundamental problems with having around 20-60 zpools per
> machine?
> - does zfs have any memory requirements which scale with the used
> storage/stored blocks/objects (per zpool)? I'm not sure why there is
> 50+ GiB allocated just by importing the pools, when I haven't read a bit
> from them. I don't have dedup enabled, but I do have compression (gzip) and
> had a period with a 1MiB block size, but that turned out to be problematic,
> so we reverted to the default 128k.
> - are there any cache tunables which are per zpool (or per zfs, since each
> zpool is a separate zfs filesystem as well)? If so, it could cause us
> problems, because a 4G cache on a single zpool may be fine, but 44*4 is
> just too much.
> - and while it's FreeBSD, I wonder if you have any ideas on why these
> OOMs may have appeared.
>
> Thanks,
