Re: [developer] Pathway to better DDT, and value-for-effort assessment of mitigations in the meantime.

2019-07-09 Thread Richard Elling


> On Jul 7, 2019, at 11:09 AM, Stilez  wrote:
> 
> I feel that, while that's true and valid, it also kind of misses the point?
> 
> What I'm wondering is whether there are simple enhancements that would be 
> beneficial in that area, or that would provide useful internal data.
> 
> It seems plausible that if a configurable space map block size helps, 
> perhaps a configurable DDT block size could as well, and that if DDT contents 
> are needed on every file load/save for a deduped pool, then a way to preload 
> them could be beneficial in the same way as a way to preload spacemap data.

Both spacemaps and the DDT are AVL trees, but there is one DDT versus hundreds 
(or more) of spacemaps. Spacemaps are only needed for writes, so if we aren't 
allocating space from a metaslab, that metaslab's spacemap can be evicted from 
RAM. Or, to look at it another way, a spacemap is constrained to a range of 
space (an LBA range) while the DDT covers the whole pool.

> 
> Clearly allocation class devices will help, but gains are always layered. "Faster 
> storage will fix it" isn't really an answer, any more than it's an answer for any 
> greatly bottlenecked critical pathway in a file system - and for deduped pools, 
> access to DDT records is *the* critical element: no user data can be read from 
> disk or written out in txgs without them. Fundamentally there are a few useful 
> configurables available for spacemaps that could be potential wins if also 
> available for DDTs. Since analogs for these settings already exist in the code, 
> perhaps no great work is involved, and perhaps they would give cheap but 
> significant IO gains for pools with dedup enabled.
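
(For reference, a minimal sketch of the existing spacemap preload knobs I assume 
are being alluded to, as they appear on ZoL; forcing every metaslab to stay 
loaded can cost a lot of RAM, so treat this as an experiment rather than a 
recommendation.)

    # ZoL module parameters: preload all metaslabs (and their spacemaps) at pool
    # import, and never unload them afterwards
    echo 1 > /sys/module/zfs/parameters/metaslab_debug_load
    echo 1 > /sys/module/zfs/parameters/metaslab_debug_unload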

In the past, some folks have proposed a different structure for DDT, such as 
using bloom filters or
other fast-lookup techniques. But as Garrett points out, the real wins are hard 
to come by. Meanwhile,
PRs are welcome.
 -- richard

> 
> Hence I think the question is valid, and remains valid both before allocation 
> classes and after them, and might be worth considering more deeply.
> 
> 
> 
> On 7 July 2019 16:03:59 Richard Elling wrote:
> 
>> Yes, from the ZoL zpool man page:
>> A device dedicated solely for deduplication tables.
>> 
>>   -- richard
>> 
>> 
>> 
>> On Jul 7, 2019, at 5:41 AM, Stilez wrote:
>> 
>>> "Dedup special class"?
>>> 
>>> On 6 July 2019 16:24:27 Richard Elling wrote:
>>> 
 
 On Jul 5, 2019, at 9:11 PM, Stilez wrote:
 
> I'm one of many end-users with highly dedupable pools held back by DDT 
> and spacemap read/write inefficiencies. There's been discussion and presentations 
> - Matt Ahrens' talk at BSDCan 2016 ("Dedup doesn't have to suck") was 
> especially useful, and allocation classes from the ZoL/ZoF work will 
> allow metadata-specific offload to SSD. But broad discussion of this 
> general area is not on the roadmap at the moment, probably because so much 
> else is a priority and it seems nobody's stepped up.
 
 In part because dedup will always be slower than non-dedup while the cost 
 of storage continues to plummet (flash SSDs down 40% in the past year and 
 there is currently an oversupply of NAND). A good starting point for 
 experiments is to use the dedup special class and report back to the 
 community how well it works for you.
 
   -- richard
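
(For anyone who wants to try that, a minimal sketch of the ZoL/ZoF 0.8 syntax; 
pool and device names are placeholders, and the dedup vdev should be redundant 
since it is a top-level vdev whose loss loses the pool.)

    # add a mirrored dedup allocation-class vdev to an existing pool
    zpool add tank dedup mirror nvme0 nvme1

    # or include one at pool creation time
    zpool create tank raidz2 da0 da1 da2 da3 dedup mirror nvme0 nvme1

    # "zpool list -v tank" then shows the dedup vdev and how much of it is used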
 



Re: [developer] Re: Using many zpools

2019-07-09 Thread nagy . attila
On Tuesday, July 09, 2019, at 8:10 PM, Matthew Ahrens wrote:
> We expect to open PR's against ZoL for this work within the next month (it 
> depends on https://github.com/zfsonlinux/zfs/pull/8442).
Oh, after reading this, things are even clearer.
And this explains why I saw a massive IO increase after switching from 
UFS to ZFS on many systems (with a lot of small files and high fragmentation).

Very useful thread indeed. Thanks a lot, Matt!


Re: [developer] Re: Using many zpools

2019-07-09 Thread nagy . attila
On Tuesday, July 09, 2019, at 8:10 PM, Matthew Ahrens wrote:
> This behavior is not really specific to having a lot of pools.  If you had 
> one big pool with all the disks in it, ZFS would still try to allocate from 
> each disk, causing most of that disk's metaslabs to be loaded (ZFS selects 
> first disk, then metaslab, then offset within that metaslab).
But wouldn't it be 1/46th of the current situation? Or does the memory 
requirement really scale with the amount stored (the number of blocks) in them, 
so it doesn't matter whether I have one big pool or many?
I have to read up on how these work in depth.

> You probably have a workload with lots of small files
No, you're right, there are a lot of smallish files now.
I didn't understand the output of zdb, but now I get that the first number 
is the record size (as a power of two). Just from that, it's a whole lot clearer. :)

Anyway, it all makes sense. Thank you very much for the detailed description 
(and for the hope that things will improve)!


Re: [developer] Re: Using many zpools

2019-07-09 Thread Matthew Ahrens
It looks like your disks are quite fragmented, with most of the free space
being in 8K-63KB chunks.  There are very few larger (>=128K) free chunks,
which can cause ZFS to load most of the metaslabs when looking for a 128K
free chunk (which it does periodically, especially if the ZIL is in heavy
use).

I would guess that what you are seeing is that when you import each pool,
it is playing the ZIL (since you had an unclean shutdown), and that is
needing to allocate some larger chunks (around 128KB), which needs to
search (and thus load) most of the metaslabs.  This is causing the 26GB of
RAM used by range_seg_cache.
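
(If you want to watch that number directly, a sketch assuming the usual kernel 
slab tooling; range_seg_cache is the kmem cache backing the in-memory range 
trees of loaded metaslabs.)

    # FreeBSD: per-zone slab usage, including range_seg_cache
    vmstat -z | grep range_seg

    # Linux/ZoL: the cache shows up in the kernel slab statistics
    # (it may be merged into a generic cache depending on kernel config)
    grep range_seg /proc/slabinfo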

This behavior is not really specific to having a lot of pools.  If you had
one big pool with all the disks in it, ZFS would still try to allocate from
each disk, causing most of that disk's metaslabs to be loaded (ZFS selects
first disk, then metaslab, then offset within that metaslab).

The good news is that we (Delphix, specifically Paul Dagnelie and I) have
been working on improving metaslab memory usage, achieving a roughly 5x
reduction in memory requirement (i.e. loaded metaslabs will use 1/5 the RAM
as previous).  We expect to open PR's against ZoL for this work within the
next month (it depends on https://github.com/zfsonlinux/zfs/pull/8442).

If you are able to decrease fragmentation, you'll also see memory
improvements here.  You probably have a workload with lots of small files
(or a few big files/zvols with small recordsize/volblocksize), which is
inherently difficult on fragmentation.  But you can decrease fragmentation
by using a larger sector size (ashift), at the cost of some compression
ratio.
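
(A sketch of what that looks like in practice; ashift is fixed per vdev at
creation time, so it only applies to new pools or newly added vdevs, and the
pool/device names below are placeholders.)

    # ZoL: create a pool with 8K sectors (ashift=13) instead of the default
    zpool create -o ashift=13 tank mirror da0 da1

    # FreeBSD (legacy ZFS): raise the floor for auto-detected ashift before creating vdevs
    sysctl vfs.zfs.min_auto_ashift=13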

--matt

On Mon, Jul 8, 2019 at 2:43 PM  wrote:

> On Friday, July 05, 2019, at 7:43 PM, Matthew Ahrens wrote:
>
> How many metaslabs are there total, and how many are loaded?
>
>
> zdb -m for a disk (on machine 03, disk5, for future reference) says:
>
> Metaslabs:
> vdev  0
> metaslabs   116   offset   spacemap   free
> ---------------   ------   --------   -----
> metaslab  0   offset    0   spacemap 37   free 9.87G
> metaslab  1   offset    8   spacemap 38   free 10.1G
> metaslab  2   offset   10   spacemap 44   free 11.5G
> metaslab  3   offset   18   spacemap 45   free 12.1G
> metaslab  4   offset   20   spacemap 50   free 10.4G
> metaslab  5   offset   28   spacemap 55   free 9.59G
> metaslab  6   offset   30   spacemap 56   free 9.43G
> metaslab  7   offset   38   spacemap 59   free 10.9G
> metaslab  8   offset   40   spacemap 60   free 10.6G
> metaslab  9   offset   48   spacemap 64   free 10.3G
> metaslab 10   offset   50   spacemap 65   free 11.3G
> metaslab 11   offset   58   spacemap 69   free 13.7G
> metaslab 12   offset   60   spacemap 70   free 12.6G
> metaslab 13   offset   68   spacemap 74   free 12.7G
> metaslab 14   offset   70   spacemap 41   free 13.3G
> metaslab 15   offset   78   spacemap 46   free 12.3G
> metaslab 16   offset   80   spacemap 52   free 11.6G
> metaslab 17   offset   88   spacemap 61   free 10.8G
> metaslab 18   offset   90   spacemap 66   free 10.4G
> metaslab 19   offset   98   spacemap 71   free 11.5G
> metaslab 20   offset   a0   spacemap 75   free 11.8G
> metaslab 21   offset   a8   spacemap 79   free 11.5G
> metaslab 22   offset   b0   spacemap 36   free 13.0G
> metaslab 23   offset   b8   spacemap 80   free 12.1G
> metaslab 24   offset   c0   spacemap 43   free 13.1G
> metaslab 25   offset   c8   spacemap 82   free 12.9G
> metaslab 26   offset   d0   spacemap 83   free 13.4G
> metaslab 27   offset   d8   spacemap 54   free 10.9G
> metaslab 28   offset   e0   spacemap 86   free 12.0G
> metaslab 29   offset   e8   spacemap 58   free 11.3G
> metaslab 30   offset   f0   spacemap 89   free 10.3G
> metaslab 31   offset   f8   spacemap 63   free 11.9G
> metaslab 32   offset  100   spacemap 90   free 10.4G
> metaslab 33   offset  108   spacemap 68   free 12.5G
> metaslab 34   offset  110   spacemap 91   free 11.7G
> metaslab 35   offset  118   spacemap 73   free 12.1G
> metaslab 36   offset  120   spacemap 39   free 11.1G
> metaslab 37   offset  128   spacemap 94   free 11.7G
> metaslab 38   offset  130   spacemap 51   free 12.5G
> metaslab 39   offset  138