On Thu, Jan 21, 2010 at 10:00 PM, Richard Elling
<richard.ell...@gmail.com> wrote:
> On Jan 21, 2010, at 8:04 AM, erik.ableson wrote:
>
>> Hi all,
>>
>> I'm going to be trying out some tests using b130 for dedup on a server with
>> about 1.7 TB of usable storage (14x146 in two raidz vdevs of 7 disks). What
>> I'm trying to get a handle on is how to estimate the memory overhead
>> required for dedup on that amount of storage. From what I gather, the dedup
>> hash keys are held in ARC and L2ARC and as such are in competition for the
>> available memory.
>
> ... and written to disk, of course.
>
> For ARC sizing, more is always better.
>
>> So the question is how much memory or L2ARC would be necessary to ensure
>> that I'm never going back to disk to read out the hash keys. Better yet
>> would be some kind of algorithm for calculating the overhead, e.g. an
>> average block size of 4K means a hash key for every 4K stored, and each
>> hash occupies 256 bits. An associated question is then: how does the ARC
>> handle competition between hash keys and regular ARC functions?
>
> AFAIK, there is no special treatment given to the DDT. The DDT is stored like
> other metadata and (currently) not easily accounted for.
>
> Also, the DDT keys are 320 bits. The key itself includes the logical and
> physical block size and compression. The DDT entry is even larger.
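
For illustration, those 320 bits break down roughly as the 256-bit block
checksum plus a 64-bit word carrying the logical size, physical size and
compression function; a back-of-the-envelope sketch in Python (field names
are mine, not taken from the source):

# Rough sketch of the 320-bit DDT key described above; names are
# illustrative, not lifted from the ZFS code.
CHECKSUM_BITS = 256     # e.g. sha256 of the block contents
PROP_BITS = 64          # logical size, physical size, compression, packed
KEY_BYTES = (CHECKSUM_BITS + PROP_BITS) // 8
print("%d bytes per DDT key" % KEY_BYTES)   # -> 40 bytes; the full entry
                                            # (refcounts, block pointers) is larger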

Looking at the dedup code, I noticed that on-disk DDT entries are
compressed less efficiently than they could be: the key is not compressed at
all (I'd expect roughly a 2:1 compression ratio with sha256 data), while the
rest of the entry is currently passed through the zle compressor only (I'd
expect that to be less efficient than off-the-shelf compressors; feel free
to correct me if I'm wrong). Is this a v1 that is going to be improved in
the future?
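
One quick way to sanity-check my 2:1 expectation, by the way, would be to
push a buffer of sha256 digests through an off-the-shelf compressor and look
at the ratio; a throwaway Python sketch (nothing ZFS-specific):

# Throwaway check: how well does a run of sha256 digests actually compress?
import hashlib, os, zlib

digests = b"".join(hashlib.sha256(os.urandom(64)).digest() for _ in range(10000))
ratio = float(len(digests)) / len(zlib.compress(digests, 9))
print("compression ratio: %.2f" % ratio)
# sha256 output is pseudo-random, so this measures how much redundancy (if
# any) a general-purpose compressor can actually find in the keys.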

Further, given the huge dedup memory footprint and the heavy performance
impact when DDT entries need to be read from disk, it might be worthwhile
to consider compressing in-core DDT entries (either specifically for the
DDT or, more generally, by making the ARC/L2ARC compression-aware). Has
this been considered?

Regards,
Andrey

>
> I think it is better to think of the ARC as caching the uncompressed DDT
> blocks which were written to disk.  The number of these will be data 
> dependent.
> "zdb -S poolname" will give you an idea of the number of blocks and how well
> dedup will work on your data, but that means you already have the data in a
> pool.
>  -- richard
>
>
>> Based on these estimations, I think that I should be able to calculate the 
>> following:
>> 1.7            TB usable storage
>> 1740.8         GB
>> 1782579.2      MB
>> 1825361100.8   KB
>> 4              KB average block size
>> 456340275.2    blocks
>> 256            bits per hash key
>> 1.16823E+11    hash key overhead - bits
>> 14602888806.4  hash key overhead - bytes
>> 14260633.6     hash key overhead - KB
>> 13926.4        hash key overhead - MB
>> 13.6           hash key overhead - GB
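
(That arithmetic is easy to reproduce and to re-run with different
assumptions; a small Python sketch using the same figures:)

# Reproduces the estimate above: 1.7 TB of 4 KB blocks, 256-bit hash keys.
total_kb = 1.7 * 1024**3        # 1.7 TB expressed in KB
block_kb = 4                    # assumed average block size
key_bytes = 256 // 8            # 256-bit hash key

blocks = total_kb / block_kb                    # ~456 million blocks
overhead_gb = blocks * key_bytes / 1024**3      # key bytes -> GB
print("%.1f GB of hash keys" % overhead_gb)     # -> 13.6 GB
# Substituting the 40-byte (320-bit) DDT key, or a larger full-entry size,
# scales this figure linearly.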
>>
>> Of course the big question on this will be the average block size - or
>> better yet, being able to analyze an existing datastore to see just how
>> many blocks it uses and what the current distribution of block sizes is.
>> I'm currently playing around with zdb, with mixed success at extracting
>> this kind of data. The figures above are also a worst-case scenario, since
>> they assume really small blocks and 100% of the available storage in use -
>> highly unlikely.
>>
>> # zdb -ddbb siovale/iphone
>> Dataset siovale/iphone [ZPL], ID 2381, cr_txg 3764691, 44.6G, 99 objects
>>
>>    ZIL header: claim_txg 0, claim_blk_seq 0, claim_lr_seq 0 replay_seq 0, 
>> flags 0x0
>>
>>    Object  lvl   iblk   dblk  dsize  lsize   %full  type
>>         0    7    16K    16K  57.0K    64K   77.34  DMU dnode
>>         1    1    16K     1K  1.50K     1K  100.00  ZFS master node
>>         2    1    16K    512  1.50K    512  100.00  ZFS delete queue
>>         3    2    16K    16K  18.0K    32K  100.00  ZFS directory
>>         4    3    16K   128K   408M   408M  100.00  ZFS plain file
>>         5    1    16K    16K  3.00K    16K  100.00  FUID table
>>         6    1    16K     4K  4.50K     4K  100.00  ZFS plain file
>>         7    1    16K  6.50K  6.50K  6.50K  100.00  ZFS plain file
>>         8    3    16K   128K   952M   952M  100.00  ZFS plain file
>>         9    3    16K   128K   912M   912M  100.00  ZFS plain file
>>        10    3    16K   128K   695M   695M  100.00  ZFS plain file
>>        11    3    16K   128K   914M   914M  100.00  ZFS plain file
>>
>> Now, if I'm understanding this output properly, object 4 is composed of
>> 128KB blocks with a total size of 408MB, meaning that it uses 3264 blocks.
>> Can someone confirm (or correct) that assumption? Also, I note that each
>> object (as far as my limited testing has shown) has a single block size
>> with no internal variation.
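
(For what it's worth, lsize divided by dblk should give the per-object block
count, e.g. 408M / 128K = 3264, and that can be pulled out of the zdb table
mechanically; a hypothetical little Python parser over the format shown
above:)

# Hypothetical helper: per-object block counts from a "zdb -ddbb" object
# table, using the dblk and lsize columns as printed above.
import re

UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def to_bytes(field):
    m = re.match(r"([\d.]+)([KMGT]?)", field)
    return float(m.group(1)) * UNITS.get(m.group(2), 1)

def block_counts(zdb_output):
    for line in zdb_output.splitlines():
        cols = line.split()
        # data rows look like: Object lvl iblk dblk dsize lsize %full type
        if len(cols) >= 7 and cols[0].isdigit():
            yield cols[0], int(round(to_bytes(cols[5]) / to_bytes(cols[3])))

# e.g. the row for object 4 above yields ('4', 3264): 408M / 128K blocks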
>>
>> Interestingly, all of my zvols seem to use fixed-size blocks - that is,
>> there is no variation in the block sizes - they're all the size defined at
>> creation, with no dynamic block sizes being used. I previously thought that
>> the -b option set the maximum size, rather than fixing all blocks. Learned
>> something today :-)
>>
>> # zdb -ddbb siovale/testvol
>> Dataset siovale/testvol [ZVOL], ID 45, cr_txg 4717890, 23.9K, 2 objects
>>
>>    Object  lvl   iblk   dblk  dsize  lsize   %full  type
>>         0    7    16K    16K  21.0K    16K    6.25  DMU dnode
>>         1    1    16K    64K      0    64K    0.00  zvol object
>>         2    1    16K    512  1.50K    512  100.00  zvol prop
>>
>> # zdb -ddbb siovale/tm-media
>> Dataset siovale/tm-media [ZVOL], ID 706, cr_txg 4426997, 240G, 2 objects
>>
>>    ZIL header: claim_txg 0, claim_blk_seq 0, claim_lr_seq 0 replay_seq 0, 
>> flags 0x0
>>
>>    Object  lvl   iblk   dblk  dsize  lsize   %full  type
>>         0    7    16K    16K  21.0K    16K    6.25  DMU dnode
>>         1    5    16K     8K   240G   250G   97.33  zvol object
>>         2    1    16K    512  1.50K    512  100.00  zvol prop
>>
>
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
