Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-05-04 Thread Edward Ned Harvey
 From: Richard Elling [mailto:richard.ell...@gmail.com]
 Sent: Friday, April 29, 2011 12:49 AM
 
 The lower bound of ARC size is c_min
 
 # kstat -p zfs::arcstats:c_min

I see there is another character in the plot:  c_max
c_max seems to be 80% of system RAM (at least on my systems).

I assume this means the ARC will never grow larger than 80% of RAM, so if
you're trying to calculate the RAM needed for your system in order to hold
the DDT and L2ARC references in ARC, this had better be factored into the
equation.
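
For reference, both bounds (and the current ARC size) can be read straight
from arcstats on Solaris-derived systems, and prtconf shows physical RAM to
compare against, e.g.:

kstat -p zfs::arcstats:c_min
kstat -p zfs::arcstats:c_max
kstat -p zfs::arcstats:size       # current ARC size, for comparison
prtconf | grep -i 'memory size'   # physical RAM, to work out the 80%/10% figures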

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-30 Thread Brandon High
On Thu, Apr 28, 2011 at 6:48 PM, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
 What does it mean / what should you do, if you run that command, and it
 starts spewing messages like this?
 leaked space: vdev 0, offset 0x3bd8096e00, size 7168

I'm not sure there's much you can do about it short of deleting
datasets and/or snapshots.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-30 Thread Sean Sprague



: xvm-4200m2-02 ;
 

I can do the echo | mdb -k.  But what is that : xvm-4200 command?
   


My guess is that is a very odd shell prompt ;-)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-30 Thread Neil Perrin

On 04/30/11 01:41, Sean Sprague wrote:



: xvm-4200m2-02 ;
 

I can do the echo | mdb -k.  But what is that : xvm-4200 command?
   


My guess is that is a very odd shell prompt ;-)

- Indeed.
   ':' means what follows is a comment (at least to /bin/ksh).
   'xvm-4200m2-02' is the comment - actually the system name (not very
inventive).
   ';' ends the comment.

I use this because I can cut and paste entire lines back to the shell.

Sorry for the confusion: Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-30 Thread Roy Sigurd Karlsbakk
 And one of these:
 Assertion failed: space_map_load(msp->ms_map, zdb_space_map_ops,
 0x0,
 msp->ms_smo, spa->spa_meta_objset) == 0, file ../zdb.c, line 1439,
 function
 zdb_leak_init
 Abort (core dumped)
 
 I saved the core and ran again. This time it spewed leaked space
 messages
 for an hour, and completed. But the final result was physically
 impossible
 (it counted up 744k total blocks, which means something like 3Megs per
 block
 in my 2.39T used pool. I checked compressratio is 1.00x and I have no
 compression.)
 
 I ran again.
 
 Still spewing messages. This can't be a good sign.
 
 Anyone know what it means, or what to do about it?

IIRC it runs out of memory, not space.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly.
It is an elementary imperative for all pedagogues to avoid excessive use of
idioms of foreign origin. In most cases adequate and relevant synonyms exist
in Norwegian.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-29 Thread Edward Ned Harvey
 From: Richard Elling [mailto:richard.ell...@gmail.com]
 
  Worse yet, your arc consumption could be so large, that
  PROCESSES don't fit in ram anymore.  In this case, your processes get
 pushed
  out to swap space, which is really bad.
 
 This will not happen. The ARC will be asked to shrink when other memory
 consumers demand memory. The lower bound of ARC size is c_min

Makes sense.  Is c_min a constant?  Suppose processes are consuming a lot of
memory.  Will c_min protect L2ARC entries in the ARC?  At least on my
systems, it seems that c_min is fixed at 10% of the total system memory.

If c_min is sufficiently small relative to the amount of ARC that would be
necessary to index the L2ARC...  Since every entry in L2ARC requires an
entry in ARC, this seems to imply that if process memory consumption is
high, then both the ARC and L2ARC are effectively useless.
 
Things sometimes get evicted from ARC completely, and sometimes they get
evicted into L2ARC with only a reference still remaining in ARC.  But if
processes consume enough memory on the system so as to shrink the ARC to
effectively nonexistent, then the L2ARC must also be nonexistent.


 L2ARC is populated by a thread that watches the soon-to-be-evicted list.

This seems to imply that if processes start consuming a lot of memory, the
first thing to disappear is the ARC, and the second thing to disappear is the
L2ARC (because the L2ARC references stored in ARC get evicted from ARC after
other things in ARC).


 AVL trees

Good to know.  Thanks.


  So the point is - Whenever you do a write, and the calculated DDT is not
  already in ARC/L2ARC, the system will actually perform several small
reads
  looking for the DDT entry before it finally knows that the DDT entry
  actually exists.  So the penalty of performing a write, with dedup
enabled,
  and the relevant DDT entry not already in ARC/L2ARC is a very large
 penalty.
 
 very is a relative term, 

Agreed.  Here is what I was implying:
Suppose you don't have enough RAM to hold the complete DDT, and you perform
a bunch of random writes (whether sync or async).  Then you will suffer a
lot of cache misses searching for DDT entries, and the consequence is that
every little write that could potentially have cost only the disk penalty of
one small write instead incurs the disk penalty of several reads plus a
write.  So your random write performance is effectively several times slower
than it could have been, if only you had more RAM.

Reads are unaffected, except if there's random-write congestion hogging disk
time.
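
To put rough numbers on the write penalty (the IOPS figure and the number of
extra reads per miss are assumptions for illustration, not measurements):

DISK_IOPS=100     # assumed random IOPS for a single 7200rpm disk
EXTRA_READS=3     # assumed extra reads needed to fetch a missing DDT entry
echo "random writes/sec, DDT entry cached:  $DISK_IOPS"
echo "random writes/sec, DDT entry on disk: $((DISK_IOPS / (1 + EXTRA_READS)))"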

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-29 Thread Roy Sigurd Karlsbakk
 Controls whether deduplication is in effect for a
 dataset. The default value is off. The default checksum
 used for deduplication is sha256 (subject to change).
snip/
 
 This is from b159.

This was fletcher4 earlier, and still is in opensolaris/openindiana. Given a 
combination with verify (which I would use anyway, since there are always tiny 
chances of collisions), why would sha256 be a better choice?

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly.
It is an elementary imperative for all pedagogues to avoid excessive use of
idioms of foreign origin. In most cases adequate and relevant synonyms exist
in Norwegian.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-29 Thread Brandon High
On Fri, Apr 29, 2011 at 7:10 AM, Roy Sigurd Karlsbakk r...@karlsbakk.net 
wrote:
 This was fletcher4 earlier, and still is in opensolaris/openindiana. Given a 
 combination with verify (which I would use anyway, since there are always 
 tiny chances of collisions), why would sha256 be a better choice?

fletcher4 was only an option for snv_128, which was quickly pulled and
replaced with snv_128b, which removed fletcher4 as an option.

The official post is here:
http://www.opensolaris.org/jive/thread.jspa?threadID=118519&tstart=0#437431

It looks like fletcher4 is still an option in snv_151a for non-dedup
datasets, and is in fact the default.

As an aside: Erik, any idea when the 159 bits will make it to the public?

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-29 Thread Erik Trimble

On 4/29/2011 9:44 AM, Brandon High wrote:

On Fri, Apr 29, 2011 at 7:10 AM, Roy Sigurd Karlsbakkr...@karlsbakk.net  
wrote:

This was fletcher4 earlier, and still is in opensolaris/openindiana. Given a 
combination with verify (which I would use anyway, since there are always tiny 
chances of collisions), why would sha256 be a better choice?

fletcher4 was only an option for snv_128, which was quickly pulled and
replaced with snv_128b which removed fletcher4 as an option.

The official post is here:
http://www.opensolaris.org/jive/thread.jspa?threadID=118519&tstart=0#437431

It looks like fletcher4 is still an option in snv_151a for non-dedup
datasets, and is in fact the default.

As an aside: Erik, any idea when the 159 bits will make it to the public?

-B


Yup, fletcher4 is still the default for any fileset not using dedup.
It's good enough, and I can't see any reason to change it for those
purposes (since its collision problems aren't much of an issue when
just doing data integrity checks).


Sorry, no idea on release date stuff. I'm completely out of the loop on 
release info.  I'm lucky if I can get a heads up before it actually gets 
published internally.


:-(


I'm just a lowly Java Platform Group dude.   Solaris ain't my silo.

--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-29 Thread Edward Ned Harvey
 From: Edward Ned Harvey
 I saved the core and ran again.  This time it spewed leaked space
messages
 for an hour, and completed.  But the final result was physically
impossible (it
 counted up 744k total blocks, which means something like 3Megs per block
in
 my 2.39T used pool.  I checked compressratio is 1.00x and I have no
 compression.)
 
 I ran again.
 
 Still spewing messages.  This can't be a good sign.
 
 Anyone know what it means, or what to do about it?

After running again, I get an even more impossible number ... 45.4K total
blocks, which would mean something like 50 megs per block.

This pool does scrub regularly (every other week).  In fact, it's scheduled
to scrub this weekend.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-29 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
 
 What does it mean / what should you do, if you run that command, and it
 starts spewing messages like this?
 leaked space: vdev 0, offset 0x3bd8096e00, size 7168

And one of these:
Assertion failed: space_map_load(msp->ms_map, zdb_space_map_ops, 0x0,
msp->ms_smo, spa->spa_meta_objset) == 0, file ../zdb.c, line 1439, function
zdb_leak_init
Abort (core dumped)

I saved the core and ran again.  This time it spewed leaked space messages
for an hour, and completed.  But the final result was physically impossible
(it counted up 744k total blocks, which means something like 3Megs per block
in my 2.39T used pool.  I checked compressratio is 1.00x and I have no
compression.)

I ran again.

Still spewing messages.  This can't be a good sign.

Anyone know what it means, or what to do about it?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-29 Thread Edward Ned Harvey
 From: Neil Perrin [mailto:neil.per...@oracle.com]

 The size of these structures will vary according to the release you're
running.
 You can always find out the size for a particular system using ::sizeof
within
 mdb. For example, as super user :
 
 : xvm-4200m2-02 ; echo ::sizeof ddt_entry_t | mdb -k
 sizeof (ddt_entry_t) = 0x178
 : xvm-4200m2-02 ; echo ::sizeof arc_buf_hdr_t | mdb -k
 sizeof (arc_buf_hdr_t) = 0x100
 : xvm-4200m2-02 ;

I can do the echo | mdb -k.  But what is that : xvm-4200 command?  

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-28 Thread Erik Trimble
OK, I just re-looked at a couple of things, and here's what I /think/ are
the correct numbers.

A single entry in the DDT is defined in the struct ddt_entry :

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/sys/ddt.h#108

I just checked, and the current size of this structure is 0x178, or 376
bytes.


Each ARC entry, which points to either an L2ARC item (of any kind,
cached data, metadata, or a DDT line) or actual data/metadata/etc., is
defined in the struct arc_buf_hdr :

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#431

Its current size is 0xb0, or 176 bytes.

These are fixed-size structures.


PLEASE - someone correct me if these two structures AREN'T what we
should be looking at.



So, our estimate calculations have to be based on these new numbers.


Back to the original scenario:

1TB (after dedup) of 4k blocks: how much space is needed for the DDT,
and how much ARC space is needed if the DDT is kept in a L2ARC cache
device?

Step 1)  1TB (2^40 bytes) stored in blocks of 4k (2^12) = 2^28 blocks
total, which is about 268 million.

Step 2)  2^28 blocks of information in the DDT requires  376 bytes/block
* 2^28 blocks = 94 * 2^30 = 94 GB of space.  

Step 3)  Storing a reference to 268 million (2^28) DDT entries in the
L2ARC will consume the following amount of ARC space: 176 bytes/entry *
2^28 entries = 44GB of RAM.


That's pretty ugly.


So, to summarize:

For 1TB of data, broken into the following block sizes:
Block size  DDT size        ARC consumption
512B        752GB   (73%)   352GB  (34%)
4k          94GB    (9%)    44GB   (4.3%)
8k          47GB    (4.5%)  22GB   (2.1%)
32k         11.75GB (1.1%)  5.5GB  (0.5%)
64k         5.9GB   (0.6%)  2.75GB (0.3%)
128k        2.9GB   (0.3%)  1.4GB  (0.1%)

ARC consumption presumes the whole DDT is stored in the L2ARC.

Percentage size is relative to the original 1TB total data size
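
A small shell sketch that reproduces the table above from the 376/176-byte
structure sizes quoted earlier (sizes vary by release, and the arithmetic is
integer GB, so treat the output as an estimate):

DATA=$((1 << 40))     # 1TB of already-deduped data
DDT_ENTRY=376         # sizeof (ddt_entry_t) as measured above
ARC_HDR=176           # sizeof (arc_buf_hdr_t) as measured above
for BS in 512 4096 8192 32768 65536 131072; do
    BLOCKS=$((DATA / BS))
    # integer division, so sub-GB fractions are truncated
    echo "$BS: DDT $((BLOCKS * DDT_ENTRY / 1024 / 1024 / 1024))GB," \
         "ARC $((BLOCKS * ARC_HDR / 1024 / 1024 / 1024))GB"
done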



Of course, the trickier proposition here is that we DON'T KNOW what our
dedup value is ahead of time on a given data set.  That is, given a data
set of X size, we don't know how big the deduped data size will be. The
above calculations are for DDT/ARC size for a data set that has already
been deduped down to 1TB in size.


Perhaps it would be nice to have some sort of userland utility that
builds its own DDT as a test and does all the above calculations, to
see how dedup would work on a given dataset.  'zdb -S' sorta, kinda does
that, but...


-- 
Erik Trimble
Java System Support
Mailstop:  usca22-317
Phone:  x67195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-28 Thread Edward Ned Harvey
 From: Erik Trimble [mailto:erik.trim...@oracle.com]
 
 OK, I just re-looked at a couple of things, and here's what I /think/ is
 the correct numbers.
 
 I just checked, and the current size of this structure is 0x178, or 376
 bytes.
 
 Each ARC entry, which points to either an L2ARC item (of any kind,
 cached data, metadata, or a DDT line) or actual data/metadata/etc., is
 defined in the struct arc_buf_hdr :
 
 http://src.opensolaris.org/source/xref/onnv/onnv-
 gate/usr/src/uts/common/fs/zfs/arc.c#431
 
 It's current size is 0xb0, or 176 bytes.
 
 These are fixed-size structures.

heheheh...  See what I mean about all the conflicting sources of
information?  Is it 376 and 176?  Or is it 270 and 200?
Erik says it's fixed-size.  Richard says "The DDT entries vary in size."

So far, what Erik says is at least based on reading the source code, with a
disclaimer of possibly misunderstanding the source code.  What Richard says
is just a statement of supposed absolute fact without any backing.

In any event, thank you both for your input.  Can anyone answer these
authoritatively?  (Neil?)   I'll send you a pizza.  ;-)


 For 1TB of data, broken into the following block sizes:
 Block size  DDT size        ARC consumption
 512B        752GB   (73%)   352GB  (34%)
 4k          94GB    (9%)    44GB   (4.3%)
 8k          47GB    (4.5%)  22GB   (2.1%)
 32k         11.75GB (1.1%)  5.5GB  (0.5%)
 64k         5.9GB   (0.6%)  2.75GB (0.3%)
 128k        2.9GB   (0.3%)  1.4GB  (0.1%)

At least the methodology to calculate all this seems reasonable to me.  If
the new numbers (376 and 176) are correct, I would just state it like this:

DDT size = 376b * # unique blocks
You can find the number of blocks in an existing filesystem using zdb -bb
poolname

ARC consumption = 176b * #blocks in the L2ARC
You can estimate the #blocks in L2ARC if you divide total pool disk usage
by the number of blocks in the pool obtained above, to find the average block
size in the pool.  Divide the total L2ARC capacity by the average block size,
and you get the number of average-sized blocks stored in your L2ARC.
(Or take L2ARC capacity / total pool usage * #blocks in the whole pool, to
estimate #blocks in L2ARC.)
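
As a rough sketch of that estimate ('tank', the pool usage and the block
count below are placeholders; the block count comes from the block summary
that zdb -bb prints, and 176 bytes/header is the figure quoted above):

zdb -bb tank                  # prints a per-pool block summary; note the total block count

POOL_USED=2628519985152       # bytes used in the pool (e.g. from zpool list), placeholder
POOL_BLOCKS=20000000          # total block count reported by zdb -bb, placeholder
L2ARC_SIZE=$((160 * 1024 * 1024 * 1024))
AVG_BS=$((POOL_USED / POOL_BLOCKS))
echo "average block size:          $AVG_BS bytes"
echo "blocks in a full L2ARC:      $((L2ARC_SIZE / AVG_BS))"
echo "ARC needed for their headers: $((L2ARC_SIZE / AVG_BS * 176 / 1024 / 1024)) MB"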


 
 ARC consumption presumes the whole DDT is stored in the L2ARC.
 
 Percentage size is relative to the original 1TB total data size
 
 
 
 Of course, the trickier proposition here is that we DON'T KNOW what our
 dedup value is ahead of time on a given data set.  That is, given a data
 set of X size, we don't know how big the deduped data size will be. The
 above calculations are for DDT/ARC size for a data set that has already
 been deduped down to 1TB in size.
 
 
 Perhaps it would be nice to have some sort of userland utility that
 builds it's own DDT as a test and does all the above calculations, to
 see how dedup would work on a given dataset.  'zdb -S' sorta, kinda does
 that, but...
 
 
 --
 Erik Trimble
 Java System Support
 Mailstop:  usca22-317
 Phone:  x67195
 Santa Clara, CA
 Timezone: US/Pacific (GMT-0800)


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-28 Thread Neil Perrin

On 4/28/11 12:45 PM, Edward Ned Harvey wrote:

From: Erik Trimble [mailto:erik.trim...@oracle.com]

OK, I just re-looked at a couple of things, and here's what I /think/ is
the correct numbers.

I just checked, and the current size of this structure is 0x178, or 376
bytes.

Each ARC entry, which points to either an L2ARC item (of any kind,
cached data, metadata, or a DDT line) or actual data/metadata/etc., is
defined in the struct arc_buf_hdr :

http://src.opensolaris.org/source/xref/onnv/onnv-
gate/usr/src/uts/common/fs/zfs/arc.c#431

It's current size is 0xb0, or 176 bytes.

These are fixed-size structures.

heheheh...  See what I mean about all the conflicting sources of
information?  Is it 376 and 176?  Or is it 270 and 200?
Erik says it's fixed-size.  Richard says The DDT entries vary in size.

So far, what Erik says is at least based on reading the source code, with a
disclaimer of possibly misunderstanding the source code.  What Richard says
is just a statement of supposed absolute fact without any backing.

In any event, thank you both for your input.  Can anyone answer these
authoritatively?  (Neil?)   I'll send you a pizza.  ;-)



- I wouldn't consider myself an authority on the dedup code.
The size of these structures will vary according to the release you're running. 
You can always find out the size for a particular system using ::sizeof within
mdb. For example, as super user :

: xvm-4200m2-02 ; echo ::sizeof ddt_entry_t | mdb -k
sizeof (ddt_entry_t) = 0x178
: xvm-4200m2-02 ; echo ::sizeof arc_buf_hdr_t | mdb -k
sizeof (arc_buf_hdr_t) = 0x100
: xvm-4200m2-02 ;

This shows yet another size. Also there are more changes planned within
the arc. Sorry, I can't talk about those changes, nor about when you'll
see them.

However, that's not the whole story. It looks like arc_buf_hdr_t structures
use their own kmem cache, so there should be little wastage, but
ddt_entry_t structures are allocated from the generic kmem caches and so will
probably have some roundup and unused space. Caches for small buffers
are aligned to 64 bytes. See kmem_alloc_sizes[] and comment:

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/kmem.c#920
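
As a concrete illustration of that round-up, assuming the 64-byte alignment
described above (this is an inference from the comment, not a measurement of
a live system):

SIZE=376; ALIGN=64
echo "allocated: $(( (SIZE + ALIGN - 1) / ALIGN * ALIGN )) bytes"   # prints 384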

Pizza: Mushroom and anchovy - er, just kidding.

Neil.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-28 Thread Erik Trimble
On Thu, 2011-04-28 at 13:59 -0600, Neil Perrin wrote:
 On 4/28/11 12:45 PM, Edward Ned Harvey wrote:
 
  In any event, thank you both for your input.  Can anyone answer these
  authoritatively?  (Neil?)   I'll send you a pizza.  ;-)
 
 
 - I wouldn't consider myself an authority on the dedup code.
 The size of these structures will vary according to the release you're 
 running. You can always find out the size for a particular system using 
 ::sizeof within
 mdb. For example, as super user :
 
 : xvm-4200m2-02 ; echo ::sizeof ddt_entry_t | mdb -k
 sizeof (ddt_entry_t) = 0x178
 : xvm-4200m2-02 ; echo ::sizeof arc_buf_hdr_t | mdb -k
 sizeof (arc_buf_hdr_t) = 0x100
 : xvm-4200m2-02 ;
 

yup, that's how I got them.  Just to add to the confusion, there are
typedefs in the code which can make names slightly different:

typedef struct arc_buf_hdr arc_buf_hdr_t;

typedef struct ddt_entry ddt_entry_t;


I got my values from a x86 box running b159, and a SPARC box running
S10u9.  The values were the same from both.

E.g.:

root@invisible:~# uname -a
SunOS invisible 5.11 snv_159 i86pc i386 i86pc Solaris
root@invisible:~# isainfo
amd64 i386
root@invisible:~# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc
pcplusmp scsi_vhci zfs ip hook neti arp usba uhci fctl stmf kssl
stmf_sbd sockfs lofs random sata sd fcip cpc crypto nfs logindmux ptm
ufs sppp ipc ]
> ::sizeof struct arc_buf_hdr
sizeof (struct arc_buf_hdr) = 0xb0
> ::sizeof struct ddt_entry
sizeof (struct ddt_entry) = 0x178




 This shows yet another size. Also there are more changes planned within
 the arc. Sorry, I can't talk about those changes, nor about when you'll
 see them.
 
 However, that's not the whole story. It looks like the arc_buf_hdr_t
 use their own kmem cache so there should be little wastage, but the
 ddt_entry_t are allocated from the generic kmem caches and so will
 probably have some roundup and unused space. Caches for small buffers
 are aligned to 64 bytes. See kmem_alloc_sizes[] and comment:
 
 http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/kmem.c#920
 

Ugg. I hadn't even thought of memory alignment/allocation issues.


 Pizza: Mushroom and anchovy - er, just kidding.
 
 Neil.

And, let me say: Yuck!  What is that, an ISO-standard pizza? Disgusting.
ANSI-standard pizza, all the way!  (pepperoni & mushrooms)



-- 
Erik Trimble
Java System Support
Mailstop:  usca22-317
Phone:  x67195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-28 Thread Brandon High
On Wed, Apr 27, 2011 at 9:26 PM, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
 Correct me if I'm wrong, but the dedup sha256 checksum happens in addition
 to (not instead of) the fletcher2 integrity checksum.  So after bootup,

My understanding is that enabling dedup forces sha256.

The default checksum used for deduplication is sha256 (subject to
change). When dedup is enabled, the dedup checksum algorithm overrides
the checksum property.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-28 Thread Erik Trimble
On Thu, 2011-04-28 at 14:33 -0700, Brandon High wrote:
 On Wed, Apr 27, 2011 at 9:26 PM, Edward Ned Harvey
 opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
  Correct me if I'm wrong, but the dedup sha256 checksum happens in addition
  to (not instead of) the fletcher2 integrity checksum.  So after bootup,
 
 My understanding is that enabling dedup forces sha256.
 
 The default checksum used for deduplication is sha256 (subject to
 change). When dedup is enabled, the dedup checksum algorithm overrides
 the checksum property.
 
 -B
 

From the man page for zfs(1)


 dedup=on | off | verify | sha256[,verify]

 Controls  whether  deduplication  is  in  effect  for  a
 dataset.  The default value is off. The default checksum
 used for deduplication is sha256  (subject  to  change).
 When  dedup  is  enabled,  the  dedup checksum algorithm
 overrides the checksum property. Setting  the  value  to
 verify is equivalent to specifying sha256,verify.

 If the property is set to  verify,  then,  whenever  two
 blocks  have the same signature, ZFS will do a byte-for-
 byte comparison with the existing block to  ensure  that
 the contents are identical.




This is from b159.



A careful reading of the man page seems to imply that there's no way to
change the dedup checksum algorithm from sha256, as the dedup property
ignores the checksum property, and there's no provided way to explicitly
set a checksum algorithm specific to dedup (i.e. there's no way to
override the default for dedup).
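
Concretely, the only knobs are the values listed in the man page excerpt
above; 'tank/data' is just a placeholder dataset name:

zfs set dedup=sha256,verify tank/data   # or dedup=on / dedup=verify
zfs get dedup,checksum tank/data        # show both properties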





-- 
Erik Trimble
Java System Support
Mailstop:  usca22-317
Phone:  x67195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-28 Thread Brandon High
On Thu, Apr 28, 2011 at 3:05 PM, Erik Trimble erik.trim...@oracle.com wrote:
 A careful reading of the man page seems to imply that there's no way to
 change the dedup checksum algorithm from sha256, as the dedup property
 ignores the checksum property, and there's no provided way to explicitly
 set a checksum algorithm specific to dedup (i.e. there's no way to
 override the default for dedup).

That's my understanding as well. The initial release used fletcher4 or
sha256, but there was either a bug in the fletcher4 code or a hash
collision that required removing it as an option.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-28 Thread Edward Ned Harvey
 From: Brandon High [mailto:bh...@freaks.com]
 Sent: Thursday, April 28, 2011 5:33 PM
 
 On Wed, Apr 27, 2011 at 9:26 PM, Edward Ned Harvey
 opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
  Correct me if I'm wrong, but the dedup sha256 checksum happens in
 addition
  to (not instead of) the fletcher2 integrity checksum.  So after bootup,
 
 My understanding is that enabling dedup forces sha256.
 
 The default checksum used for deduplication is sha256 (subject to
 change). When dedup is enabled, the dedup checksum algorithm overrides
 the checksum property.

Interesting.  So it would seem that the DDT probably does get populated
into ARC simply by having read something from disk.  That was one important
consequence in discussion ...   (DDT does not only get populated into ARC
during writes.)  PS. I'm only drawing conclusions here, so please tell me
I'm wrong if I'm wrong somehow.

The other important consequence, not yet answered:

When a block is scheduled to be written, the system computes its checksum and
looks for a matching entry in the DDT in ARC/L2ARC.  In the event of an ARC/L2ARC
cache miss for a DDT entry which actually exists, the system will need to
perform a number of small disk reads in order to fetch the DDT entry from
disk.  Correct?  I figure at least one, probably more than one, read to
locate the entry on disk, and then another read to actually read the entry.
After this, the system knows there is a checksum match between the block
waiting to be written, and another block that's already on disk, and it
could possibly have to do yet another read for verification, before it is
able to finally do the write.  Right?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-28 Thread Edward Ned Harvey
 From: Tomas Ögren [mailto:st...@acc.umu.se]
 
 zdb -bb pool

Oy - this is scary - Thank you by the way for that command - I've been
gathering statistics across a handful of systems now ...

What does it mean / what should you do, if you run that command, and it
starts spewing messages like this?
leaked space: vdev 0, offset 0x3bd8096e00, size 7168



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-28 Thread Richard Elling
[the dog jumped on the keyboard and wiped out my first reply, second attempt...]

On Apr 27, 2011, at 9:26 PM, Edward Ned Harvey wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Neil Perrin
 
 No, that's not true. The DDT is just like any other ZFS metadata and can
 be
 split over the ARC,
 cache device (L2ARC) and the main pool devices. An infrequently referenced
 DDT block will get
 evicted from the ARC to the L2ARC then evicted from the L2ARC.
 
 When somebody has their baseline system, and they're thinking about adding
 dedup and/or cache, I'd like to understand the effect of not having enough
 ram.  Obviously the impact will be performance, but precisely...

precisely only works when you know precisely what your data looks like.
For most folks, that is unknown in advance.

slow disks + small RAM = bad recipe for dedup

 At bootup, I presume the arc & l2arc are all empty.  So all the DDT entries
 reside in pool.  

Yes

 As the system reads things (anything, files etc) from pool,
 it will populate arc, and follow fill rate policies to populate the l2arc
 over time.  Every entry in l2arc requires 200 bytes of arc, regardless of
 what type of entry it is.  (A DDT entry in l2arc consumes just as much arc
 memory as any other type of l2arc entry.)

Approximately 200 bytes; this is subject to change.

  (Ummm...  What's the point of
 that?  Aren't DDT entries 270 bytes and ARC references 200 bytes?  

DDT entries vary in size. More references means more bytes needed.

 Seems
 like a very questionable benefit to allow DDT entries to get evicted into
 L2ARC.)  So the ram consumption caused by the presence of l2arc will
 initially be zero after bootup, and it will grow over time as the l2arc
 populates, up to a maximum which is determined linearly as 200 bytes * the
 number of entries that can fit in the l2arc.  Of course that number varies
 based on the size of each entry and size of l2arc, but at least you can
 estimate and establish upper and lower bounds.

Yes, this is simple enough to toss into a spreadsheet.
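
For example (all inputs are assumptions: ~200 bytes per L2ARC header as
discussed above, an 8k average block size, and a 160GB cache device):

L2ARC_SIZE=$((160 * 1024 * 1024 * 1024))   # assumed 160GB cache device
AVG_BS=$((8 * 1024))                       # assumed average block size
HDR=200                                    # approximate ARC header per L2ARC entry
ENTRIES=$((L2ARC_SIZE / AVG_BS))
echo "L2ARC entries: $ENTRIES"
echo "ARC overhead:  $((ENTRIES * HDR / 1024 / 1024)) MB"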

 So that's how the l2arc consumes system memory in arc.  The penalty of
 insufficient ram, in conjunction with enabled L2ARC, is insufficient arc
 availability for other purposes - Maybe the whole arc is consumed by l2arc
 entries, and so the arc doesn't have any room for other stuff like commonly
 used files.  

I've never witnessed such a condition and doubt that it would happen.

 Worse yet, your arc consumption could be so large, that
 PROCESSES don't fit in ram anymore.  In this case, your processes get pushed
 out to swap space, which is really bad.

This will not happen. The ARC will be asked to shrink when other memory 
consumers demand memory. The lower bound of ARC size is c_min

# kstat -p zfs::arcstats:c_min

 Correct me if I'm wrong, but the dedup sha256 checksum happens in addition
 to (not instead of) the fletcher2 integrity checksum.  

You are wrong, as others have pointed out.  Documented in the man page.

 So after bootup,
 while the system is reading a bunch of data from the pool, all those reads
 are not populating the arc/l2arc with DDT entries.  Reads are just
 populating the arc and l2arc with other stuff.

L2ARC is populated by a thread that watches the soon-to-be-evicted list.
If the flow through the ARC is much greater than the throttle of the L2ARC
filling thread, then the data just won't make it into the L2ARC. The throttle
changes after the ARC fills, so it can warm the L2ARC faster, but then
gets out of the way when needed.

 DDT entries don't get into the arc/l2arc until something tries to do a
 write.  When performing a write, dedup calculates the checksum of the block
 to be written, and then it needs to figure out if that's a duplicate of
 another block that's already on disk somewhere.  So (I guess this part)
 there's probably a tree-structure

AVL trees

 (I'll use the subdirectories and files
 analogy even though I'm certain that's not technically correct) on disk.
 You need to find the DDT entry, if it exists, for the block whose checksum
 is 1234ABCD.  So you start by looking under the 1 directory, and from there
 look for the 2 subdirectory, and then the 3 subdirectory, [...etc...] If you
 encounter not found at any step, then the DDT entry doesn't already exist
 and you decide to create a new one.  But if you get all the way down to the
 C subdirectory and it contains a file named D,  then you have found a
 possible dedup hit - the checksum matched another block that's already on
 disk.  Now the DDT entry is stored in ARC just like anything else you read
 from disk.

http://en.wikipedia.org/wiki/AVL_tree

 So the point is - Whenever you do a write, and the calculated DDT is not
 already in ARC/L2ARC, the system will actually perform several small reads
 looking for the DDT entry before it finally knows that the DDT entry
 actually exists.  So the penalty of performing a write, with dedup enabled,
 

Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-27 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Erik Trimble
 
 (BTW, is there any way to get a measurement of number of blocks consumed
 per zpool?  Per vdev?  Per zfs filesystem?)  *snip*.
 
 
 you need to use zdb to see what the current block usage is for a
filesystem.
 I'd have to look up the particular CLI usage for that, as I don't know
what it is
 off the top of my head.

Anybody know the answer to that one?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-27 Thread Tomas Ögren
On 27 April, 2011 - Edward Ned Harvey sent me these 0,6K bytes:

  From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
  boun...@opensolaris.org] On Behalf Of Erik Trimble
  
  (BTW, is there any way to get a measurement of number of blocks consumed
  per zpool?  Per vdev?  Per zfs filesystem?)  *snip*.
  
  
  you need to use zdb to see what the current block usage is for a
 filesystem.
  I'd have to look up the particular CLI usage for that, as I don't know
 what it is
  off the top of my head.
 
 Anybody know the answer to that one?

zdb -bb pool

/Tomas
-- 
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-27 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Neil Perrin
 
 No, that's not true. The DDT is just like any other ZFS metadata and can
be
 split over the ARC,
 cache device (L2ARC) and the main pool devices. An infrequently referenced
 DDT block will get
 evicted from the ARC to the L2ARC then evicted from the L2ARC.

When somebody has their baseline system, and they're thinking about adding
dedup and/or cache, I'd like to understand the effect of not having enough
ram.  Obviously the impact will be performance, but precisely...

At bootup, I presume the arc & l2arc are all empty.  So all the DDT entries
reside in pool.  As the system reads things (anything, files etc) from pool,
it will populate arc, and follow fill rate policies to populate the l2arc
over time.  Every entry in l2arc requires 200 bytes of arc, regardless of
what type of entry it is.  (A DDT entry in l2arc consumes just as much arc
memory as any other type of l2arc entry.)  (Ummm...  What's the point of
that?  Aren't DDT entries 270 bytes and ARC references 200 bytes?  Seems
like a very questionable benefit to allow DDT entries to get evicted into
L2ARC.)  So the ram consumption caused by the presence of l2arc will
initially be zero after bootup, and it will grow over time as the l2arc
populates, up to a maximum which is determined linearly as 200 bytes * the
number of entries that can fit in the l2arc.  Of course that number varies
based on the size of each entry and size of l2arc, but at least you can
estimate and establish upper and lower bounds.

So that's how the l2arc consumes system memory in arc.  The penalty of
insufficient ram, in conjunction with enabled L2ARC, is insufficient arc
availability for other purposes - Maybe the whole arc is consumed by l2arc
entries, and so the arc doesn't have any room for other stuff like commonly
used files.  Worse yet, your arc consumption could be so large, that
PROCESSES don't fit in ram anymore.  In this case, your processes get pushed
out to swap space, which is really bad.

Correct me if I'm wrong, but the dedup sha256 checksum happens in addition
to (not instead of) the fletcher2 integrity checksum.  So after bootup,
while the system is reading a bunch of data from the pool, all those reads
are not populating the arc/l2arc with DDT entries.  Reads are just
populating the arc and l2arc with other stuff.

DDT entries don't get into the arc/l2arc until something tries to do a
write.  When performing a write, dedup calculates the checksum of the block
to be written, and then it needs to figure out if that's a duplicate of
another block that's already on disk somewhere.  So (I guess this part)
there's probably a tree-structure (I'll use the subdirectories and files
analogy even though I'm certain that's not technically correct) on disk.
You need to find the DDT entry, if it exists, for the block whose checksum
is 1234ABCD.  So you start by looking under the 1 directory, and from there
look for the 2 subdirectory, and then the 3 subdirectory, [...etc...] If you
encounter not found at any step, then the DDT entry doesn't already exist
and you decide to create a new one.  But if you get all the way down to the
C subdirectory and it contains a file named D,  then you have found a
possible dedup hit - the checksum matched another block that's already on
disk.  Now the DDT entry is stored in ARC just like anything else you read
from disk.

So the point is - Whenever you do a write, and the calculated DDT is not
already in ARC/L2ARC, the system will actually perform several small reads
looking for the DDT entry before it finally knows that the DDT entry
actually exists.  So the penalty of performing a write, with dedup enabled,
and the relevant DDT entry not already in ARC/L2ARC is a very large penalty.
What originated as a single write quickly became several small reads plus a
write, due to the fact the necessary DDT entry was not already available.

The penalty of insufficient ram, in conjunction with dedup, is terrible
write performance.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-27 Thread Richard Elling
On Apr 27, 2011, at 9:26 PM, Edward Ned Harvey 
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Neil Perrin
 
 No, that's not true. The DDT is just like any other ZFS metadata and can
 be
 split over the ARC,
 cache device (L2ARC) and the main pool devices. An infrequently referenced
 DDT block will get
 evicted from the ARC to the L2ARC then evicted from the L2ARC.
 
 When somebody has their baseline system, and they're thinking about adding
 dedup and/or cache, I'd like to understand the effect of not having enough
 ram.  Obviously the impact will be performance, but precisely...

Precision is only possible if you know what the data looks like...

 At bootup, I presume the arc & l2arc are all empty.  So all the DDT entries
 reside in pool.  As the system reads things (anything, files etc) from pool,
 it will populate arc, and follow fill rate policies to populate the l2arc
 over time.  Every entry in l2arc requires 200 bytes of arc, regardless of
 what type of entry it is.  (A DDT entry in l2arc consumes just as much arc
 memory as any other type of l2arc entry.)  (Ummm...  What's the point of
 that?  Aren't DDT entries 270 bytes and ARC references 200 bytes?

No. The DDT entries vary in size.

  Seems
 like a very questionable benefit to allow DDT entries to get evicted into
 L2ARC.)  So the ram consumption caused by the presence of l2arc will
 initially be zero after bootup, and it will grow over time as the l2arc
 populates, up to a maximum which is determined linearly as 200 bytes * the
 number of entries that can fit in the l2arc.  Of course that number varies
 based on the size of each entry and size of l2arc, but at least you can
 estimate and establish upper and lower bounds.

The upper and lower bounds vary by 256x, unless you know what the data
looks like more precisely.

 So that's how the l2arc consumes system memory in arc.  The penalty of
 insufficient ram, in conjunction with enabled L2ARC, is insufficient arc
 availability for other purposes - Maybe the whole arc is consumed by l2arc
 entries, and so the arc doesn't have any room for other stuff like commonly
 used files.  

I've never seen this.

 Worse yet, your arc consumption could be so large, that
 PROCESSES don't fit in ram anymore.  In this case, your processes get pushed
 out to swap space, which is really bad.

[for Solaris, illumos, and NexentaOS]
This will not happen unless the ARC size is at arc_min. At that point you are
already close to severe memory shortfall.

 Correct me if I'm wrong, but the dedup sha256 checksum happens in addition
 to (not instead of) the fletcher2 integrity checksum.  

You are mistaken.

 So after bootup,
 while the system is reading a bunch of data from the pool, all those reads
 are not populating the arc/l2arc with DDT entries.  Reads are just
 populating the arc and l2arc with other stuff.

L2ARC is populated by a separate thread that watches the to-be-evicted list.
The L2ARC fill rate is also throttled, so that under severe shortfall, blocks
will be evicted without being placed in the L2ARC.

 DDT entries don't get into the arc/l2arc until something tries to do a
 write.  

No, the DDT entry contains the references to the actual data.

 When performing a write, dedup calculates the checksum of the block
 to be written, and then it needs to figure out if that's a duplicate of
 another block that's already on disk somewhere.  So (I guess this part)
 there's probably a tree-structure (I'll use the subdirectories and files
 analogy even though I'm certain that's not technically correct) on disk.

Implemented as an AVL tree.

 You need to find the DDT entry, if it exists, for the block whose checksum
 is 1234ABCD.  So you start by looking under the 1 directory, and from there
 look for the 2 subdirectory, and then the 3 subdirectory, [...etc...] If you
 encounter not found at any step, then the DDT entry doesn't already exist
 and you decide to create a new one.  But if you get all the way down to the
 C subdirectory and it contains a file named D,  then you have found a
 possible dedup hit - the checksum matched another block that's already on
 disk.  Now the DDT entry is stored in ARC just like anything else you read
 from disk.

DDT is metadata, not data, so it is more constrained than data entries in the
ARC.

 So the point is - Whenever you do a write, and the calculated DDT is not
 already in ARC/L2ARC, the system will actually perform several small reads
 looking for the DDT entry before it finally knows that the DDT entry
 actually exists.  So the penalty of performing a write, with dedup enabled,
 and the relevant DDT entry not already in ARC/L2ARC is a very large penalty.
 What originated as a single write quickly became several small reads plus a
 write, due to the fact the necessary DDT entry was not already available.
 
 The penalty of insufficient ram, in conjunction with 

Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-26 Thread Roy Sigurd Karlsbakk
- Original Message -
 On 04/25/11 11:55, Erik Trimble wrote:
  On 4/25/2011 8:20 AM, Edward Ned Harvey wrote:
   And one more comment: Based on what's below, it seems that the DDT
   gets stored on the cache device and also in RAM. Is that correct?
   What
   if you didn't have a cache device? Shouldn't it *always* be in
   ram?
   And doesn't the cache device get wiped every time you reboot? It
   seems
   to me like putting the DDT on the cache device would be harmful...
   Is
   that really how it is?
  Nope. The DDT is stored only in one place: cache device if present,
  /or/ RAM otherwise (technically, ARC, but that's in RAM). If a cache
  device is present, the DDT is stored there, BUT RAM also must store
  a
  basic lookup table for the DDT (yea, I know, a lookup table for a
  lookup table).
 No, that's not true. The DDT is just like any other ZFS metadata and
 can be split over the ARC,
 cache device (L2ARC) and the main pool devices. An infrequently
 referenced DDT block will get
 evicted from the ARC to the L2ARC then evicted from the L2ARC.
And with the default size of a zfs configuration's metadata being (RAM size -
1GB) / 4, without tuning, and with 128kB blocks all over, you'll need some
5-6GB+ per terabyte stored.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly.
It is an elementary imperative for all pedagogues to avoid excessive use of
idioms of foreign origin. In most cases adequate and relevant synonyms exist
in Norwegian.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-25 Thread Roy Sigurd Karlsbakk
 After modifications that I hope are corrections, I think the post
 should look like this:
 
 The rule-of-thumb is 270 bytes/DDT entry, and 200 bytes of ARC for
 every L2ARC entry.
 
 DDT doesn't count for this ARC space usage
 
 E.g.: I have 1TB of 4k blocks that are to be deduped, and it turns out
 that I have about a 5:1 dedup ratio. I'd also like to see how much ARC
 usage I eat up with a 160GB L2ARC.
 
 (1) How many entries are there in the DDT:
 
 1TB of 4k blocks means there are 268million blocks. However, at a 5:1
 dedup ratio, I'm only actually storing 20% of that, so I have about 54
 million blocks. Thus, I need a DDT of about 270bytes * 54 million =~
 14GB in size
 
 (2) My L2ARC is 160GB in size, but I'm using 14GB for the DDT. Thus, I
 have 146GB free for use as a data cache. 146GB / 4k =~ 38 million
 blocks can be stored in the
 remaining L2ARC space. However, 38 million files takes up: 200bytes *
 38 million =~ 7GB of space in ARC.
 
 Thus, I better spec my system with (whatever base RAM for basic OS and
 cache and application requirements) + 14G because of dedup + 7G
 because of L2ARC.

Thanks, but one more thing: add some tuning parameters, such as 'set
zfs:zfs_arc_meta_limit = somevalue' in /etc/system, to help zfs use more memory
for its metadata (like the DDT), as it won't use more than (RAM-1GB)/4 by
default.
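
For example (the kstat names below are the usual arcstats fields for the
metadata limit and usage, if your build exposes them; the 8GB value is purely
illustrative, so size it to your DDT):

kstat -p zfs::arcstats:arc_meta_limit   # current metadata limit, in bytes
kstat -p zfs::arcstats:arc_meta_used    # current metadata usage, in bytes

# then add something like this to /etc/system and reboot:
#   set zfs:zfs_arc_meta_limit = 0x200000000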

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly.
It is an elementary imperative for all pedagogues to avoid excessive use of
idioms of foreign origin. In most cases adequate and relevant synonyms exist
in Norwegian.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-25 Thread Erik Trimble

On 4/25/2011 8:20 AM, Edward Ned Harvey wrote:


There are a lot of conflicting references on the Internet, so I'd 
really like to solicit actual experts (ZFS developers or people who 
have physical evidence) to weigh in on this...


After searching around, the reference I found to be the most seemingly 
useful was Erik's post here:


http://opensolaris.org/jive/thread.jspa?threadID=131296

Unfortunately it looks like there's an arithmetic error (1TB of 4k 
blocks means 268million blocks, not 1 billion).  Also, IMHO it seems 
important make the distinction, #files != #blocks.  Due to the 
existence of larger files, there will sometimes be more than one block 
per file; and if I'm not mistaken, thanks to write aggregation, there 
will sometimes be more than one file per block.  YMMV.  Average block 
size could be anywhere between 1 byte and 128k assuming default 
recordsize.  (BTW, recordsize seems to be a zfs property, not a zpool 
property.  So how can you know or configure the blocksize for 
something like a zvol iscsi target?)


I said 2^30, which is roughly a billion.  But I should have
been more exact.  And the file != block difference is important to note.


zvols also take a Recordsize attribute. And, zvols tend to be sticklers 
about all blocks being /exactly/ the recordsize value, unlike 
filesystems, which use it as a *maximum* block size.


Min block size is 512 bytes.


(BTW, is there any way to get a measurement of number of blocks 
consumed per zpool?  Per vdev?  Per zfs filesystem?)  The calculations 
below are based on assumption of 4KB blocks adding up to a known total 
data consumption.  The actual thing that matters is the number of 
blocks consumed, so the conclusions drawn will vary enormously when 
people actually have average block sizes != 4KB.




you need to use zdb to see what the current block usage is for a 
filesystem. I'd have to look up the particular CLI usage for that, as I 
don't know what it is off the top of my head.


And one more comment:  Based on what's below, it seems that the DDT 
gets stored on the cache device and also in RAM.  Is that correct?  
What if you didn't have a cache device?  Shouldn't it *always* be in 
ram?  And doesn't the cache device get wiped every time you reboot?  
It seems to me like putting the DDT on the cache device would be 
harmful...  Is that really how it is?


Nope. The DDT is stored only in one place: cache device if present, /or/ 
RAM otherwise (technically, ARC, but that's in RAM).  If a cache device 
is present, the DDT is stored there, BUT RAM also must store a basic 
lookup table for the DDT (yea, I know, a lookup table for a lookup table).



My minor corrections here:

The rule-of-thumb is 270 bytes/DDT entry, and 200 bytes of ARC for every 
L2ARC entry, since the DDT is stored on the cache device.


the DDT itself doesn't consume any ARC space if stored in an L2ARC
cache.


E.g.: I have 1TB of 4k blocks that are to be deduped, and it turns out 
that I have about a 5:1 dedup ratio. I'd also like to see how much ARC 
usage I eat up with using a 160GB L2ARC to store my DDT on.


(1) How many entries are there in the DDT?

1TB of 4k blocks means there are 268million blocks.  However, at a 
5:1 dedup ratio, I'm only actually storing 20% of that, so I have about 
54 million blocks.  Thus, I need a DDT of about 270bytes * 54 million =~ 
14GB in size


(2) How much ARC space does this DDT take up?
The 54 million entries in my DDT take up about 200bytes * 54 
million =~ 10G of ARC space, so I need to have 10G of RAM dedicated just 
to storing the references to the DDT in the L2ARC.



(3) How much space do I have left on the L2ARC device, and how many 
blocks can that hold?
Well, I have 160GB - 14GB (DDT) = 146GB of cache space left on the 
device, which, assuming I'm still using 4k blocks, means I can cache 
about 37 million 4k blocks, or about 66% of my total data. This extra 
cache of blocks in the L2ARC would eat up 200 b * 37 million =~ 7.5GB of 
ARC entries.


Thus, for the aforementioned dedup scenario, I'd better spec it with 
(whatever base RAM for basic OS and ordinary ZFS cache and application 
requirements) at least a 14G L2ARC device for dedup + 10G more of RAM 
for the DDT L2ARC requirements + 1GB of RAM for every 20GB of additional 
space in the L2ARC cache beyond that used by the DDT.
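
The same worked example as a quick shell calculation, using the 270/200-byte
rule-of-thumb figures from this post (later messages in the thread measure
different structure sizes, so treat the output as an estimate):

DATA=$((1 << 40)); BS=4096; RATIO=5        # 1TB of 4k blocks, 5:1 dedup
DDT_ENTRY=270; ARC_REF=200                 # rule-of-thumb sizes used above
L2ARC_SIZE=$((160 * 1024 * 1024 * 1024))   # 160GB cache device

BLOCKS=$((DATA / BS / RATIO))              # ~54 million unique blocks
DDT=$((BLOCKS * DDT_ENTRY))                # ~14GB of DDT on the cache device
CACHE=$((L2ARC_SIZE - DDT))                # ~146GB left for cached data
ARC=$(( (BLOCKS + CACHE / BS) * ARC_REF )) # ARC refs for DDT + cached blocks
echo "DDT on L2ARC:       $((DDT / 1024 / 1024 / 1024)) GB"
echo "ARC for L2ARC refs: $((ARC / 1024 / 1024 / 1024)) GB"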




--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-25 Thread Neil Perrin

On 04/25/11 11:55, Erik Trimble wrote:

On 4/25/2011 8:20 AM, Edward Ned Harvey wrote:


And one more comment:  Based on what's below, it seems that the DDT 
gets stored on the cache device and also in RAM.  Is that correct?  
What if you didn't have a cache device?  Shouldn't it *always* be in 
ram?  And doesn't the cache device get wiped every time you reboot?  
It seems to me like putting the DDT on the cache device would be 
harmful...  Is that really how it is?


 

Nope. The DDT is stored only in one place: cache device if present, 
/or/ RAM otherwise (technically, ARC, but that's in RAM).  If a cache 
device is present, the DDT is stored there, BUT RAM also must store a 
basic lookup table for the DDT (yea, I know, a lookup table for a 
lookup table).


No, that's not true. The DDT is just like any other ZFS metadata and can 
be split over the ARC,
cache device (L2ARC) and the main pool devices. An infrequently 
referenced DDT block will get

evicted from the ARC to the L2ARC then evicted from the L2ARC.

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-25 Thread Freddie Cash
On Mon, Apr 25, 2011 at 10:55 AM, Erik Trimble erik.trim...@oracle.com wrote:
 Min block size is 512 bytes.

Technically, isn't the minimum block size 2^(ashift value)?  Thus, on
4 KB disks where the vdevs have an ashift=12, the minimum block size
will be 4 KB.
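
One way to check what ashift a pool's vdevs actually got ('tank' is a
placeholder, and this assumes zdb's config dump includes the per-vdev
ashift value):

zdb -C tank | grep ashift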

-- 
Freddie Cash
fjwc...@gmail.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-25 Thread Brandon High
On Mon, Apr 25, 2011 at 8:20 AM, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
 and 128k assuming default recordsize.  (BTW, recordsize seems to be a zfs
 property, not a zpool property.  So how can you know or configure the
 blocksize for something like a zvol iscsi target?)

zvols use the 'volblocksize' property, which defaults to 8k. A 1TB
zvol is therefore 2^27 blocks and would require ~ 34 GB for the ddt
(assuming that a ddt entry is 270 bytes).

The zfs man page for the property reads:

volblocksize=blocksize

 For volumes, specifies the block size of the volume. The
 blocksize  cannot  be  changed  once the volume has been
 written, so it should be set at  volume  creation  time.
 The default blocksize for volumes is 8 Kbytes. Any power
 of 2 from 512 bytes to 128 Kbytes is valid.

 This property can also be referred to by  its  shortened
 column name, volblock.
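
So for a zvol the block size is chosen up front, e.g. ('tank/vol1' and the
size are placeholders):

zfs create -V 100G -o volblocksize=4k tank/vol1
zfs get volblocksize tank/vol1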

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss