Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?

2007-01-15 Thread Darren J Moffat

Pawel Jakub Dawidek wrote:

On Mon, Jan 08, 2007 at 11:00:36AM -0600, [EMAIL PROTECTED] wrote:

I have been looking at zfs source trying to get up to speed on the
internals.  One thing that interests me about the fs is what appears to be
a low hanging fruit for block squishing CAS (Content Addressable Storage).
I think that in addition to lzjb compression, squishing blocks that contain
the same data would buy a lot of space for administrators working in many
common workflows.

[...]

I like the idea, but I'd prefer such an option to be per-pool, not
per-filesystem.

I found somewhere in the ZFS documentation that clones are nice to use
for a large number of diskless stations.  That's fine, but after every
upgrade, more and more files are updated and fewer and fewer blocks are
shared between clones.  Having such functionality for the entire pool
would be a nice optimization in this case.  This doesn't have to be a
per-pool option, actually, but per-filesystem-hierarchy, i.e. all file
systems under tank/diskless/.


Which actually argues that it should be per-filesystem and inherited,
exactly how compression and the checksum algorithm are done today.  You
can change it on the clone if you wish.


--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?

2007-01-15 Thread Mike Gerdts

On 1/10/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

Dick Davies [EMAIL PROTECTED] wrote on 01/10/2007 05:26:45 AM:
 On 08/01/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
  I think that in addition to lzjb compression, squishing blocks that
  contain the same data would buy a lot of space for administrators
  working in many common workflows.

 This idea has occurred to me too - I think there are definite
 advantages to 'block re-use'.  When you start talking about multiple
 similar zones, I suspect substantial space savings could be made - and
 if you can re-use that saved storage to provide additional redundancy,
 everyone would be happy.


My favorite uses come to mind (I have spent a fair amount of time
wishing for this feature):

1) Zones that start out as ZFS clones will tend to diverge as the
system is patched.  This would allow them to re-converge as the patches
roll through multiple zones.

2) Environments where everyone starts with the same code base (hg
pull http://hg.intevation.org/mirrors/opensolaris.org/onnv-gate/) and
then builds it, producing substantially similar object files.

3) Disk-based backup systems (de-duplication is the buzzword here)


That issue has already come up in the thread.  SHA-256 offers roughly
2^128 work for brute-force (birthday) collisions and 2^256 for targeted
(preimage) attacks.  That is pretty darn good, but it would also make
sense to perform an rsync-like secondary check on a match using a
dissimilar crypto hash.  If we hit the very unlikely case where two
distinct blocks match both SHA-256 and whatever other secondary hash, I
think that block should be lost (act of god). =)


Reading the full block and doing a full comparison is very cheap
(given the anticipated frequency) and means you don't have to explain
that the file system has a 1 in 2^512 chance of silent data corruption.
As slim as that chance is, ZFS promises not to corrupt my data and to
tell on others that do.  ZFS cannot break that promise.
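A toy sketch of that verify-on-match approach (all names hypothetical;
in-memory dicts stand in for the pool and its checksum tree, not actual
ZFS code): a hash match only nominates a candidate block, which is read
back and byte-compared before the write is deduplicated.

```python
import hashlib

def dedup_write(data, store, index):
    """Write a block with verify-on-match dedup.

    store: addr -> bytes (stand-in for on-disk blocks)
    index: sha256 digest -> addr (stand-in for the checksum tree)
    """
    digest = hashlib.sha256(data).digest()
    addr = index.get(digest)
    if addr is not None and store[addr] == data:  # full byte comparison
        return addr            # true duplicate: share the existing block
    addr = len(store)          # hash miss (or a collision): write it out
    store[addr] = data
    index[digest] = addr
    return addr
```

The byte comparison is what removes the hash function from the
reliability argument: even a collision would just cost the space
saving, never the data.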

Mike

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?

2007-01-13 Thread Pawel Jakub Dawidek
On Mon, Jan 08, 2007 at 11:00:36AM -0600, [EMAIL PROTECTED] wrote:
 I have been looking at zfs source trying to get up to speed on the
 internals.  One thing that interests me about the fs is what appears to be
 a low hanging fruit for block squishing CAS (Content Addressable Storage).
 I think that in addition to lzjb compression, squishing blocks that contain
 the same data would buy a lot of space for administrators working in many
 common workflows.
[...]

I like the idea, but I'd prefer such an option to be per-pool, not
per-filesystem.

I found somewhere in the ZFS documentation that clones are nice to use
for a large number of diskless stations.  That's fine, but after every
upgrade, more and more files are updated and fewer and fewer blocks are
shared between clones.  Having such functionality for the entire pool
would be a nice optimization in this case.  This doesn't have to be a
per-pool option, actually, but per-filesystem-hierarchy, i.e. all file
systems under tank/diskless/.

I'm not yet sure how you could quickly build the hash-to-block mapping
for a large pool at boot, though...
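To make the concern concrete: the naive approach would rehash every
allocated block at import time, which is O(pool size) in I/O.  A
hypothetical sketch (a real implementation would presumably persist the
checksum tree on disk instead):

```python
import hashlib

def rebuild_index(read_block, allocated_addrs):
    """Rebuild digest -> addr mapping by rehashing every block.

    read_block(addr) -> bytes is a stand-in for reading a block off disk.
    This full scan is exactly the boot-time cost you'd want to avoid.
    """
    index = {}
    for addr in allocated_addrs:
        digest = hashlib.sha256(read_block(addr)).digest()
        index.setdefault(digest, addr)  # keep the first block per digest
    return index
```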

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?

2007-01-10 Thread Wade . Stuart

Dick Davies [EMAIL PROTECTED] wrote on 01/10/2007 05:26:45 AM:

 On 08/01/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

  I think that in addition to lzjb compression, squishing blocks that
  contain the same data would buy a lot of space for administrators
  working in many common workflows.

 This idea has occurred to me too - I think there are definite
 advantages to 'block re-use'.  When you start talking about multiple
 similar zones, I suspect substantial space savings could be made - and
 if you can re-use that saved storage to provide additional redundancy,
 everyone would be happy.

Very true.  Even on normal fileserver usage I have historically found
15 - 30% file-level duplication; added to cheap snapshotting and the
already existing compression, I think this is a big, big win.



  Assumptions:

  SHA256 hash used (Fletcher2/4 have too many collisions; SHA256 is
  2^128 if I remember correctly)
  SHA256 hash is taken on the data portion of the block as it exists
  on disk; the metadata structure is hashed separately.
  In the current metadata structure, there is a reserved bit portion
  to be used in the future.


  Description of change:
  Creates:
  The filesystem goes through its normal process of writing a block,
  and creating the checksum.
  Before the step where the metadata tree is pushed, the checksum is
  checked against a global checksum tree to see if there is any match.
  If a match exists, insert a metadata placeholder for the block that
  references the already existing block on disk, and increment a
  number_of_links counter on the metadata blocks to keep track of the
  pointers pointing to this block.
  Free up the new block that was written and checksummed to be used
  in the future.
  Else, if no match, update the checksum tree with the new checksum
  and continue as normal.

 Unless I'm reading this wrong, this sounds a lot like Plan 9's 'Venti'
 architecture ( http://cm.bell-labs.com/sys/doc/venti.html ).

 But using a hash 'label' seems the wrong approach.
 ZFS is supposed to scale to terrifying levels, and the chance of a
 collision, however small, works against that.  I wouldn't want to
 trade reliability for some extra space.


That issue has already come up in the thread.  SHA-256 offers roughly
2^128 work for brute-force (birthday) collisions and 2^256 for targeted
(preimage) attacks.  That is pretty darn good, but it would also make
sense to perform an rsync-like secondary check on a match using a
dissimilar crypto hash.  If we hit the very unlikely case where two
distinct blocks match both SHA-256 and whatever other secondary hash, I
think that block should be lost (act of god). =)

Even with this dual-check approach, the index (and the only hash
stored) can still be just the SHA-256, as the chance of a collision is
effectively nil in this context.
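For concreteness, the odds in question follow from the birthday bound:
for n random blocks and a b-bit hash, P(any collision) is approximately
n*(n-1)/2^(b+1).  A quick back-of-the-envelope check (the pool size
below is an illustrative assumption, not from the thread):

```python
def collision_prob(n_blocks: int, hash_bits: int) -> float:
    """Birthday-bound approximation of P(any two blocks collide)."""
    return n_blocks * (n_blocks - 1) / 2.0 ** (hash_bits + 1)

# Even an absurdly large pool -- 2^48 distinct 128 KB blocks is about
# 36 PB -- stays vanishingly far from a SHA-256 accidental collision:
p = collision_prob(2 ** 48, 256)  # on the order of 2^-161
```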



[zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?

2007-01-08 Thread Wade . Stuart

I have been looking at zfs source trying to get up to speed on the
internals.  One thing that interests me about the fs is what appears to be
a low hanging fruit for block squishing CAS (Content Addressable Storage).
I think that in addition to lzjb compression, squishing blocks that contain
the same data would buy a lot of space for administrators working in many
common workflows.

I am writing to see if I can get some feedback from people who know the
code better than I do -- are there any gotchas in my logic?

Assumptions:

SHA-256 hash used (Fletcher2/4 have too many collisions; SHA-256 gives
roughly 2^128 collision resistance, if I remember correctly).
The SHA-256 hash is taken on the data portion of the block as it exists
on disk; the metadata structure is hashed separately.
In the current metadata structure, there is a reserved bit portion to
be used in the future.


Description of change:
Creates:
The filesystem goes through its normal process of writing a block and
creating the checksum.
Before the step where the metadata tree is pushed, the checksum is
checked against a global checksum tree to see if there is any match.
If a match exists, insert a metadata placeholder for the block that
references the already existing block on disk, and increment a
number_of_links counter on the metadata blocks to keep track of the
pointers pointing to this block.  Then free up the new block that was
written and checksummed, to be used in the future.
Else, if there is no match, update the checksum tree with the new
checksum and continue as normal.


Deletes:
Normal process, except that the number_of_links count is decremented,
and if it is still non-zero the block is not freed.
Clean up the checksum tree as needed.

What this requires:
A new flag in the metadata that can tag a block as a CAS block.
A checksum tree that allows fast lookup of checksum keys.
A counter in the metadata or hash tree that tracks links back to
blocks.
Some additions to the userland apps to expose the config/enable modes.

Does this seem feasible?  Are there any blocking points that I am
missing or unaware of?  I am just posting this for discussion; it seems
very interesting to me.
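A minimal sketch of the create/delete paths described above, assuming
an in-memory dict stands in for the on-disk checksum tree and integer
addresses stand in for block pointers (a hypothetical model, not ZFS
code):

```python
import hashlib

class DedupTable:
    """Checksum tree stand-in: SHA-256 digest -> (block_addr, refcount)."""

    def __init__(self):
        self.by_hash = {}
        self.next_addr = 0

    def write_block(self, data: bytes) -> int:
        """Create path: return the address the block should point at."""
        digest = hashlib.sha256(data).digest()
        entry = self.by_hash.get(digest)
        if entry is not None:
            addr, refs = entry
            self.by_hash[digest] = (addr, refs + 1)  # bump number_of_links
            return addr          # reuse the already existing block on disk
        addr = self.next_addr    # no match: allocate a fresh block
        self.next_addr += 1
        self.by_hash[digest] = (addr, 1)
        return addr

    def delete_block(self, data: bytes) -> bool:
        """Delete path: True only when the on-disk block may be freed."""
        digest = hashlib.sha256(data).digest()
        addr, refs = self.by_hash[digest]
        if refs > 1:
            self.by_hash[digest] = (addr, refs - 1)
            return False         # still referenced elsewhere
        del self.by_hash[digest] # last reference: clean up checksum tree
        return True
```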

-Wade



Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?

2007-01-08 Thread Bart Smaalders

[EMAIL PROTECTED] wrote:
I have been looking at zfs source trying to get up to speed on the
internals.  One thing that interests me about the fs is what appears to be
a low hanging fruit for block squishing CAS (Content Addressable Storage).
I think that in addition to lzjb compression, squishing blocks that contain
the same data would buy a lot of space for administrators working in many
common workflows.

I am writing to see if I can get some feedback from people that know the
code better than I -- are there any gotchas in my logic?

Assumptions:

SHA-256 hash used (Fletcher2/4 have too many collisions; SHA-256 gives
roughly 2^128 collision resistance, if I remember correctly).
The SHA-256 hash is taken on the data portion of the block as it exists
on disk; the metadata structure is hashed separately.
In the current metadata structure, there is a reserved bit portion to
be used in the future.


Description of change:
Creates:
The filesystem goes through its normal process of writing a block and
creating the checksum.
Before the step where the metadata tree is pushed, the checksum is
checked against a global checksum tree to see if there is any match.
If a match exists, insert a metadata placeholder for the block that
references the already existing block on disk, and increment a
number_of_links counter on the metadata blocks to keep track of the
pointers pointing to this block.  Then free up the new block that was
written and checksummed, to be used in the future.
Else, if there is no match, update the checksum tree with the new
checksum and continue as normal.


Deletes:
Normal process, except that the number_of_links count is decremented,
and if it is still non-zero the block is not freed.
Clean up the checksum tree as needed.

What this requires:
A new flag in the metadata that can tag a block as a CAS block.
A checksum tree that allows fast lookup of checksum keys.
A counter in the metadata or hash tree that tracks links back to
blocks.
Some additions to the userland apps to expose the config/enable modes.

Does this seem feasible?  Are there any blocking points that I am
missing or unaware of?  I am just posting this for discussion; it seems
very interesting to me.



Note that you'd actually have to verify that the blocks were the same;
you cannot count on the hash function.  If you didn't do this, anyone
discovering a collision could destroy the colliding blocks/files.
Val Henson wrote a paper on this topic; there's a copy here:

http://infohost.nmt.edu/~val/review/hash.pdf

- Bart

Bart Smaalders  Solaris Kernel Performance
[EMAIL PROTECTED]   http://blogs.sun.com/barts


Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?

2007-01-08 Thread Bill Sommerfeld
 Note that you'd actually have to verify that the blocks were the same;
 you cannot count on the hash function.  If you didn't do this, anyone
 discovering a collision could destroy the colliding blocks/files.

Given that nobody knows how to find sha256 collisions, you'd of course
need to test this code with a weaker hash algorithm.

(It would almost be worth it to have the code panic in the event that a
real sha256 collision was found)
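A sketch of that testing idea, assuming a deliberately truncated hash
so collisions actually occur and the full byte comparison has to do its
job (hypothetical harness; 8 bits here just forces collisions quickly):

```python
import hashlib

def weak_hash(data: bytes) -> bytes:
    """Deliberately weak 8-bit hash: collisions are guaranteed soon."""
    return hashlib.sha256(data).digest()[:1]

def dedup(blocks):
    """Dedup with a byte-compare fallback on every hash match."""
    index = {}   # weak digest -> list of distinct block contents seen
    stored = []  # stand-in for blocks actually written to disk
    for data in blocks:
        candidates = index.setdefault(weak_hash(data), [])
        if not any(c == data for c in candidates):  # full compare on match
            candidates.append(data)
            stored.append(data)
    return stored

# 1000 distinct 2-byte blocks through an 8-bit hash guarantees many
# collisions, yet every distinct block must survive the dedup pass.
blocks = [i.to_bytes(2, "big") for i in range(1000)]
```

If the byte-compare branch were broken, distinct colliding blocks would
silently merge, which is exactly the failure mode this test exposes.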

- Bill



Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?

2007-01-08 Thread Wade . Stuart
  Does this seem feasible?  Are there any blocking points that I am
  missing or unaware of?   I am just posting this for discussion, it
  seems very interesting to me.

 Note that you'd actually have to verify that the blocks were the same;
 you cannot count on the hash function.  If you didn't do this, anyone
 discovering a collision could destroy the colliding blocks/files.
 Val Henson wrote a paper on this topic; there's a copy here:

Sure, that makes sense.  I do not see why that would be much of a
problem: if the SHA-256 hashes match, then do one more crypto hash of
your choice to verify they are indeed the same blocks (fool me once,
shame on me...); the hash key should still be able to be based only on
the SHA-256 marker.  If we do find a natural collision, then a special
code path (and an email to the NSA =) could be in order.

 http://infohost.nmt.edu/~val/review/hash.pdf

 - Bart

 Bart Smaalders Solaris Kernel Performance
 [EMAIL PROTECTED]  http://blogs.sun.com/barts



Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?

2007-01-08 Thread Torrey McMahon

[EMAIL PROTECTED] wrote:

  Does this seem feasible?  Are there any blocking points that I am
  missing or unaware of?   I am just posting this for discussion, it
  seems very interesting to me.

 Note that you'd actually have to verify that the blocks were the same;
 you cannot count on the hash function.  If you didn't do this, anyone
 discovering a collision could destroy the colliding blocks/files.
 Val Henson wrote a paper on this topic; there's a copy here:

 Sure, that makes sense.  I do not see why that would be much of a
 problem: if the SHA-256 hashes match, then do one more crypto hash of
 your choice to verify they are indeed the same blocks (fool me once,
 shame on me...); the hash key should still be able to be based only on
 the SHA-256 marker.  If we do find a natural collision, then a special
 code path (and an email to the NSA =) could be in order.

Is Honeycomb doing anything in this space?



Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?

2007-01-08 Thread Wade . Stuart

Bill Sommerfeld [EMAIL PROTECTED] wrote on 01/08/2007 03:41:53 PM:

  Note that you'd actually have to verify that the blocks were the same;
  you cannot count on the hash function.  If you didn't do this, anyone
  discovering a collision could destroy the colliding blocks/files.

 Given that nobody knows how to find sha256 collisions, you'd of course
 need to test this code with a weaker hash algorithm.

 (It would almost be worth it to have the code panic in the event that a
 real sha256 collision was found)

- Bill


That reminds me, I had a few more questions about this.

1) If a filesystem was started with a Fletcher hash and later switched
to sha256, is there a way to resilver the hashes that existed before
the change so they become sha256?

2) Also, is there any way to get zdb to dump a list of blocks and their
associated hashes?  (zdb seems to be lightly documented, and its source
files require a little more familiarity with ZFS internals than I have
grokked yet.)


