Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?
Pawel Jakub Dawidek wrote:
> On Mon, Jan 08, 2007 at 11:00:36AM -0600, [EMAIL PROTECTED] wrote:
> > I have been looking at the zfs source trying to get up to speed on the internals. One thing that interests me about the fs is what appears to be low-hanging fruit for block-squishing CAS (Content Addressable Storage). I think that in addition to lzjb compression, squishing blocks that contain the same data would buy a lot of space for administrators working in many common workflows.
> [...]
> I like the idea, but I'd prefer such an option to be per-pool, not per-filesystem. I found somewhere in the ZFS documentation that clones are nice to use for a large number of diskless stations. That's fine, but after every upgrade, more and more files are updated and fewer and fewer blocks are shared between clones. Having such functionality for the entire pool would be a nice optimization in this case. This doesn't have to be a per-pool option, actually, but per filesystem hierarchy, i.e. all file systems under tank/diskless/.

Which actually says it is per-filesystem and it is inherited, exactly how compression and the checksum algorithm are done today. You can change it on the clone if you wish to.

--
Darren J Moffat

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?
On 1/10/07, [EMAIL PROTECTED] wrote:
> Dick Davies [EMAIL PROTECTED] wrote on 01/10/2007 05:26:45 AM:
> > On 08/01/07, [EMAIL PROTECTED] wrote:
> > > I think that in addition to lzjb compression, squishing blocks that contain the same data would buy a lot of space for administrators working in many common workflows.
> >
> > This idea has occurred to me too - I think there are definite advantages to 'block re-use'. When you start talking about multiple similar zones, I suspect substantial space savings could be made - and if you can re-use that saved storage to provide additional redundancy, everyone would be happy.

My favorite uses come to mind (I have spent a fair amount of time wishing for this feature):

1) Zones that start out as ZFS clones will tend to diverge as the system is patched. This would allow them to re-converge as the patches roll through multiple zones.
2) Environments where each person starts with the same code base (hg pull http://hg.intevation.org/mirrors/opensolaris.org/onnv-gate/) and then builds it, producing substantially similar object files.
3) Disk-based backup systems (de-duplication is a buzzword here).

> That issue has already come up in the thread: SHA256 is 2^128 for random collisions, 2^80 for targeted ones. That is pretty darn good, but it would also make sense to perform an rsync-like secondary check on match using a dissimilar crypto hash. If we hit the very unlikely chance that 2 blocks match both sha256 and whatever other secondary hash, I think that block should be lost (act of god). =)

Reading the full block and doing a full comparison is very cheap (given the anticipated frequency) and means you don't have to explain that the file system has a 1-in-2^512 chance of silent data corruption. As slim a chance as that is, ZFS promises not to corrupt my data and to tell on others that do. ZFS cannot break that promise.
Mike

--
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?
On Mon, Jan 08, 2007 at 11:00:36AM -0600, [EMAIL PROTECTED] wrote:
> I have been looking at the zfs source trying to get up to speed on the internals. One thing that interests me about the fs is what appears to be low-hanging fruit for block-squishing CAS (Content Addressable Storage). I think that in addition to lzjb compression, squishing blocks that contain the same data would buy a lot of space for administrators working in many common workflows.
> [...]

I like the idea, but I'd prefer such an option to be per-pool, not per-filesystem. I found somewhere in the ZFS documentation that clones are nice to use for a large number of diskless stations. That's fine, but after every upgrade, more and more files are updated and fewer and fewer blocks are shared between clones. Having such functionality for the entire pool would be a nice optimization in this case. This doesn't have to be a per-pool option, actually, but per filesystem hierarchy, i.e. all file systems under tank/diskless/.

I'm not yet sure how you can build the list of hash-to-block mappings quickly at boot for large pools...

--
Pawel Jakub Dawidek            http://www.wheel.pl
[EMAIL PROTECTED]              http://www.FreeBSD.org
FreeBSD committer              Am I Evil? Yes, I Am!
Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?
Dick Davies [EMAIL PROTECTED] wrote on 01/10/2007 05:26:45 AM:
> On 08/01/07, [EMAIL PROTECTED] wrote:
> > I think that in addition to lzjb compression, squishing blocks that contain the same data would buy a lot of space for administrators working in many common workflows.
>
> This idea has occurred to me too - I think there are definite advantages to 'block re-use'. When you start talking about multiple similar zones, I suspect substantial space savings could be made - and if you can re-use that saved storage to provide additional redundancy, everyone would be happy.

Very true; even on normal fileserver usage I have historically found 15-30% file-level duplication. Added to the cheap snapshotting and the already existing compression, I think this is a big, big win.

> > Assumptions: SHA256 hash used (Fletcher2/4 have too many collisions, SHA256 is 2^128 if I remember correctly) [...]
>
> Unless I'm reading this wrong, this sounds a lot like Plan 9's 'Venti' architecture ( http://cm.bell-labs.com/sys/doc/venti.html ). But using a hash 'label' seems the wrong approach. ZFS is supposed to scale to terrifying levels, and the chances of a collision, however small, work against that. I wouldn't want to trade reliability for some extra space.

That issue has already come up in the thread: SHA256 is 2^128 for random collisions, 2^80 for targeted ones. That is pretty darn good, but it would also make sense to perform an rsync-like secondary check on match using a dissimilar crypto hash. If we hit the very unlikely chance that 2 blocks match both sha256 and whatever other secondary hash, I think that block should be lost (act of god). =) Even with this dual-check approach, the index (and the only hash stored) can still be just the sha256, as the chance of collision is close to nil in this context.
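[Aside: the collision figures above can be sanity-checked with the birthday bound, which says that among k random blocks hashed to n bits, the probability of any accidental collision is roughly k^2 / 2^(n+1). A back-of-the-envelope Python sketch (my own illustration, not from the thread; the pool size is a made-up example):]

```python
from math import log2

def birthday_collision_prob(num_blocks, hash_bits):
    """Approximate probability of at least one random collision among
    num_blocks values of a hash_bits-bit hash: k^2 / 2^(n+1)."""
    return num_blocks ** 2 / 2 ** (hash_bits + 1)

# Example: a pool with 2^53 blocks (roughly a zettabyte of 128K blocks)
# hashed with SHA-256 still has an astronomically small collision chance.
p = birthday_collision_prob(2 ** 53, 256)
print(log2(p))  # prints -151.0, i.e. p = 2^-151
```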
[zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?
I have been looking at the zfs source trying to get up to speed on the internals. One thing that interests me about the fs is what appears to be low-hanging fruit for block-squishing CAS (Content Addressable Storage). I think that in addition to lzjb compression, squishing blocks that contain the same data would buy a lot of space for administrators working in many common workflows. I am writing to see if I can get some feedback from people who know the code better than I do -- are there any gotchas in my logic?

Assumptions:
- SHA256 hash used (Fletcher2/4 have too many collisions; SHA256 is 2^128 if I remember correctly).
- The SHA256 hash is taken on the data portion of the block as it exists on disk; the metadata structure is hashed separately.
- In the current metadata structure, there is a reserved bit portion to be used in the future.

Description of change:

Creates: The filesystem goes through its normal process of writing a block and creating the checksum. Before the step where the metadata tree is pushed, the checksum is checked against a global checksum tree to see if there is any match. If a match exists, insert a metadata placeholder for the block that references the already existing block on disk, increment a number_of_links counter on the metadata blocks to keep track of the pointers pointing to this block, and free up the new block that was written and checksummed to be used in the future. Else, if no match, update the checksum tree with the new checksum and continue as normal.

Deletes: Normal process, except that the number_of_links count is lowered, and if it is non-zero the block is not freed. Clean up the checksum tree as needed.

What this requires:
- A new flag in metadata that can tag the block as a CAS block.
- A checksum tree that allows easy, fast lookup of checksum keys.
- A counter in the metadata or hash tree that tracks links back to blocks.
- Some additions to the userland apps to push the config/enable modes.

Does this seem feasible?
Are there any blocking points that I am missing or unaware of? I am just posting this for discussion; it seems very interesting to me.

-Wade
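[Aside: the create/delete paths above can be sketched as a toy model. The following Python is purely illustrative under my own assumptions -- a dict stands in for the global checksum tree, `number_of_links` is borrowed from the proposal's naming, and none of this reflects actual ZFS data structures:]

```python
import hashlib

class DedupStore:
    """Toy model of the proposed CAS create/delete paths (not real ZFS code)."""

    def __init__(self):
        self.blocks = {}           # block_id -> raw data (stand-in for on-disk blocks)
        self.checksum_tree = {}    # sha256 digest -> block_id
        self.number_of_links = {}  # block_id -> reference count
        self.next_id = 0

    def write_block(self, data):
        digest = hashlib.sha256(data).digest()
        block_id = self.checksum_tree.get(digest)
        if block_id is not None:
            # Match: point the new metadata at the existing block and bump
            # its link count; the freshly written copy would be freed.
            self.number_of_links[block_id] += 1
            return block_id
        # No match: record the checksum and keep the new block.
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = data
        self.checksum_tree[digest] = block_id
        self.number_of_links[block_id] = 1
        return block_id

    def delete_block(self, block_id):
        # Lower the link count; only free when it reaches zero.
        self.number_of_links[block_id] -= 1
        if self.number_of_links[block_id] == 0:
            data = self.blocks.pop(block_id)
            del self.checksum_tree[hashlib.sha256(data).digest()]
            del self.number_of_links[block_id]
```

With this model, two writes of identical data end up sharing one stored block with a link count of 2, and a delete only frees the block once the count drops to zero.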
Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?
[EMAIL PROTECTED] wrote:
> I have been looking at the zfs source trying to get up to speed on the internals. One thing that interests me about the fs is what appears to be low-hanging fruit for block-squishing CAS (Content Addressable Storage). I think that in addition to lzjb compression, squishing blocks that contain the same data would buy a lot of space for administrators working in many common workflows.
> [...]
> Does this seem feasible? Are there any blocking points that I am missing or unaware of? I am just posting this for discussion, it seems very interesting to me.

Note that you'd actually have to verify that the blocks were the same; you cannot count on the hash function. If you didn't do this, anyone discovering a collision could destroy the colliding blocks/files.

Val Henson wrote a paper on this topic; there's a copy here:
http://infohost.nmt.edu/~val/review/hash.pdf

- Bart

--
Bart Smaalders          Solaris Kernel Performance
[EMAIL PROTECTED]       http://blogs.sun.com/barts
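[Aside: the verify step Bart describes is easy to model -- index by hash, but on a hit compare the stored bytes before deduplicating. A minimal Python sketch with hypothetical names; it uses a deliberately collision-prone toy hash in place of sha256, since nobody knows how to collide sha256 and the verify path could not otherwise be exercised:]

```python
def weak_hash(data):
    """Toy, deliberately collision-prone 8-bit hash standing in for
    sha256, so the byte-verification path can actually be tested."""
    return sum(data) % 256

def dedup_candidate(new_data, index, hash_fn=weak_hash):
    """Return the block id of an existing identical block, or None.

    index maps hash value -> (block_id, stored_bytes). A hash match
    alone is never trusted: the stored block is compared byte-for-byte,
    so a collision cannot silently merge two different blocks.
    """
    entry = index.get(hash_fn(new_data))
    if entry is None:
        return None
    block_id, stored = entry
    return block_id if stored == new_data else None
```

For example, b"ab" and b"ba" collide under this toy hash, but the byte comparison keeps them from being deduplicated into one block.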
Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?
> Note that you'd actually have to verify that the blocks were the same; you cannot count on the hash function. If you didn't do this, anyone discovering a collision could destroy the colliding blocks/files.

Given that nobody knows how to find sha256 collisions, you'd of course need to test this code with a weaker hash algorithm. (It would almost be worth it to have the code panic in the event that a real sha256 collision was found.)

- Bill
Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?
> > Does this seem feasible? Are there any blocking points that I am missing or unaware of? I am just posting this for discussion, it seems very interesting to me.
>
> Note that you'd actually have to verify that the blocks were the same; you cannot count on the hash function. If you didn't do this, anyone discovering a collision could destroy the colliding blocks/files.
>
> Val Henson wrote a paper on this topic; there's a copy here:
> http://infohost.nmt.edu/~val/review/hash.pdf

Sure, that makes sense. I do not see why that would be much of a problem: if the sha256 hashes match, then do yet one more crypto hash of your choice to verify they are indeed the same blocks (fool me once, shame on me...); the hash key should still be able to be based on only the sha256 marker. If we do find a natural collision, then a special code path (and an email to the NSA =) could be in order.
Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?
[EMAIL PROTECTED] wrote:
> Sure, that makes sense. I do not see why that would be much of a problem: if the sha256 hashes match, then do yet one more crypto hash of your choice to verify they are indeed the same blocks (fool me once, shame on me...); the hash key should still be able to be based on only the sha256 marker. If we do find a natural collision, then a special code path (and an email to the NSA =) could be in order.

Is Honeycomb doing anything in this space?
Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?
Bill Sommerfeld [EMAIL PROTECTED] wrote on 01/08/2007 03:41:53 PM:
> > Note that you'd actually have to verify that the blocks were the same; you cannot count on the hash function. If you didn't do this, anyone discovering a collision could destroy the colliding blocks/files.
>
> Given that nobody knows how to find sha256 collisions, you'd of course need to test this code with a weaker hash algorithm. (It would almost be worth it to have the code panic in the event that a real sha256 collision was found.)
>
> - Bill

That reminds me, I had a few more questions about this.

1. If a filesystem was started with a Fletcher hash and later switched to sha256, is there a way to resilver the hashes that existed before the change over to sha256?
2. Is there any way to get zdb to dump a list of blocks and their associated hashes? (zdb seems to be lightly documented, and its source files require a little more familiarity with ZFS internals than I have grokked yet.)