On Tue, Jul 22, 2008 at 11:19 AM, <[EMAIL PROTECTED]> wrote:

> [EMAIL PROTECTED] wrote on 07/22/2008 09:58:53 AM:
>
> > To do dedup properly, it seems like there would have to be some overly
> > complicated methodology for a sort of delayed dedup of the data. For
> > speed, you'd want your writes to go straight into the cache and get
> > flushed out as quickly as possible, keeping everything as ACID as
> > possible. Then, a dedup scrubber would take what was written and do the
> > voodoo magic of checksumming the new data, scanning the tree to see if
> > there are any matches, locking the duplicates, running the usage
> > counters up or down for that block of data, swapping out inodes, and
> > marking the duplicate data as free space.
>
> I agree, but what you are describing is file-based dedup. ZFS already has
> the groundwork for dedup in the system (block-level checksumming and
> pointers).
>
> > It's a lofty goal, but one that is doable. I guess this is only
> > necessary if deduplication is done at the file level. If done at the
> > block level, it could possibly be done on the fly, what with the
> > already-implemented checksumming at the block level,
>
> Exactly -- that is why it is attractive for ZFS: so much of the groundwork
> is already done, and needed for the fs/pool anyway.
>
> > but then your reads will suffer because pieces of files can potentially
> > be spread all over hell and half of Georgia on the zdevs.
>
> I don't know that you can make this statement without some study of an
> actual implementation on real-world data -- and then, because it is block
> based, you would see varying degrees of this dedup-induced fragmentation
> depending on the data and usage.
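[Aside: to make the block-level idea above concrete, here is the sort of
bookkeeping I picture. It's only a back-of-the-envelope Python sketch --
every name in it is invented, and it says nothing about how a real ZFS
implementation would be laid out. The point is just that the checksum ZFS
already keeps for every block becomes the lookup key, and a duplicate
write only bumps a reference count instead of allocating new space.

# Sketch only: invented names, not ZFS code.
import hashlib

class DedupTable:
    """Maps a block checksum to [block address, reference count]."""

    def __init__(self):
        self.entries = {}      # checksum -> [address, refcount]
        self.next_addr = 0     # stand-in for a real allocator

    def write_block(self, data):
        """Store a block, or just add a reference if we already have it."""
        key = hashlib.sha256(data).digest()  # the per-block checksum
        if key in self.entries:
            self.entries[key][1] += 1        # duplicate: bump the count, write nothing
            return self.entries[key][0]
        addr = self.next_addr                # new data: allocate and remember it
        self.next_addr += 1
        self.entries[key] = [addr, 1]
        return addr

    def free_block(self, data):
        """Drop one reference; space is only truly freed on the last one."""
        key = hashlib.sha256(data).digest()
        entry = self.entries[key]
        entry[1] -= 1
        if entry[1] == 0:
            del self.entries[key]            # last reference gone: block is free space

ddt = DedupTable()
a = ddt.write_block(b"same payload")
b = ddt.write_block(b"same payload")         # second copy costs no new space
assert a == b

The delayed "dedup scrubber" variant would do the same lookup after the
fact, walking blocks already on disk instead of intercepting the write
path.]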
As for the read-fragmentation point: it's just a NonScientificWAG. I agree
that most of the duplicated blocks will in most cases be part of identical
files anyway, and thus lined up exactly as you'd want them. I was just
free thinking and typing.

> For instance, I would imagine that in many scenarios much of the deduped
> data blocks would belong to the same or very similar files. In that case
> the blocks were laid out as well as they could be on the first write, so
> the deduped blocks would point to a pretty sequential line of blocks. Now,
> some files may have duplicate headers or similar portions of data -- these
> may cause you to jump around the disk, but I do not know how much this
> would be hit or how much it would impact real-world usage.
>
> > Deduplication is going to require the judicious application of
> > hallucinogens and man-hours. I expect that someone is up to the task.
>
> I would prefer the coder(s) not be seeing "pink elephants" while writing
> this, but yes, it can and will be done. It will (I believe) be easier
> after the grow/shrink/evac code paths are in place, though. Also, the
> grow/shrink/evac path allows (if it is done right) for other cool things,
> like a base on which to build a roaming defrag that takes snaps, clones,
> live data and the like into account. I know that some feel the
> grow/shrink/evac code is more important for home users, but I think it is
> super important for most of these additional features.

The elephants are just there to keep the coders company. There are tons of
benefits to dedup, for both home and non-home users, and I'm happy that
it's going to be done. I expect the first complaints will come from people
who don't understand it, when their df and du numbers look different from
their zpool status ones. Perhaps df/du will just have to be faked out for
those folks, or we can apply the same hallucinogens to them instead.
(There's a toy example of that mismatch at the bottom of this mail.)

> -Wade
>
> On Tue, Jul 22, 2008 at 10:39 AM, <[EMAIL PROTECTED]> wrote:
> > [EMAIL PROTECTED] wrote on 07/22/2008 08:05:01 AM:
> >
> > > > Hi All
> > > >
> > > > Is there any hope for deduplication on ZFS?
> > > >
> > > > Mertol Ozyoney
> > > > Storage Practice - Sales Manager
> > > > Sun Microsystems
> > > > Email [EMAIL PROTECTED]
> > >
> > > There is always hope.
> > >
> > > Seriously though, looking at
> > > http://en.wikipedia.org/wiki/Comparison_of_revision_control_software
> > > there are a lot of choices for how we could implement this.
> > >
> > > SVN/K, Mercurial and Sun Teamware all come to mind. Simply ;) merge
> > > one of those with ZFS.
> > >
> > > It _could_ be as simple (with SVN as an example) as using directory
> > > listings to produce files which were then 'diffed'. You could then
> > > view the diffs as though they were changes made to lines of source
> > > code.
> > >
> > > Just add a "tree" subroutine to allow you to grab all the diffs that
> > > reference changes to file 'xyz' and you would have easy access to all
> > > the changes to a particular file (or directory).
> > >
> > > With the speed-optimized ability to use ZFS snapshots with the "tree"
> > > subroutine to roll back a single file (or directory), you could
> > > undo/redo your way through the filesystem.
> >
> > dedup is not revision control; you seem to completely misunderstand the
> > problem.
> >
> > > Using an LKCD
> > > (http://www.faqs.org/docs/Linux-HOWTO/Linux-Crash-HOWTO.html) you
> > > could "sit out" on the play and watch from the sidelines -- returning
> > > to the OS when you thought you were 'safe' (and if not, jumping back
> > > out).
> >
> > Now it seems you have veered even further off course. What are you
> > implying the LKCD has to do with zfs, solaris, or dedup, let alone
> > revision control software?
> >
> > -Wade
> >
> > --
> > chris -at- microcozm -dot- net
> > === Si Hoc Legere Scis Nimium Eruditionis Habes
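P.S. Here is the df/du-versus-pool mismatch I mentioned above, as a toy
calculation rather than real command output (the numbers are made up
purely for illustration, in the same spirit as the sketch earlier in this
mail):

# Made-up numbers: ten home dirs each holding a copy of the same 1 GiB file.
GIB = 1024 ** 3
copies = 10
file_size = 1 * GIB

logical_bytes = copies * file_size  # roughly what du across the homes adds up to
unique_bytes = 1 * file_size        # what the pool stores once the blocks dedup

print("du would add up to :", logical_bytes // GIB, "GiB")  # 10 GiB
print("pool really holds  :", unique_bytes // GIB, "GiB")   # 1 GiB

The per-filesystem tools add up what every file references, while the pool
only stores each unique block once, so the two views drift apart as soon
as dedup lands.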
--
chris -at- microcozm -dot- net
=== Si Hoc Legere Scis Nimium Eruditionis Habes

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss