I see quite a few uses for this, and while the kernel-mode automatic
de-dup-on-write code might carry a performance cost, require disk
format changes, and be controversial, it sounds like the user-mode
utility could be implemented today.

It looks like a simple script could do the job: iterate through every
file in the filesystem, take an MD5 hash of every block of every file,
and whenever a duplicate block turns up, call an ioctl to replace it
with a reference to the existing copy.  Because it hashes at block
granularity rather than per file, it would also effectively compress
disk images, which tend to be full of identical (often all-zero)
blocks.
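
Something along these lines might do for a first test.  This is only a
rough Python sketch under some assumptions: the block size is a guess
at 4 KiB, and since no de-dup ioctl exists yet the call is left as a
stub comment, so the script is read-only and just totals how many
bytes could be reclaimed:

#!/usr/bin/env python
# Rough duplicate-block scanner.  Hashes every BLOCK_SIZE chunk of
# every regular file under the given root and totals how many bytes
# a block-level de-dup could reclaim.  Read-only: the ioctl that
# would actually merge blocks is hypothetical, so it's a stub here.
import os
import sys
import hashlib

BLOCK_SIZE = 4096   # assuming 4 KiB filesystem blocks

def scan(root):
    seen = {}             # digest -> (path, offset) of first copy
    duplicate_bytes = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                f = open(path, 'rb')
            except IOError:
                continue      # unreadable; skip it
            offset = 0
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                digest = hashlib.md5(block).digest()
                if digest in seen:
                    duplicate_bytes += len(block)
                    # A real tool would call the (hypothetical)
                    # de-dup ioctl here, pointing this block at
                    # the copy recorded in seen[digest].
                else:
                    # Keeping every digest in memory is fine for a
                    # test run but wouldn't scale to a whole disk.
                    seen[digest] = (path, offset)
                offset += len(block)
            f.close()
    return duplicate_bytes

if __name__ == '__main__':
    root = sys.argv[1] if len(sys.argv) > 1 else '.'
    saved = scan(root)
    print('duplicate data under %s: %d bytes (%.1f MiB)'
          % (root, saved, saved / 1048576.0))

Running it as e.g. "python dedup-scan.py /home" would give an upper
bound on what block-level de-dup could save at that block size.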

While not very efficient, it should work, and having something like
this in the toolkit would mean that as soon as btrfs gets stable
enough for everyday use, it would immediately outdo every other Linux
filesystem in space efficiency for some workloads.

In the long term, kernel-mode de-duplication would probably be good.
I'm willing to bet even the average user has, say, 1-2% of their data
duplicated somewhere on the disk: accidental copies instead of moves,
the same application installed to two different paths, two users who
each happen to have saved the same file in their home directories, and
so on.  So even the average user would benefit slightly.

I'm considering writing that script and running it against my ext3
disk, just to see how much space duplicated data is really wasting.

Thanks
Oliver

