Re: Data De-duplication

Oliver Mattos Thu, 11 Dec 2008 01:59:45 -0800

> Neat.  Thanks much.  It'd be cool to output the results of each of your
> hashes to a database so you can get a feel for how many duplicate
> blocks there are cross-files as well.
> 
> I'd like to run this in a similar setup on all my VMware VMDK files and
> get an idea of how much space savings there would be across 20+ Windows
> 2003 VMDK files... probably *lots* of common blocks.
> 
> Ray
Hi,

Currently it DOES do cross-file block matching - that's why it takes sooo
long to run :-)

If you remove everything after the word "sort", it will produce verbose
output, which you could then load into an SQL database if you wanted.
You could relatively easily format the output for direct import into an
SQL database by modifying the line containing "dd" inside the first
while loop.  You can also remove the "sort" and the pipe before it to
get unsorted output - the advantage is that it takes less time.
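
For anyone who would rather go straight to a database, here is a rough
sketch of the same idea in Python instead of the shell script.  It is
not the script discussed above: the 4096-byte block size, the SHA-1
hash, the "blocks" table and the function names are all assumptions
made for illustration.

```python
import hashlib
import sqlite3

BLOCK_SIZE = 4096  # assumed block size; not taken from the original script


def index_blocks(db, paths, block_size=BLOCK_SIZE):
    """Hash every fixed-size block of each file and record it in SQLite."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS blocks "
        "(hash TEXT, path TEXT, byte_offset INTEGER)"
    )
    for path in paths:
        with open(path, "rb") as f:
            byte_offset = 0
            while True:
                block = f.read(block_size)
                if not block:
                    break
                digest = hashlib.sha1(block).hexdigest()
                db.execute(
                    "INSERT INTO blocks VALUES (?, ?, ?)",
                    (digest, path, byte_offset),
                )
                byte_offset += len(block)
    db.commit()


def duplicate_fraction(db):
    """Fraction of recorded blocks whose hash was already seen elsewhere."""
    total, distinct = db.execute(
        "SELECT COUNT(*), COUNT(DISTINCT hash) FROM blocks"
    ).fetchone()
    return 0.0 if total == 0 else (total - distinct) / total
```

Call index_blocks(sqlite3.connect("blocks.db"), list_of_image_paths) and
then duplicate_fraction() on the same connection.  Because every file
goes into one table, a block shared between two images counts as a
duplicate just like a block repeated within one image.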

I'm guessing that if you had the time to run this on multi-gigabyte disk
images you'd find that as much as 80% of the blocks are duplicated
between any two virtual machines of the same operating system.

That means if you have 20 Win 2k3 VMs and the first VM image of Windows
+ software is 2GB (after nulls are removed), the total size for 20 VMs
could be ~6GB (remembering there will be extra redundancy the more VMs
you add) - not a bad saving.
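
The arithmetic behind that guess can be checked with a tiny model.
This is only a back-of-the-envelope sketch: it assumes every image is
the same size and that each extra VM shares a fixed fraction of its
blocks with what is already stored, and the function names are made up
for illustration.  With a flat 80% it gives about 9.6GB; getting down
to ~6GB needs roughly 89% sharing on average, which is where the extra
redundancy from adding more VMs comes in.

```python
def estimate_total_gb(first_image_gb, vm_count, shared_fraction):
    """First image in full, plus only the unshared part of each later one."""
    unique_per_extra_vm = first_image_gb * (1.0 - shared_fraction)
    return first_image_gb + (vm_count - 1) * unique_per_extra_vm


def shared_fraction_for_total(first_image_gb, vm_count, total_gb):
    """Average shared fraction the later images need to reach total_gb."""
    unique_per_extra_vm = (total_gb - first_image_gb) / (vm_count - 1)
    return 1.0 - unique_per_extra_vm / first_image_gb


# 20 VMs, 2GB first image, flat 80% sharing
print(round(estimate_total_gb(2.0, 20, 0.8), 2))
# sharing needed for the ~6GB figure
print(round(shared_fraction_for_total(2.0, 20, 6.0), 3))
```

Either way it is a long way below the 40GB the images would take
without any de-duplication.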

Thanks
Oliver
