Tomasz Chmielewski wrote:
I have been thinking a lot about de-duplication for a backup application
I am writing. I wrote a little script to figure out how much it would
save me. For my laptop home directory, about 100 GiB of data, it was a
couple of percent, depending a bit on the size of the chunks. With 4 KiB
chunks, I would save about two gigabytes. (That's assuming no MD5 hash
collisions.) I don't have VM images, but I do have a fair bit of saved
e-mail. So, for backups, I concluded it was worth it to provide an
option to do this. I have no opinion on whether it is worthwhile to do
in btrfs.
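
For anyone who wants to repeat that kind of measurement, here is a minimal sketch of such an estimate. It is not the script mentioned above; it assumes fixed-size 4 KiB chunks, MD5 as the chunk hash, and no hash collisions. The function name estimate_dedup_savings and its parameters are made up for illustration.

```python
import hashlib
import os

CHUNK_SIZE = 4096  # 4 KiB chunks, as in the estimate above (assumed fixed-size)


def estimate_dedup_savings(root, chunk_size=CHUNK_SIZE):
    """Return (total_bytes, duplicate_bytes) for regular files under root.

    A chunk counts as a duplicate if a chunk with the same MD5 digest
    was already seen; hash collisions are assumed not to happen.
    """
    seen = set()
    total = 0
    duplicate = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as f:
                    while True:
                        chunk = f.read(chunk_size)
                        if not chunk:
                            break
                        total += len(chunk)
                        digest = hashlib.md5(chunk).digest()
                        if digest in seen:
                            duplicate += len(chunk)
                        else:
                            seen.add(digest)
            except OSError:
                continue  # unreadable file: leave it out of the estimate
    return total, duplicate
```

Calling estimate_dedup_savings on a directory returns the bytes scanned and the bytes that block-level deduplication could, at best, avoid storing. The set of digests grows with the amount of unique data, so a run over a very large tree needs a correspondingly large amount of memory.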

Online deduplication is very useful for backups of big, multi-gigabyte files which change constantly. Some mail servers store their data this way, as do some MUAs; databases also commonly pack everything into big files which tend to change here and there almost all the time.

Multi-gigabyte files in which only a few megabytes have changed can't be hardlinked. Simple maths shows that even compressing several such nearly identical files takes more space than deduplication, which costs only a few extra megabytes per file (because everything else is deduplicated).

And I don't even want to think about the IO needed to dedup multi-terabyte storage offline (1 TB disks and bigger are becoming standard nowadays), e.g. daily, especially when the storage is already heavily loaded in IO terms.


Now, one popular tool which can deal with small changes in files is rsync. It can be used to copy files over the network - so that if you want to copy/update a multi-gigabyte file which only has a few changes, rsync would need to transfer just a few megabytes.

On disk however, rsync creates a "temporary copy" of the original file, where it packs unchanged contents together with any changes made. For example, while it copies/updates a file, we will have:

original_file.bin
.temporary_random_name

Later, original_file.bin is removed and .temporary_random_name is renamed to original_file.bin. Any deduplication we had so far is lost, and the IO has to start all over again.

You can tell rsync to either modify the file in place (--inplace) or to put the temp file somewhere else (--temp-dir=DIR).
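
The difference between the two update strategies can be sketched roughly as follows. This is only a model of the behaviour described in this thread, not rsync's actual implementation; the function names update_via_temp_copy and update_in_place are made up for illustration, and the temp-copy variant reads the whole file into memory, which is fine for a demo but not for multi-gigabyte files.

```python
import os
import tempfile


def update_via_temp_copy(path, offset, data):
    """Write a full new copy with the change applied, then rename it over
    the original. The file ends up as a new inode with freshly written data."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(prefix=".", dir=directory)
    with os.fdopen(fd, "wb") as tmp, open(path, "rb") as src:
        content = bytearray(src.read())  # demo only: whole file in memory
        content[offset:offset + len(data)] = data
        tmp.write(content)
    os.replace(tmp_path, path)  # atomically replaces the original file


def update_in_place(path, offset, data):
    """Overwrite only the changed bytes; the file keeps its inode and the
    untouched parts of the file are not rewritten."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)
```

With the in-place variant only the changed region is written, so whatever block sharing the filesystem maintains for the rest of the file is not disturbed by the update itself. The price is that the file is briefly in an inconsistent state while it is being modified.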

Gordan