Some years ago I was searching for that kind of functionality and found
an experimental ext3 patch that added so-called COW-links:
http://lwn.net/Articles/76616/
A later discussion on LWN (http://lwn.net/Articles/77972/) raised the
concern that an approach like COW-links would break POSIX semantics.
I am not very technical and don't know if it's feasible in btrfs.
I think most likely you'll have to run a userspace tool to find and
merge identical files based on checksums (which already sounds good to me).
The only thing we can ask the developers at the moment is if something
like that would be possible without changes to the on-disk format.
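Just to make the idea concrete, here is a rough sketch (in Python; every
name and path is hypothetical, and it handles whole files only, not
extents) of what such a tool could look like on btrfs, using
"cp --reflink=always" as the merge step:

  #!/usr/bin/env python3
  # Sketch of whole-file dedup: checksum files, verify candidates
  # byte-for-byte, then replace duplicates with reflinked copies so
  # btrfs shares the extents. Illustrative only: no locking, and not
  # safe against files being modified while the scan runs.
  import filecmp, hashlib, os, subprocess, sys

  def digest(path, bufsize=1 << 20):
      h = hashlib.sha256()
      with open(path, 'rb') as f:
          while chunk := f.read(bufsize):
              h.update(chunk)
      return h.hexdigest()

  groups = {}
  for root, _, names in os.walk(sys.argv[1]):
      for name in names:
          path = os.path.join(root, name)
          if os.path.islink(path) or not os.path.isfile(path):
              continue
          key = (os.path.getsize(path), digest(path))
          groups.setdefault(key, []).append(path)

  for paths in groups.values():
      keeper, *dups = paths
      for dup in dups:
          # A checksum match is not proof; compare bytes before merging.
          if filecmp.cmp(keeper, dup, shallow=False):
              subprocess.run(['cp', '--reflink=always', keeper, dup],
                             check=True)

A real tool would presumably work at extent granularity and need help
from the filesystem to merge safely, but the find-and-verify half really
is that simple.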
PS. Another great scenario is shared web/file hosting servers: tens of
thousands of websites with mostly the same tiny PHP Joomla files.
If you could combine compression + "content-based" COW-links + the FS
cache, that would really make btrfs FLY on hard disks, and the space
efficiency would make SSD devices viable for storage.
--
Fabio
David Brown wrote:
Hi,
I was wondering if there has been any thought or progress in
content-based storage for btrfs beyond the suggestion in the "Project
ideas" wiki page?
The basic idea, as I understand it, is that a longer data extent
checksum is used (long enough to make collisions unrealistic), and data
extents with the same checksum are merged. The result is that "cp foo
bar" has pretty much the same effect as "cp --reflink foo bar" - the two
copies share COW data extents, and as long as they remain the same they
share the disk space. But you can still access each file independently,
unlike with a traditional hard link.
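(As a rough check on "unrealistic" - my own back-of-envelope numbers,
not from the wiki: the birthday bound puts the probability of any
accidental collision among n random b-bit checksums at about
n^2 / 2^(b+1). For example:

  # Birthday bound: P(any collision) ~= n**2 / 2**(b + 1)
  n = 2 ** 32   # assume ~4 billion extents in the filesystem
  b = 256       # e.g. a SHA-256 extent checksum
  print(n ** 2 / 2 ** (b + 1))   # ~8e-59, i.e. never in practice

So even a paranoid checksum length removes the worry.)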
I can see at least three cases where this could be a big win - I'm
sure there are more.
Developers often have multiple copies of source code trees as
branches, snapshots, etc. For larger projects (I have multiple
"buildroot" trees for one project) this can take a lot of space.
Content-based storage would give the space efficiency of hard links
with the independence of straight copies. Using "cp --reflink" would
help for the initial snapshot or branch, of course, but it cannot merge
files that are modified independently and later become identical again.
On servers using lightweight virtual servers such as OpenVZ, you have
multiple "root" file systems each with their own copy of "/usr", etc.
With OpenVZ, all the virtual roots are part of the host's file system
(i.e., not hidden within virtual disks), so content-based storage
could merge these, making them very much more efficient. Because each
of these virtual roots can be updated independently, it is not
possible to use "cp --reflink" to keep them merged.
For backup systems, you will often have multiple copies of the same
files. A common scheme is to use rsync and "cp -al" to make
hard-linked (and therefore space-efficient) snapshots of the trees.
But sometimes these things get out of synchronisation - perhaps your
remote rsync dies halfway, and you end up with multiple independent
copies of the same files. Content-based storage can then re-merge
these files.
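For reference, that classic scheme is roughly the following (a sketch
with hypothetical paths; "daily.0" is the current snapshot):

  import subprocess
  # "cp -al" clones the latest snapshot as hard links; rsync then
  # recreates only the files that changed, which breaks their hard
  # links and leaves unchanged files shared between snapshots.
  subprocess.run(['cp', '-al', '/backup/daily.0', '/backup/daily.1'],
                 check=True)
  subprocess.run(['rsync', '-a', '--delete',
                  'remote:/data/', '/backup/daily.0/'], check=True)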
I would imagine that content-based storage will sometimes be a
performance win, sometimes a loss. It would be a win when merging
results in better use of the file system cache - OpenVZ virtual
serving would be an example where you would be using multiple copies
of the same file at the same time. For other uses, such as backups,
there would be no performance gain since you seldom (hopefully!) read
the backup files. But in that situation, speed is not a major issue.
Best regards,
David
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html