Content based storage

David Brown Tue, 16 Mar 2010 02:30:27 -0700

Hi,

I was wondering if there has been any thought or progress in content-based storage for btrfs beyond the suggestion in the "Project ideas" wiki page?

The basic idea, as I understand it, is that a longer data extent checksum is used (long enough to make collisions unrealistic), and data extents with the same checksum are merged. The result is that "cp foo bar" will have pretty much the same effect as "cp --reflink foo bar" - the two copies will share COW data extents, and as long as they remain the same, they will share the disk space. But you can still access each file independently, unlike with a traditional hard link.
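To make the idea concrete, here is a rough sketch in Python of how I picture it. This is only a toy in-memory model, not how btrfs does or would do it: the ContentStore class, the fixed 4096-byte extent size and the choice of SHA-256 are all my own assumptions for illustration.

```python
import hashlib

EXTENT_SIZE = 4096  # hypothetical fixed extent size, for illustration only


class ContentStore:
    """Toy model: extents are stored once, keyed by their SHA-256 digest."""

    def __init__(self):
        self.extents = {}   # digest -> extent data
        self.refcount = {}  # digest -> number of references from files
        self.files = {}     # file name -> list of extent digests

    def _put(self, data):
        # Store the extent only if this content has not been seen before.
        digest = hashlib.sha256(data).digest()
        if digest not in self.extents:
            self.extents[digest] = data
            self.refcount[digest] = 0
        self.refcount[digest] += 1
        return digest

    def _drop(self, digest):
        # Release one reference; free the extent when nobody uses it.
        self.refcount[digest] -= 1
        if self.refcount[digest] == 0:
            del self.refcount[digest]
            del self.extents[digest]

    def write_file(self, name, data):
        # Reference the new extents first, then release the old ones,
        # so unchanged extents are never freed and re-added.
        new = [self._put(data[i:i + EXTENT_SIZE])
               for i in range(0, len(data), EXTENT_SIZE)]
        for digest in self.files.get(name, []):
            self._drop(digest)
        self.files[name] = new

    def read_file(self, name):
        return b"".join(self.extents[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(e) for e in self.extents.values())
```

Writing the same data under two names stores it once, and rewriting one extent of one file leaves the other file untouched, which is the copy-on-write behaviour described above. A real implementation would also have to decide when the checksumming happens (at write time or in a background pass), which this sketch ignores.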

I can see at least three cases where this could be a big win - I'm sure there are more.

Developers often have multiple copies of source code trees as branches, snapshots, etc. For larger projects (I have multiple "buildroot" trees for one project) this can take a lot of space. Content-based storage would give the space efficiency of hard links with the independence of straight copies. Using "cp --reflink" would help for the initial snapshot or branch, of course, but it could not help after the copy.

On servers using lightweight virtual servers such as OpenVZ, you have multiple "root" file systems each with their own copy of "/usr", etc. With OpenVZ, all the virtual roots are part of the host's file system (i.e., not hidden within virtual disks), so content-based storage could merge these, making them very much more efficient. Because each of these virtual roots can be updated independently, it is not possible to use "cp --reflink" to keep them merged.

For backup systems, you will often have multiple copies of the same files. A common scheme is to use rsync and "cp -al" to make hard-linked (and therefore space-efficient) snapshots of the trees. But sometimes these things get out of synchronisation - perhaps your remote rsync dies halfway, and you end up with multiple independent copies of the same files. Content-based storage can then re-merge these files.
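As an illustration of the backup case, the following Python sketch finds files in a tree that have identical contents but are stored as independent copies. It works on whole files from user space rather than on extents inside the file system, and the function names are my own, so treat it as a way of seeing how much there is to re-merge rather than as the mechanism itself.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicate_copies(root):
    """Map digest -> paths for contents stored more than once under root.

    Paths that are hard links to an already counted inode are skipped,
    since they already share their disk space.
    """
    by_digest = defaultdict(list)
    seen_inodes = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            st = os.stat(path)
            key = (st.st_dev, st.st_ino)
            if key in seen_inodes:
                continue
            seen_inodes.add(key)
            by_digest[file_digest(path)].append(path)
    return {d: sorted(p) for d, p in by_digest.items() if len(p) > 1}
```

Each group in the result is a set of files that content-based storage could share on disk while still keeping them independent.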

I would imagine that content-based storage will sometimes be a performance win, sometimes a loss. It would be a win when merging results in better use of the file system cache - OpenVZ virtual serving would be an example where you would be using multiple copies of the same file at the same time. For other uses, such as backups, there would be no performance gain since you seldom (hopefully!) read the backup files. But in that situation, speed is not a major issue.



mvh.,

David

