Peter A wrote:
On Thursday, January 06, 2011 09:00:47 am you wrote:
Peter A wrote:
I'm saying in a filesystem it doesn't matter - if you bundle everything
into a backup stream, it does. Think of tar. 512 byte alignment. I tar
up a directory with 8TB total size. No big deal. Now I create a new,
empty file in this dir with a name that just happens to be the first in
the dir. This adds 512 bytes close to the beginning of the tar file the
second time I run tar. Now the remainder of the file is all offset by
512 bytes and, if you do dedupe on fs-block-sized chunks larger than
512 bytes, not a single byte will be de-duped.
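To make the offset effect concrete, here's a quick toy sketch (purely illustrative Python, nothing to do with the actual btrfs or ZFS code): hash the same megabyte of data in fixed 4 KiB blocks, once as-is and once behind a single extra 512 byte header, and compare the block hashes.

import hashlib
import os

BLOCK = 4096  # hypothetical fixed dedupe block size

def block_hashes(data):
    # hash every fixed-size block, the way a block-level dedupe engine would
    return {hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)}

payload = os.urandom(1024 * 1024)       # stand-in for the archived file data
run1 = payload                          # first tar run
run2 = b"\0" * 512 + payload            # second run: one new 512 byte header up front

shared = block_hashes(run1) & block_hashes(run2)
print("blocks shared between the two runs:", len(shared))   # almost certainly 0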
OK, I get what you mean now. And I don't think this is something that
should be solved in the file system.
<snip>
Whether that is a worthwhile thing to do for poorly designed backup
solutions is debatable, but I'm not convinced about the general use-case.
It'd be very expensive and complicated for seemingly very limited benefit.
Glad I finally explained myself properly... Unfortunately I disagree with you on the rest. If you take that logic, then I could claim dedupe is nothing a file system should handle - after all, it's the user's poorly designed applications that store multiple copies of data. Why should the fs take care of that?

There is merit in that point. Some applications do in fact do their own deduplication, as mentioned previously on this thread.

The problem doesn't just affect backups. It affects everything where you have large data files that are not forced to align with filesystem blocks. In addition to the case I mentioned above, this hurts dedupe effectiveness in pretty much the same way for:
* Database dumps
* Video editing
* Files backing iSCSI volumes
* VM images (fs blocks inside the VM rarely align with fs blocks in the backing storage)
Our VM environment is backed with a 7410 and we get only about 10% dedupe. Copying the same images to a DataDomain results in a 60% reduction in space used.

I'd be interested to hear about the relative write performance with variable block sizes.

I also have to argue the point that these usages are "poorly designed". "Poorly designed" can only be judged against technologies that existed or were talked about at the time the design was made. Tar and such have been around for a long time, way before anyone even thought of dedupe. In addition, until there is a commonly accepted/standard API to query the block size so apps can generate files appropriately laid out for the backing filesystem, what is the application supposed to do?
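For reference, about the only thing an application can portably query today is the preferred I/O block size via stat()/statvfs() - which is an I/O hint, not the dedupe chunk size or alignment, so it doesn't really close the gap. A small sketch on a POSIX system:

import os

st = os.stat(".")
print("st_blksize (preferred I/O size):", st.st_blksize)   # an I/O hint, not dedupe geometry

vfs = os.statvfs(".")
print("f_bsize (filesystem block size):", vfs.f_bsize)
print("f_frsize (fragment size):", vfs.f_frsize)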

Indeed, this goes philosophically in the direction in which storage vendors have been (unsuccessfully) trying to shift the industry for decades - object-based storage. But, sadly, it hasn't happened (yet?).

If anything, I would actually argue the opposite, that fixed block dedupe is a poor design:
* The problem was known at the time the design was made
* No alternative can be offered, as tar, netbackup, video editing, ... have been around for a long time and are unlikely to change in the near future
* There is no standard API to query the alignment parameters (and even that would not be great, since copying a file aligned for 8k to a 16k-aligned filesystem would potentially cause the same issue again)

Also, from the human perspective, it's hard to make end users understand your point of view. I promote the 7000 series of storage and I know how hard it is to explain the dedupe behavior there. They see that DataDomain does it, and does it well. So why can't solution xyz do it just as well?

I'd be interested to see the evidence for the "variable length" argument. I have a sneaky suspicion that it actually falls back to 512 byte blocks, which are much more likely to align, when more sensibly sized blocks fail. The downside is that you don't really want to store a 32 byte hash key with every 512 bytes of data, so you could peel 512 byte blocks off the front in the hope that a bigger block that follows will match.

Thinking about it, this might actually not be too expensive to do. If the 4KB block doesn't match, check 512 byte sub-blocks and try peeling them off to make the next 4KB block line up. If that doesn't produce a match, store the mismatch as a full 4KB block and resume. If you do find a match, save the peeled 512 byte blocks separately and dedupe the realigned 4KB block.
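Something like the following rough sketch, assuming hypothetical 4 KiB / 512 byte sizes and a plain in-memory hash table standing in for the dedupe store (an illustration of the strategy only, not real filesystem code):

import hashlib
import os

BLOCK = 4096   # hypothetical dedupe block size
SUB = 512      # hypothetical peel granularity

def sha(b):
    return hashlib.sha256(b).hexdigest()

def dedupe_stream(data, store):
    # store: dict of hash -> 4 KiB block; returns bytes of new data written
    written = 0
    pos = 0
    while pos < len(data):
        block = data[pos:pos + BLOCK]
        if len(block) < BLOCK:                       # short tail: just store it
            written += len(block)
            break
        if sha(block) in store:                      # whole 4 KiB block dedupes
            pos += BLOCK
            continue
        # no match: try peeling 512 byte pieces off the front until a
        # realigned 4 KiB block is found in the store
        for peel in range(SUB, BLOCK, SUB):
            candidate = data[pos + peel:pos + peel + BLOCK]
            if len(candidate) == BLOCK and sha(candidate) in store:
                written += peel                      # the peeled prefix is stored as-is
                pos += peel                          # next iteration dedupes 'candidate'
                break
        else:                                        # no realignment found: keep the block
            store[sha(block)] = block
            written += BLOCK
            pos += BLOCK
    return written

payload = os.urandom(1024 * 1024)
store = {}
dedupe_stream(payload, store)                        # first backup populates the store
new = dedupe_stream(b"\0" * 512 + payload, store)    # second backup, shifted by one header
print("new bytes stored for the shifted stream:", new)   # ~512 instead of ~1 MiB

Fed the original stream and then the same stream shifted by one 512 byte header, it ends up storing only the peeled 512 bytes instead of rewriting the whole megabyte.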

In fact, it's rather like the loop peeling optimization in a compiler, which allows you to align the data to a boundary suitable for vectorizing.

Typical. And no doubt they complain that ZFS isn't doing what they want,
rather than netbackup not co-operating. The solution to one misdesign
isn't an expensive bodge. The solution to this particular problem is to
make netbackup work on a per-file rather than per-stream basis.
I'd agree if it was just limited to netbackup... I know variable block length is a significantly more difficult problem than fixed block level. That's why the ZFS team made the design choice they did. Variable length is also the reason why the DataDomain solution is a scale-out rather than scale-up approach. However, CPUs get faster and faster - eventually they'll be able to handle it. So the right solution (from my limited point of view; as I said, I'm not a filesystem design expert) would be to implement the data structures to handle variable length. Then, in the first iteration, implement the dedupe algorithm to only search on filesystem blocks using existing checksums and such. Less CPU usage, quicker development, easier debugging. Once that is stable and proven, you can then, without requiring the user to reformat, go ahead and implement variable length dedupe...

Actually, see above - I believe I was wrong about how expensive "variable length" block size is likely to be. It's more expensive, sure, but not orders of magnitude more expensive, and as discussed earlier, given the CPU isn't really the key bottleneck here, I think it'd be quite workable.
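For completeness, the generic technique usually behind "variable length" dedupe is content-defined chunking (Rabin-style fingerprinting): chunk boundaries are picked from the data itself, so an inserted 512 bytes only disturbs the chunk(s) containing the insertion and everything after them still matches. A toy sketch of the idea - the rolling hash and the parameters here are made up for illustration, this is not DataDomain's or anyone's actual algorithm:

import hashlib
import os

MASK = (1 << 13) - 1          # expect a boundary roughly every 8 KiB of content
MIN_CHUNK, MAX_CHUNK = 2048, 65536

def chunks(data):
    # cut variable-length chunks at positions chosen from the content itself;
    # the shifting 32-bit hash forgets bytes more than 32 positions back, so
    # boundary decisions depend on nearby data, not on absolute file offsets
    start = i = 0
    h = 0
    while i < len(data):
        h = ((h << 1) + data[i]) & 0xFFFFFFFF
        i += 1
        size = i - start
        if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
            yield data[start:i]
            start = i
            h = 0
    if start < len(data):
        yield data[start:]

def chunk_hashes(data):
    return {hashlib.sha256(c).hexdigest() for c in chunks(data)}

payload = os.urandom(4 * 1024 * 1024)
a = chunk_hashes(payload)
b = chunk_hashes(b"\0" * 512 + payload)
print("chunks shared despite the 512 byte shift:", len(a & b), "of", len(a))

The cost is hashing every byte to find the boundaries on top of the per-chunk checksums, which is where the extra CPU goes, but it's linear work rather than anything exotic.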

Btw, thanks for your time, Gordan :)

You're welcome. :)

Gordan
