Peter A wrote:
On Thursday, January 06, 2011 09:00:47 am you wrote:
Peter A wrote:
I'm saying in a filesystem it doesn't matter - if you bundle everything
into a backup stream, it does. Think of tar. 512 byte alignment. I tar
up a directory with 8TB total size. No big deal. Now I create a new,
empty file in this dir with a name that just happens to be the first in
the dir. This adds 512 bytes close to the beginning of the tar file the
second time I run tar. Now the remainder of the file is all offset by
512 bytes and, if you do dedupe on fs-block-sized chunks larger than
512 bytes, not a single byte will be de-duped.
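To make the offset effect concrete, here's a quick toy sketch (purely illustrative Python, nothing to do with the actual btrfs or ZFS code): hash the same megabyte of data in fixed 4 KiB blocks, once as-is and once behind a single extra 512 byte header, and compare the block hashes.

import hashlib
import os

BLOCK = 4096  # hypothetical fixed dedupe block size

def block_hashes(data):
    # hash every fixed-size block, the way a block-level dedupe engine would
    return {hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)}

payload = os.urandom(1024 * 1024)       # stand-in for the archived file data
run1 = payload                          # first tar run
run2 = b"\0" * 512 + payload            # second run: one new 512 byte header up front

shared = block_hashes(run1) & block_hashes(run2)
print("blocks shared between the two runs:", len(shared))   # almost certainly 0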
OK, I get what you mean now. And I don't think this is something that
should be solved in the file system.
<snip>
Whether that is a worthwhile thing to do for poorly designed backup
solutions is debatable, but I'm not convinced about the general use-case.
It'd be very expensive and complicated for seemingly very limited benefit.
Glad I finally explained myself properly... Unfortunately I disagree with you on the rest. If you take that logic, then I could claim dedupe is nothing a file system should handle - after all, it's the user's poorly designed applications that store multiple copies of data. Why should the fs take care of that?

There is merit in that point. Some applications do in fact do their own deduplication, as mentioned previously on this thread.

The problem doesn't just affect backups. It affects everything where you have large data files that are not forced to align with filesystem blocks. In addition to the case I mentioned above, this hurts dedupe effectiveness in pretty much the same way for:
* Database dumps
* Video editing
* Files backing iSCSI volumes
* VM images (fs blocks inside the VM rarely align with fs blocks in the backing storage)
Our VM environment is backed with a 7410 and we get only about 10% dedupe. Copying the same images to a DataDomain results in a 60% reduction in space used.

I'd be interested to hear about the relative write performance with variable block sizes.

I also have to argue the point that these usages are "poorly designed". "Poorly designed" can only be judged against technologies that existed or were talked about at the time the design was made. Tar and such have been around for a long time, way before anyone even thought of dedupe. In addition, until there is a commonly accepted/standard API to query the block size so apps can generate files appropriately laid out for the backing filesystem, what is the application supposed to do?
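For reference, about the only thing an application can portably query today is the preferred I/O block size via stat()/statvfs() - which is an I/O hint, not the dedupe chunk size or alignment, so it doesn't really close the gap. A small sketch on a POSIX system:

import os

st = os.stat(".")
print("st_blksize (preferred I/O size):", st.st_blksize)   # an I/O hint, not dedupe geometry

vfs = os.statvfs(".")
print("f_bsize (filesystem block size):", vfs.f_bsize)
print("f_frsize (fragment size):", vfs.f_frsize)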

Indeed, this goes philosophically in the direction in which storage vendors have been (unsuccessfully) trying to shift the industry for decades - object-based storage. But, sadly, it hasn't happened (yet?).

If anything, I would actually argue the opposite, that fixed block dedupe is a poor design:
* The problem was known at the time the design was made
* No alternative can be offered, as tar, netbackup, video editing, ... have been around for a long time and are unlikely to change in the near future
* There is no standard API to query the alignment parameters (and even that would not be great, since copying a file aligned for 8k to a 16k-aligned filesystem would potentially cause the same issue again)

Also, from the human perspective, it's hard to make end users understand your point of view. I promote the 7000 series of storage and I know how hard it is to explain the dedupe behavior there. They see that DataDomain does it, and does it well. So why can't solution xyz do it just as well?

I'd be interested to see the evidence for the "variable length" argument. I have a sneaky suspicion that it actually falls back to 512 byte blocks, which are much more likely to align, when more sensibly sized blocks fail. The downside is that you don't really want to store a 32 byte hash key with every 512 bytes of data, so you could peel 512 byte blocks off the front in the hope that a bigger block that follows will match.

Thinking about it, this might actually not be too expensive to do. If the 4KB block doesn't match, check 512 byte sub-blocks and try peeling them off to make the next 4KB block line up. If that doesn't produce a match, store the mismatch as a full 4KB block and resume. If you do find a match, save the peeled 512 byte blocks separately and dedupe the realigned 4KB block.
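Something like the following rough sketch, assuming hypothetical 4 KiB / 512 byte sizes and a plain in-memory hash table standing in for the dedupe store (an illustration of the strategy only, not real filesystem code):

import hashlib
import os

BLOCK = 4096   # hypothetical dedupe block size
SUB = 512      # hypothetical peel granularity

def sha(b):
    return hashlib.sha256(b).hexdigest()

def dedupe_stream(data, store):
    # store: dict of hash -> 4 KiB block; returns bytes of new data written
    written = 0
    pos = 0
    while pos < len(data):
        block = data[pos:pos + BLOCK]
        if len(block) < BLOCK:                       # short tail: just store it
            written += len(block)
            break
        if sha(block) in store:                      # whole 4 KiB block dedupes
            pos += BLOCK
            continue
        # no match: try peeling 512 byte pieces off the front until a
        # realigned 4 KiB block is found in the store
        for peel in range(SUB, BLOCK, SUB):
            candidate = data[pos + peel:pos + peel + BLOCK]
            if len(candidate) == BLOCK and sha(candidate) in store:
                written += peel                      # the peeled prefix is stored as-is
                pos += peel                          # next iteration dedupes 'candidate'
                break
        else:                                        # no realignment found: keep the block
            store[sha(block)] = block
            written += BLOCK
            pos += BLOCK
    return written

payload = os.urandom(1024 * 1024)
store = {}
dedupe_stream(payload, store)                        # first backup populates the store
new = dedupe_stream(b"\0" * 512 + payload, store)    # second backup, shifted by one header
print("new bytes stored for the shifted stream:", new)   # ~512 instead of ~1 MiB

Fed the original stream and then the same stream shifted by one 512 byte header, it ends up storing only the peeled 512 bytes instead of rewriting the whole megabyte.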

In fact, it's rather like the loop peeling optimization in a compiler, which allows you to align the data to a boundary suitable for vectorizing.

Typical. And no doubt they complain that ZFS isn't doing what they want,
rather than netbackup not co-operating. The solution to one misdesign
isn't an expensive bodge. The solution to this particular problem is to
make netbackup work on a per-file rather than per-stream basis.
I'd agree if it was just limited to netbackup... I know variable block length is a significantly more difficult problem than fixed block level. That's why the ZFS team made the design choice they did. Variable length is also the reason why the DataDomain solution is a scale-out rather than scale-up approach. However, CPUs get faster and faster - eventually they'll be able to handle it. So the right solution (from my limited point of view; as I said, I'm not a filesystem design expert) would be to implement the data structures to handle variable length. Then, in the first iteration, implement the dedupe algorithm to only search on filesystem blocks using existing checksums and such. Less CPU usage, quicker development, easier debugging. Once that is stable and proven, you can then, without requiring the user to reformat, go ahead and implement variable length dedupe...

Actually, see above - I believe I was wrong about how expensive "variable length" block size is likely to be. It's more expensive, sure, but not orders of magnitude more expensive, and as discussed earlier, given the CPU isn't really the key bottleneck here, I think it'd be quite workable.
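For completeness, the generic technique usually behind "variable length" dedupe is content-defined chunking (Rabin-style fingerprinting): chunk boundaries are picked from the data itself, so an inserted 512 bytes only disturbs the chunk(s) containing the insertion and everything after them still matches. A toy sketch of the idea - the rolling hash and the parameters here are made up for illustration, this is not DataDomain's or anyone's actual algorithm:

import hashlib
import os

MASK = (1 << 13) - 1          # expect a boundary roughly every 8 KiB of content
MIN_CHUNK, MAX_CHUNK = 2048, 65536

def chunks(data):
    # cut variable-length chunks at positions chosen from the content itself;
    # the shifting 32-bit hash forgets bytes more than 32 positions back, so
    # boundary decisions depend on nearby data, not on absolute file offsets
    start = i = 0
    h = 0
    while i < len(data):
        h = ((h << 1) + data[i]) & 0xFFFFFFFF
        i += 1
        size = i - start
        if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
            yield data[start:i]
            start = i
            h = 0
    if start < len(data):
        yield data[start:]

def chunk_hashes(data):
    return {hashlib.sha256(c).hexdigest() for c in chunks(data)}

payload = os.urandom(4 * 1024 * 1024)
a = chunk_hashes(payload)
b = chunk_hashes(b"\0" * 512 + payload)
print("chunks shared despite the 512 byte shift:", len(a & b), "of", len(a))

The cost is hashing every byte to find the boundaries on top of the per-chunk checksums, which is where the extra CPU goes, but it's linear work rather than anything exotic.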

Btw, thanks for your time, Gordan :)

You're welcome. :)

Gordan
