On Wed, 08 May 2013 01:04:38 +0200, Kai Krakow wrote:
> Gabriel de Perthuis <g2p.c...@gmail.com> schrieb:
>> It sounds simple, and was sort-of prompted by the new syscall taking
>> short ranges, but it is tricky figuring out a sane heuristic (when to
>> hash, when to bail, when to submit without comparing, what should be the
>> source in the last case), and it's not something I have an immediate
>> need for.  It is also possible to use 9p (with standard cow and/or
>> small-file dedup) and trade a bit of configuration for much more
>> space-efficient VMs.
>> 
>> Finer-grained tracking of which ranges have changed, and maybe some
>> caching of range hashes, would be a good first step before doing any
>> crazy large-file heuristics.  The hash caching would actually benefit
>> all use cases.
> 
> Looking back to the good old peer-to-peer days (I think we all came into
> contact with that one way or another), one name pops back into my mind:
> tiger-tree-hash...
> 
> I'm not really into it, but would it be possible to use tiger-tree-hashes to
> find identical blocks? Even across different-sized files...

Possible, but bedup is all about doing as little I/O as it can get away
with: it only does streaming reads once sampling suggests the files are
likely duplicates, and it doesn't spend a ton of disk space on indexing.
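To make the filtering idea concrete, here is a rough sketch of a size-then-sample pre-filter (function names, offsets, and constants are hypothetical illustrations, not bedup's actual code): files are first grouped by size, then a few small chunks are read at fixed offsets, and only groups that survive both cheap filters are worth a full streaming comparison.

```python
import os
from collections import defaultdict

SAMPLE_SIZE = 4096  # bytes read per probe (illustrative)

def sample_key(path, size):
    """Read a few small chunks at fixed offsets; a cheap pre-filter
    that avoids streaming whole files that obviously differ."""
    offsets = [0, size // 2, max(size - SAMPLE_SIZE, 0)]
    chunks = []
    with open(path, 'rb') as f:
        for off in offsets:
            f.seek(off)
            chunks.append(f.read(SAMPLE_SIZE))
    return b''.join(chunks)

def likely_duplicates(paths):
    """Group files by size, then by sampled bytes; only groups that
    survive both filters are candidates for a full comparison."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    candidates = []
    for size, group in by_size.items():
        if len(group) < 2 or size == 0:
            continue  # unique size or empty file: nothing to compare
        by_sample = defaultdict(list)
        for p in group:
            by_sample[sample_key(p, size)].append(p)
        candidates.extend(g for g in by_sample.values() if len(g) > 1)
    return candidates
```

The point of the sketch is the ordering: the expensive full read happens only for the (usually tiny) set of files that already agree on size and on every sampled chunk.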

Hashing everything in the hope that there are identical blocks at
unrelated places on the disk is a much more resource-intensive approach;
Liu Bo is working on that, following ZFS's design choices.
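For contrast, here is a toy sketch of the hash-everything approach (again a hypothetical illustration, not Liu Bo's patches): every fixed-size block of every file is hashed into an index, so identical blocks at unrelated offsets do show up, but at the cost of reading and hashing the entire data set up front.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 4096  # illustrative; a real filesystem dedups at its own block size

def index_blocks(paths):
    """Hash every fixed-size block of every file.  Identical digests at
    unrelated (path, offset) pairs reveal shareable blocks -- the win --
    while the full read+hash of everything is the resource cost."""
    index = defaultdict(list)  # digest -> [(path, offset), ...]
    for path in paths:
        with open(path, 'rb') as f:
            offset = 0
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                digest = hashlib.sha256(block).digest()
                index[digest].append((path, offset))
                offset += len(block)
    # keep only digests seen more than once: the dedupable blocks
    return {h: locs for h, locs in index.items() if len(locs) > 1}
```

Unlike the sampling filter above, this finds cross-file, cross-offset matches, which is exactly why it cannot avoid touching all the data.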


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html