Josef Bacik wrote:
> Basically I think online dedup is huge waste of time and completely useless.
I couldn't disagree more. First, let's consider the general-purpose
use-case of data deduplication. What are the resource requirements to
perform it, and how do those requirements differ between online and
offline?
The only sane way to keep track of hashes of existing blocks is using an
index. Searches through an index containing evenly distributed data
(such as hashes) are pretty fast (O(log N)), and this has to be done
regardless of whether the dedupe is online or offline. It also goes
without saying that all the blocks being deduplicated need to be hashed,
and the cost of this is the same whether the block is hashed online
or offline.
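The index lookup described above can be sketched in a few lines (a hypothetical in-memory index for illustration; a real filesystem would keep this in an on-disk B-tree, but the lookup cost is logarithmic either way):

```python
import hashlib

# Hypothetical in-memory hash index; an actual filesystem would use an
# on-disk B-tree, but the lookup cost is O(log N) regardless.
class BlockIndex:
    def __init__(self):
        self._index = {}  # digest -> physical block address

    def lookup(self, digest):
        return self._index.get(digest)

    def insert(self, digest, addr):
        self._index[digest] = addr

def block_hash(data: bytes) -> bytes:
    # SHA256 digests are evenly distributed, which keeps any
    # tree-structured index well balanced.
    return hashlib.sha256(data).digest()

idx = BlockIndex()
idx.insert(block_hash(b"A" * 4096), 42)
print(idx.lookup(block_hash(b"A" * 4096)))  # address of the existing block
```

The same lookup-or-insert step is the core of both the online and offline approaches; the argument below is only about when it happens.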
Let's look at the relative merits:
1a) Offline
We have to copy the entire data set. This means we are using the full
amount of disk writes that the data set size dictates. Do we do the
hashing of current blocks at this point to create the indexes? Or do we
defer it until some later time?
Doing it at the point of writes is cheaper - we already have the data in
RAM and we can calculate the hashes as we are writing each block.
Performance implications of this are fairly analogous to the parity RAID
RMW performance issue - to achieve decent performance you have to write
the parity at the same time as the rest of the stripe, otherwise you
have to read the part of the stripe you didn't write, before calculating
the checksum.
So by doing the hash indexing offline, the total amount of disk I/O
required effectively doubles, and the amount of CPU spent on doing the
hashing is in no way reduced.
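To make the doubling concrete, here is a back-of-envelope accounting with made-up but illustrative numbers for block count and duplicate fraction:

```python
# Back-of-envelope I/O accounting. N and d are illustrative
# assumptions, not measurements.
N = 1_000_000   # blocks written by the workload
d = 0.30        # fraction of blocks that are duplicates

# Offline: write everything first, then re-read everything in the
# deferred hashing pass, then relink the duplicates.
offline_writes = N
offline_reads = N

# Online: hash in-flight (data is already in RAM); duplicate blocks
# are never written at all.
online_writes = int(N * (1 - d))
online_reads = 0

print("offline block I/Os:", offline_writes + offline_reads)
print("online block I/Os: ", online_writes + online_reads)
```

The CPU cost of hashing is identical in both columns, so the offline variant pays roughly double the disk I/O for nothing in return.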
How is this in any way advantageous?
1b) Online
As we are writing the data, we calculate the hashes for each block. (See
1a for the argument why I believe this is saner and cheaper than doing it
offline.) Since we already have these hashes, we can do a look-up in the
hash-index, and either write out the block as is (if that hash isn't
already in the index) or simply write the pointer to an existing
suitable block (if it already exists). This saves us writing out that
block - fewer writes to the disk, not to mention we don't later have to
re-read the block to dedupe it.
So in this case, instead of write-read-relink of the offline scenario,
we simply do relink on duplicate blocks.
There is another reason to favour the online option due to its lower
write stress: SSDs. Why hammer the SSD with totally unnecessary writes?
The _only_ reason to defer deduping is that hashing costs CPU time. But
the chances are that a modern CPU core can churn out MD5 and/or SHA256
hashes faster than a modern mechanical disk can keep up. A 15,000rpm
disk can theoretically handle 250 IOPS. A modern CPU can handle
considerably more than 250 block hashings per second. You could argue
that this changes in cases of sequential I/O on big files, but a 1.86GHz
Core2 can churn through 111MB/s of SHA256, which even SSDs will
struggle to keep up with.
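The claim is easy to sanity-check: time SHA256 over a burst of 4KiB blocks and compare against the ~250 IOPS figure (a rough measurement, not a benchmark; results will vary by CPU):

```python
import hashlib
import time

# Rough throughput check: hash ~100 MiB worth of 4 KiB blocks and
# compare against the ~250 IOPS a 15,000rpm disk can sustain.
BLOCK = b"\0" * 4096
N = 25_000

t0 = time.perf_counter()
for _ in range(N):
    hashlib.sha256(BLOCK).digest()
elapsed = time.perf_counter() - t0

blocks_per_sec = N / elapsed
print(f"{blocks_per_sec:,.0f} block hashes/s "
      f"({blocks_per_sec * 4096 / 2**20:.0f} MiB/s)")
```

On anything resembling modern hardware this comes out orders of magnitude above 250 blocks/s, which is the point: the hashing CPU cost is not the bottleneck.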
I don't think that the realtime performance argument withstands scrutiny.
> You are going to want to do different things with different data. For example,
> for a mailserver you are going to want to have very small blocksizes, but for
> say a virtualization image store you are going to want much larger blocksizes.
> And lets not get into heterogeneous environments, those just get much too
> complicated.
In terms of deduplication, IMO it should really all be uniform,
transparent, and block based. In terms of specifying which subtrees to
dedupe, that should really be a per subdirectory hereditary attribute,
kind of like compression was supposed to work with chattr +c in the past.
> So my solution is batched dedup, where a user just runs this
> command and it dedups everything at this point. This avoids the very costly
> overhead of having to hash and lookup for duplicate extents online and lets us
> be _much_ more flexible about what we want to deduplicate and how we want to do
> it.
I don't see that it adds any flexibility compared to the hereditary
deduping attribute. I also don't see that it is any cheaper. It's
actually more expensive, according to the reasoning above.
As an aside, zfs and lessfs both do online deduping, presumably for a
good reason.
Then again, for a lot of use-cases there are perhaps better ways to
achieve the intended goal than deduping at the FS level, e.g. snapshotting or
something like fl-cow:
http://www.xmailserver.org/flcow.html
Personally, I would still like to see an fl-cow-like solution that
preserves the inode numbers of duplicate files while providing COW
functionality that breaks this unity (and inode number identity) upon
writes. This specifically saves page cache (only one copy needs to be
cached), and in the case of DLLs under chroot-style virtualization
(OpenVZ, Vserver, LXC) it means that identical DLLs in all the guests
are mapped into the same memory, yielding massive memory savings on
machines with a lot of VMs.
Gordan