Josef Bacik wrote:

> Basically I think online dedup is huge waste of time and completely useless.

I couldn't disagree more. First, let's consider the general-purpose use-case of data deduplication. What are the resource requirements to perform it? How do these requirements differ between online and offline?

The only sane way to keep track of hashes of existing blocks is with an index. Searching an index of evenly distributed keys (such as hashes) is pretty fast - O(log N) - and that search has to happen regardless of whether the dedupe is online or offline. It also goes without saying that every block being deduplicated has to be hashed, and the cost of that is the same whether the block is hashed online or offline.
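To make the shared cost concrete, here is a minimal sketch of the hashing and index lookup (Python, purely illustrative - the block size, the hash choice and the in-memory dict standing in for an on-disk index are all assumptions on my part):

import hashlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only

def block_hashes(path):
    # Yield (offset, digest) for each block of a file.
    with open(path, "rb") as f:
        offset = 0
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            yield offset, hashlib.sha256(block).digest()
            offset += len(block)

# The index maps block hash -> location of an existing copy of that block.
# A real filesystem would keep this in an on-disk B-tree (hence the O(log N)
# lookups); a dict is just a stand-in for the sketch.
index = {}

def lookup_or_insert(digest, location):
    # Return the location of an existing identical block, or record this one.
    existing = index.get(digest)
    if existing is None:
        index[digest] = location
        return None        # no duplicate - the block has to be stored
    return existing        # duplicate found - we can reference the existing block

Both the online and the offline approach end up running exactly this hash-and-lookup work over every block; the only question is when it runs, and how much extra I/O it takes to get the data in front of the hash function.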

Let's look at the relative merits:

1a) Offline
We have to write out the entire data set in full, which means we incur the full amount of disk writes that the data-set size dictates. Do we do the hashing of those blocks at this point to create the index? Or do we defer it until some later time?

Doing it at the point of writing is cheaper - we already have the data in RAM and can calculate the hashes as we write each block out. The performance implications are fairly analogous to the parity RAID RMW problem: to achieve decent performance you have to write the parity at the same time as the rest of the stripe, otherwise you first have to read back the part of the stripe you didn't write before you can calculate the parity.

So by deferring the hash indexing until later, the total amount of disk I/O required effectively doubles - every block is written once and then read back just to be hashed - while the amount of CPU spent on the hashing is in no way reduced.
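A back-of-the-envelope illustration of that doubling (the data-set size is an arbitrary assumption):

# Rough I/O accounting for building the hash index (all figures assumed).
data_set_gb = 1000                          # arbitrary example data set

# Hashing at write time: the data passes through RAM once on its way to disk.
online_io_gb = data_set_gb                  # writes only

# Hashing later, offline: everything is written, then read back to be hashed.
offline_io_gb = data_set_gb + data_set_gb   # writes + read-back

print(offline_io_gb / online_io_gb)         # -> 2.0, roughly double the I/O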

How is this in any way advantageous?

1b) Online
As we are writing the data, we calculate the hash for each block (see 1a for why I believe this is saner and cheaper than doing it offline). Since we already have the hash, we can look it up in the hash index and either write the block out as-is (if that hash isn't already in the index) or simply write a pointer to an existing, suitable block (if it is). That saves us writing the block out at all - fewer writes to the disk - and we never have to re-read the block later in order to dedupe it.

So in this case, instead of write-read-relink of the offline scenario, we simply do relink on duplicate blocks.
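The write path then looks roughly like this (again just a sketch, reusing the hypothetical lookup_or_insert() from earlier; the two callbacks are stand-ins for real allocator/metadata code, not anything that exists):

import hashlib

def dedup_write(incoming_blocks, write_block, write_ref):
    # incoming_blocks       - iterable of (offset, block) pairs being written
    # write_block(off, blk) - hypothetical callback: store new data on disk
    # write_ref(off, loc)   - hypothetical callback: just record a reference
    #                         to an already-stored identical block
    for offset, block in incoming_blocks:
        digest = hashlib.sha256(block).digest()
        existing = lookup_or_insert(digest, offset)
        if existing is None:
            write_block(offset, block)    # unique data: pay for the disk write
        else:
            write_ref(offset, existing)   # duplicate: relink, no data write at all

The expensive part (hash plus index lookup) is identical to the offline case; the only thing the online case adds at write time is the branch, and what it removes is the later read-back of every block.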

There is another reason to favour the online option due to its lower write stress - SSDs. Why hammer the SSD with totally unnecessary writes?

The _only_ reason to defer deduping is that hashing costs CPU time. But the chances are that a modern CPU core can churn out MD5 and/or SHA256 hashes faster than a modern mechanical disk can keep up. A 15,000rpm disk can theoretically handle about 250 IOPS. A modern CPU can handle considerably more than 250 block hashings per second. You could argue that this changes in cases of sequential I/O on big files, but a 1.86GHz Core2 can churn through 111MB/s of SHA256, which even SSDs will struggle to keep up with.
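To put rough numbers on it (the block size is assumed; the throughput figures are the ones quoted above):

BLOCK_SIZE = 4096                              # assume 4KiB blocks
disk_iops = 250                                # ~15,000rpm disk, random I/O
cpu_sha256_mb_s = 111                          # the Core2 figure quoted above

disk_mb_s = disk_iops * BLOCK_SIZE / 1e6       # ~1 MB/s of random 4KiB I/O
cpu_blocks_per_s = cpu_sha256_mb_s * 1e6 / BLOCK_SIZE   # ~27,000 block hashes/s

print(disk_mb_s, cpu_blocks_per_s)             # the CPU out-hashes the disk
                                               # by roughly two orders of magnitude

So on random I/O a single core out-hashes the disk by a couple of orders of magnitude, and even on streaming I/O it keeps up with anything short of a fast SSD.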

I don't think that the realtime performance argument withstands scrutiny.

> You are going to want to do different things with different data.  For example,
> for a mailserver you are going to want to have very small blocksizes, but for
> say a virtualization image store you are going to want much larger blocksizes.
> And lets not get into heterogeneous environments, those just get much too
> complicated.

In terms of deduplication, IMO it should really all be uniform, transparent and block-based. As for specifying which subtrees to dedupe, that should be a per-subdirectory hereditary attribute, kind of like compression was supposed to work with chattr +c in the past.
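To illustrate what I mean by hereditary - the nearest ancestor directory carrying the attribute decides - here is a sketch (the "dedupe" attribute and the in-memory attribute table are entirely hypothetical; in a real filesystem this would live in the inode/xattr metadata):

import os

# Hypothetical per-directory attribute store: path -> set of attribute names.
dir_attrs = {
    "/srv/vm-images": {"dedupe"},
}

def wants_dedupe(path):
    # Walk up the directory tree; the nearest ancestor carrying the
    # attribute wins, like an inherited chattr-style flag.
    current = os.path.dirname(os.path.abspath(path))
    while True:
        if "dedupe" in dir_attrs.get(current, set()):
            return True
        parent = os.path.dirname(current)
        if parent == current:           # reached the filesystem root
            return False
        current = parent

# wants_dedupe("/srv/vm-images/guest1/disk.img") -> True
# wants_dedupe("/home/user/mail/cur/12345")      -> False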

> So my solution is batched dedup, where a user just runs this
> command and it dedups everything at this point.  This avoids the very costly
> overhead of having to hash and lookup for duplicate extents online and lets us
> be _much_ more flexible about what we want to deduplicate and how we want to do
> it.

I don't see that it adds any flexibility compared to the hereditary deduping attribute. I also don't see that it is any cheaper. It's actually more expensive, according to the reasoning above.

As an aside, zfs and lessfs both do online deduping, presumably for a good reason.

Then again, for a lot of use-cases there are perhaps better ways to achieve the target goal than deduping at the FS level, e.g. snapshotting or something like fl-cow:
http://www.xmailserver.org/flcow.html

Personally, I would still like to see an fl-cow-like solution that preserves the inode numbers of duplicate files while providing COW functionality that breaks this unity (and the inode-number identity) upon write. The specific attraction is that it saves page cache (only one copy has to be cached), and in the case of DLLs under chroot-style virtualization (OpenVZ, VServer, LXC) it means that identical DLLs in all the guests are mapped into the same memory, yielding massive memory savings on machines with a lot of VMs.
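A quick illustration of why the inode number matters for the page cache (a throwaway script, file names made up; a hard link shares one inode the way an fl-cow "copy" would until written to, while a plain copy does not):

import os, shutil, tempfile

# Hard-linked files share one inode, so the kernel caches (and mmaps) their
# contents once; ordinary copies each get their own inode and their own
# page-cache pages. An fl-cow-style scheme would keep the shared inode until
# one of the "copies" is actually written to.
tmp = tempfile.mkdtemp()
orig = os.path.join(tmp, "libfoo.so")        # made-up file names
link = os.path.join(tmp, "libfoo-link.so")
copy = os.path.join(tmp, "libfoo-copy.so")

with open(orig, "wb") as f:
    f.write(b"\x7fELF" + b"\0" * 1020)       # dummy contents

os.link(orig, link)                          # hard link: same inode
shutil.copy(orig, copy)                      # plain copy: new inode

print(os.stat(orig).st_ino == os.stat(link).st_ino)   # True  - cached once
print(os.stat(orig).st_ino == os.stat(copy).st_ino)   # False - cached twice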

Gordan