On Mon, Sep 12, 2016 at 12:56:03PM -0400, Austin S. Hemmelgarn wrote:
> 4. File Range Cloning and Out-of-band Dedupe: Similarly, work fine if
> the FS is healthy.
I've found issues with OOB dedup (clone/extent-same):

1. Don't dedup data that has not been committed: either call fsync() on it, check the generation numbers on each extent before deduping it, or make sure the data is not being actively modified during dedup; otherwise, a race condition may lead to the filesystem locking up and becoming inaccessible until the kernel is rebooted. This is particularly important if you are doing bedup-style incremental dedup on a live system.

I've worked around #1 by placing a fsync() call on the src FD immediately before calling FILE_EXTENT_SAME. In an A/B experiment with and without the fsync, "with-fsync" runs for weeks at a time without issues, while "without-fsync" hangs, sometimes in just a matter of hours. Note that the fsync() doesn't resolve the underlying race condition; it just makes the filesystem hang less often.

2. There is a practical limit to the number of times a single duplicate extent can be deduplicated. As more references to a shared extent are created, any part of the filesystem that uses the backref walking code gets slower. This includes dedup itself, balance, device replace/delete, FIEMAP, LOGICAL_INO, and mmap() (which can be bad news if the duplicate files are executables). Several factors (including file size and number of snapshots) are involved, making it difficult to devise workarounds or set up test cases. 99.5% of the time, these operations just get slower by a few milliseconds each time a new reference is created; the other 0.5% of the time, write operations abruptly grow to consume hours of CPU time or dozens of gigabytes of RAM (in millions of kmalloc-32 slabs) when they touch one of these over-shared extents. When this occurs, it effectively (but not literally) crashes the host machine.

I've worked around #2 by building tables of "toxic" hashes that occur too frequently in a filesystem to be deduped, and using these tables in the dedup software to ignore any duplicate data matching them.
These tables can be relatively small as they only need to list hashes that are repeated more than a few thousand times, and typical filesystems (up to 10TB or so) have only a few hundred such hashes. I happened to have a couple of machines taken down by these issues this very weekend, so I can confirm the issues are present in kernels 4.4.21, 4.5.7, and 4.7.4.