On 2016-09-18 23:47, Zygo Blaxell wrote:
OK, that's good to know. In my case, I'm not operating on a very big
data set (less than 40GB, but the storage cluster I'm doing this on only
has about 200GB of total space, so I'm trying to conserve as much as
possible), and it's mostly static data (less than 100MB worth of changes
a day except on Sunday when I run backups), so it makes sense that I've
not seen either of these issues.
On Mon, Sep 12, 2016 at 12:56:03PM -0400, Austin S. Hemmelgarn wrote:
4. File Range Cloning and Out-of-band Dedupe: Similarly, work fine if the FS
I've found issues with OOB dedup (clone/extent-same):
1. Don't dedup data that has not been committed--either call fsync()
on it, or check the generation numbers on each extent before deduping
it, or make sure the data is not being actively modified during dedup;
otherwise, a race condition may lead to the filesystem locking up and
becoming inaccessible until the kernel is rebooted. This is particularly
important if you are doing bedup-style incremental dedup on a live system.
I've worked around #1 by placing a fsync() call on the src FD immediately
before calling FILE_EXTENT_SAME. When I do an A/B experiment with and
without the fsync, "with-fsync" runs for weeks at a time without issues,
while "without-fsync" hangs, sometimes in just a matter of hours. Note
that the fsync() doesn't resolve the underlying race condition; it just
makes the filesystem hang less often.
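
For illustration, the fsync-before-dedup workaround could be sketched roughly
as below. This is not the code from my dedup tool, just a minimal Python
sketch under some assumptions: it uses the generic FIDEDUPERANGE ioctl, which
shares its request number with BTRFS_IOC_FILE_EXTENT_SAME, and the helper
names (build_request, fsync_then_dedupe) and the injectable ioctl parameter
are made up for the example.

```python
import fcntl
import os
import struct

# _IOWR(0x94, 54, struct file_dedupe_range) on Linux; the same request
# number that btrfs exposes as BTRFS_IOC_FILE_EXTENT_SAME.
FIDEDUPERANGE = 0xC0189436

# struct file_dedupe_range: src_offset, src_length, dest_count, reserved1, reserved2
HEADER = struct.Struct("=QQHHI")
# struct file_dedupe_range_info: dest_fd, dest_offset, bytes_deduped, status, reserved
INFO = struct.Struct("=qQQiI")


def build_request(src_offset, length, dest_fd, dest_offset):
    """Pack the ioctl header followed by a single destination entry."""
    return bytearray(
        HEADER.pack(src_offset, length, 1, 0, 0)
        + INFO.pack(dest_fd, dest_offset, 0, 0, 0)
    )


def fsync_then_dedupe(src_fd, src_offset, length, dest_fd, dest_offset,
                      ioctl=fcntl.ioctl):
    """fsync the source FD, then request dedup of one range.

    Returns (bytes_deduped, status) as filled in by the kernel for the
    destination. The ioctl parameter is injectable (an assumption of this
    sketch) so the function can be exercised without a btrfs mount.
    """
    os.fsync(src_fd)  # commit the source data before the extent-same call
    buf = build_request(src_offset, length, dest_fd, dest_offset)
    ioctl(src_fd, FIDEDUPERANGE, buf)  # kernel writes results back into buf
    _, _, bytes_deduped, status, _ = INFO.unpack_from(buf, HEADER.size)
    return bytes_deduped, status
```

A status of 0 means the ranges matched and were deduped, 1 means the data
differed, and a negative value is an errno for that destination. As noted
above, the fsync() only narrows the race window; it does not close it.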
2. There is a practical limit to the number of times a single duplicate
extent can be deduplicated. As more references to a shared extent
are created, any part of the filesystem that uses backref walking code
gets slower. This includes dedup itself, balance, device replace/delete,
FIEMAP, LOGICAL_INO, and mmap() (which can be bad news if the duplicate
files are executables). Several factors (including file size and number
of snapshots) are involved, making it difficult to devise workarounds or
set up test cases. 99.5% of the time, these operations just get slower
by a few ms each time a new reference is created, but the other 0.5% of
the time, write operations will abruptly grow to consume hours of CPU
time or dozens of gigabytes of RAM (in millions of kmalloc-32 slabs)
when they touch one of these over-shared extents. When this occurs,
it effectively (but not literally) crashes the host machine.
I've worked around #2 by building tables of "toxic" hashes that occur too
frequently in a filesystem to be deduped, and using these tables in dedup
software to ignore any duplicate data matching them. These tables can
be relatively small as they only need to list hashes that are repeated
more than a few thousand times, and typical filesystems (up to 10TB or
so) have only a few hundred such hashes.
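
The toxic-hash idea can be sketched in a few lines. Again, this is only an
illustration and not my actual implementation: the function names, the
(hash, location) block representation, and the default threshold of 1000
are assumptions chosen for the example.

```python
from collections import Counter


def build_toxic_table(block_hashes, threshold=1000):
    """Return the set of hashes that occur more than `threshold` times."""
    counts = Counter(block_hashes)
    return {h for h, n in counts.items() if n > threshold}


def dedupe_groups(blocks, toxic):
    """Group block locations by hash, skipping anything in the toxic table.

    `blocks` is an iterable of (hash, location) pairs. Only hashes seen at
    more than one location are returned, since a single copy has nothing
    to dedup against.
    """
    groups = {}
    for block_hash, location in blocks:
        if block_hash in toxic:
            continue  # over-shared data: leave it alone
        groups.setdefault(block_hash, []).append(location)
    return {h: locs for h, locs in groups.items() if len(locs) > 1}
```

Because only hashes above the threshold are stored, the table stays small
even on a large filesystem, and the lookup costs one set membership test
per block.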
I happened to have a couple of machines taken down by these issues this
very weekend, so I can confirm the issues are present in kernels 4.4.21,
4.5.7, and 4.7.4.
The second one sounds like the same performance issue caused by having
very large numbers of snapshots, and based on what's happening, I don't
think there's any way we could fix it without rewriting certain core code.