On Mon, Sep 19, 2016 at 08:32:14AM -0400, Austin S. Hemmelgarn wrote:
> On 2016-09-18 23:47, Zygo Blaxell wrote:
> >On Mon, Sep 12, 2016 at 12:56:03PM -0400, Austin S. Hemmelgarn wrote:
> >>4. File Range Cloning and Out-of-band Dedupe: Similarly, work fine if the FS
> >>is healthy.
> >
> >I've found issues with OOB dedup (clone/extent-same):
> >
> >1.  Don't dedup data that has not been committed--either call fsync()
> >on it, or check the generation numbers on each extent before deduping
> >it, or make sure the data is not being actively modified during dedup;
> >otherwise, a race condition may lead to the filesystem locking up and
> >becoming inaccessible until the kernel is rebooted.  This is particularly
> >important if you are doing bedup-style incremental dedup on a live system.
> >
> >I've worked around #1 by placing a fsync() call on the src FD immediately
> >before calling FILE_EXTENT_SAME.  When I do an A/B experiment with and
> >without the fsync, "with-fsync" runs for weeks at a time without issues,
> >while "without-fsync" hangs, sometimes in just a matter of hours.  Note
> >that the fsync() doesn't resolve the underlying race condition; it just
> >makes the filesystem hang less often.
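
To make that workaround concrete, here's a minimal standalone sketch of
the fsync-then-dedup sequence, using the BTRFS_IOC_FILE_EXTENT_SAME
interface from <linux/btrfs.h>.  The file names, offsets, and 128K
length are placeholders for illustration, not what my dedup tool
actually uses:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

int main(void)
{
	int src = open("src_file", O_RDONLY);
	int dst = open("dst_file", O_RDWR);
	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}

	/* Flush uncommitted data on the source first; without this,
	 * EXTENT_SAME can race with in-flight writes (issue #1). */
	if (fsync(src) != 0) {
		perror("fsync");
		return 1;
	}

	/* One source range, one destination file. */
	struct btrfs_ioctl_same_args *args =
		calloc(1, sizeof(*args) +
			  sizeof(struct btrfs_ioctl_same_extent_info));
	args->logical_offset = 0;	/* duplicate range start in src */
	args->length = 128 * 1024;	/* bytes to dedup (placeholder) */
	args->dest_count = 1;
	args->info[0].fd = dst;
	args->info[0].logical_offset = 0;	/* matching range in dst */

	if (ioctl(src, BTRFS_IOC_FILE_EXTENT_SAME, args) < 0)
		perror("BTRFS_IOC_FILE_EXTENT_SAME");
	else
		printf("status=%d, deduped %llu bytes\n",
		       args->info[0].status,
		       (unsigned long long)args->info[0].bytes_deduped);

	free(args);
	close(src);
	close(dst);
	return 0;
}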
> >
> >2.  There is a practical limit to the number of times a single duplicate
> >extent can be deduplicated.  As more references to a shared extent
> >are created, any part of the filesystem that uses backref walking code
> >gets slower.  This includes dedup itself, balance, device replace/delete,
> >FIEMAP, LOGICAL_INO, and mmap() (which can be bad news if the duplicate
> >files are executables).  Several factors (including file size and number
> >of snapshots) are involved, making it difficult to devise workarounds or
> >set up test cases.  99.5% of the time, these operations just get slower
> >by a few ms each time a new reference is created, but the other 0.5% of
> >the time, write operations will abruptly grow to consume hours of CPU
> >time or dozens of gigabytes of RAM (in millions of kmalloc-32 slabs)
> >when they touch one of these over-shared extents.  When this occurs,
> >it effectively (but not literally) crashes the host machine.
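
The backref walk cost is easy to observe directly: LOGICAL_INO takes a
logical byte address and walks every reference to it, so timing the
ioctl on a suspect extent shows the slowdown as the reference count
grows.  A rough sketch (the mount point and logical address are
placeholder arguments, and the ioctl needs CAP_SYS_ADMIN):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

int main(int argc, char **argv)
{
	if (argc < 3) {
		fprintf(stderr, "usage: %s <mountpoint> <logical-byte-addr>\n",
			argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Buffer for the (root, inode, offset) triples that come back. */
	struct btrfs_data_container *refs = calloc(1, 65536);

	struct btrfs_ioctl_logical_ino_args args = {
		.logical = strtoull(argv[2], NULL, 0),
		.size    = 65536,
		.inodes  = (uintptr_t)refs,
	};

	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	int ret = ioctl(fd, BTRFS_IOC_LOGICAL_INO, &args);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
		    (t1.tv_nsec - t0.tv_nsec) / 1e6;
	if (ret < 0)
		perror("BTRFS_IOC_LOGICAL_INO");
	else	/* each reference is 3 u64s in the container */
		printf("%u references in %.1f ms\n", refs->elem_cnt / 3, ms);

	free(refs);
	close(fd);
	return 0;
}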
> >
> >I've worked around #2 by building tables of "toxic" hashes that occur too
> >frequently in a filesystem to be deduped, and using these tables in dedup
> >software to ignore any duplicate data matching them.  These tables can
> >be relatively small as they only need to list hashes that are repeated
> >more than a few thousand times, and typical filesystems (up to 10TB or
> >so) have only a few hundred such hashes.
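
The counting side of that is simple.  A sketch of the idea, not my
actual implementation--the 4096 threshold and table size are arbitrary
illustrative values, hash value 0 is reserved as the empty-slot marker
for brevity, and a real implementation would bound the probe and
resize the table:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE	(1UL << 20)	/* must be a power of two */
#define TOXIC_LIMIT	4096		/* "more than a few thousand times" */

static uint64_t keys[TABLE_SIZE];
static uint32_t counts[TABLE_SIZE];

/* Count one more occurrence of a block hash; returns true once the
 * hash has crossed the toxicity threshold, i.e. the dedup engine
 * should stop creating new references to data matching it. */
static bool hash_is_toxic(uint64_t hash)
{
	size_t i = hash & (TABLE_SIZE - 1);

	while (keys[i] != 0 && keys[i] != hash)
		i = (i + 1) & (TABLE_SIZE - 1);	/* linear probing */
	keys[i] = hash;
	if (counts[i] < UINT32_MAX)
		counts[i]++;
	return counts[i] > TOXIC_LIMIT;
}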
> >
> >I happened to have a couple of machines taken down by these issues this
> >very weekend, so I can confirm the issues are present in kernels 4.4.21,
> >4.5.7, and 4.7.4.
> OK, that's good to know.  In my case, I'm not operating on a very big data
> set (less than 40GB, but the storage cluster I'm doing this on only has
> about 200GB of total space, so I'm trying to conserve as much as possible),
> and it's mostly static data (less than 100MB of changes a day, except on
> Sunday when I run backups), so it makes sense that I haven't seen either
> of these issues.

I ran into issue #2 on an 8GB filesystem last weekend.  The lower limit
on filesystem size could be as low as a few megabytes if the extent
references are arranged in *just* the right way.

> The second one sounds like the same performance issue caused by having very
> large numbers of snapshots, and based on what's happening, I don't think
> there's any way we could fix it without rewriting certain core code.

find_parent_nodes is the usual culprit for CPU usage.  Fixing this is
required for in-band dedup as well, so I assume someone has it on their
roadmap and will get it done eventually.
