On 2015-07-11 11:24, Duncan wrote:
I'm not a coder, only a list regular and btrfs user, and I'm not sure on
this, but there have been several reports of this nature on the list
recently, and I have a theory.  Maybe the devs can step in and either
confirm or shoot it down.
While I am a coder, I'm not a BTRFS developer, so what I say below may still be incorrect.

[...trimmed for brevity...]
Of course during normal use, files get deleted as well, thereby clearing
space in existing chunks.  But this space will be fragmented, with a mix
of unallocated extents and still remaining files.  The allocator will I
/believe/ (this is where people who can actually read the code come in)
try to use up space in existing chunks before allocating additional
space, possibly subject to some reasonable extent minimum size, below
which btrfs will simply allocate another chunk.
AFAICT, this is in fact the case.

1) Prioritize reduced fragmentation, at the expense of higher data chunk
allocation.  In the extreme, this would mean always choosing to allocate
a new chunk and use it if the file (or remainder of the file not yet
defragged) was larger than the largest free extent in existing data
chunks.

The problem with this is that over time, the number of partially used
data chunks goes up as new ones are allocated to defrag into, but sub-1
GiB files that are already defragged are left where they are.  Of course
a balance can help here, by combining multiple partial chunks into fewer
full chunks, but unless a balance is run...

2) Prioritize chunk utilization, at the expense of leaving some
fragmentation, despite massive amounts of unallocated space.

This is what I've begun to suspect defrag does.  With a bunch of free but
fragmented space in existing chunks, defrag could actually increase
fragmentation, as the space in existing chunks is so fragmented a rewrite
is forced to use more, smaller extents, because that's all there is free,
until another chunk is allocated.

As I mentioned above for normal file allocation, it's quite possible that
there's some minimum extent size (greater than the bare minimum 4 KiB
block size) where the allocator will give up and allocate a new data
chunk, but if so, perhaps this size needs to be bumped upward, as it
seems a bit low today.
If I'm reading the code correctly, defrag does indeed try to avoid allocating a new chunk if at all possible.
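
To make the trade-off between options 1) and 2) concrete, here is a toy model in Python. It is only a sketch under stated assumptions, not how the btrfs allocator is actually written: the function name, the min_extent threshold, and the largest-hole-first ordering are all made up for illustration, and sizes are in MiB.

```python
# Toy model only: sizes are in MiB, and none of these names exist in btrfs.
CHUNK_SIZE = 1024  # nominal 1 GiB data chunk


def allocate(free_extents, size, min_extent=0):
    """Place `size` MiB and return (pieces, new_chunks).

    free_extents: free extent sizes inside existing chunks (mutated in place).
    min_extent:   free extents smaller than this are ignored; this stands in
                  for the hypothetical minimum extent size discussed above.
    """
    pieces = []
    new_chunks = 0
    remaining = size
    # Try the largest usable holes first, to produce as few pieces as possible.
    usable = sorted((e for e in free_extents if e >= min_extent), reverse=True)
    for hole in usable:
        if remaining == 0:
            break
        take = min(hole, remaining)
        pieces.append(take)
        remaining -= take
        free_extents.remove(hole)
        if hole > take:
            free_extents.append(hole - take)
    # Whatever did not fit into existing holes goes into fresh chunks.
    while remaining > 0:
        take = min(CHUNK_SIZE, remaining)
        pieces.append(take)
        remaining -= take
        new_chunks += 1
        if take < CHUNK_SIZE:
            free_extents.append(CHUNK_SIZE - take)
    return pieces, new_chunks


holes = [4, 8, 16, 4, 32, 8]  # 72 MiB free, but badly fragmented
print(allocate(list(holes), 64))                 # option 2: fill existing holes
print(allocate(list(holes), 64, min_extent=64))  # option 1: new chunk instead
```

With no threshold, the 64 MiB rewrite lands in four pieces and no new chunk is allocated. With the threshold raised to 64 MiB, it lands in one piece at the cost of one new, mostly empty chunk.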


Meanwhile, there are a number of exacerbating factors to consider as well.

* Snapshots and other shared references lock extents in place.

Defrag doesn't touch anything but the subvolume it's actually pointed at
for the defrag.  Other subvolumes and shared-reference files will
continue to keep the extents they reference locked in place.  And COW
will rewrite blocks of a file, but the old reference extent remains
locked, until all references to it are cleared -- the entire file (or at
least all blocks that were in that extent) must be rewritten, and no
snapshots or other references to it remain, before it can be freed.

For a few kernel cycles btrfs had snapshot-aware-defrag, but that
implementation didn't scale well at all, so it was disabled until it
could be rewritten, and that rewrite hasn't occurred yet.  So snapshot-
aware-defrag remains disabled, and defrag only works on the subvolume
it's actually pointed at.

As a result, if defrag rewrites a snapshotted file, it actually doubles
the space that file takes, as it makes a new copy, breaking the reference
link between it and the copy in the snapshot.
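
The doubling can be shown with a small reference-counting sketch in Python. This is a toy model that assumes an extent is freed only once nothing references it; the extent names and the "subvol"/"snap" labels are hypothetical, and sizes are in MiB.

```python
# Toy model only: an extent stays allocated while anything still references it.
def space_used(extent_refs, extent_size):
    """Total space held by extents that still have at least one reference."""
    return sum(extent_size[e] for e, refs in extent_refs.items() if refs)


# A 100 MiB file stored as four 25 MiB extents, shared with one snapshot.
size = {"e1": 25, "e2": 25, "e3": 25, "e4": 25}
refs = {e: {"subvol", "snap"} for e in size}
before = space_used(refs, size)

# Defrag on the subvolume writes one new contiguous extent and drops only the
# subvolume's references; the snapshot keeps the old extents pinned.
size["new"] = 100
refs["new"] = {"subvol"}
for e in ("e1", "e2", "e3", "e4"):
    refs[e].discard("subvol")
after = space_used(refs, size)

print(before, after)  # 100 200
```

The old extents only go away once the snapshot's references are dropped as well.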

Of course, with the space not freed up, this will, over time, tend to
fragment space that is freed even more heavily.
To mitigate this, one can run offline data deduplication (duperemove is the tool I'd suggest for this), although there are caveats to doing that as well.

* Chunk reclamation.

This is the relatively new development that I think is triggering the
surge in defrag not defragging reports we're seeing now.

Until quite recently, btrfs could allocate new chunks, but it couldn't,
on its own, deallocate empty chunks.  What tended to happen over time was
that people would find all the filesystem space taken up by empty or
mostly empty data chunks, and btrfs would start spitting ENOSPC errors
when it needed to allocate new metadata chunks but couldn't, as all the
space was in empty data chunks.  A balance could fix it, often relatively
quickly with a -dusage=0 or -dusage=10 filter or the like, but it was a
manual process; btrfs wouldn't do it on its own.
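
For reference, such a filtered balance is invoked as "btrfs balance start -dusage=10 <mountpoint>". What the usage filter does can be sketched with a toy model in Python. It assumes, as a simplification, that the data from every selected chunk is simply repacked into as few fresh chunks as possible; the function name is made up, and each chunk is represented only by its used space in MiB.

```python
# Toy model only: each chunk is represented by its used space in MiB.
CHUNK_SIZE = 1024  # nominal 1 GiB data chunk


def balance_usage(chunks, usage_pct):
    """Repack chunks whose usage is below `usage_pct` percent.

    A filter value of 0 selects only completely empty chunks.
    """
    def selected(used):
        if usage_pct == 0:
            return used == 0
        return used * 100 < usage_pct * CHUNK_SIZE

    keep = [u for u in chunks if not selected(u)]
    moved = sum(u for u in chunks if selected(u))
    # Rewrite the selected data into as few new chunks as possible.
    while moved > 0:
        take = min(CHUNK_SIZE, moved)
        keep.append(take)
        moved -= take
    return keep


chunks = [0, 0, 50, 80, 1000, 0, 60]
print(balance_usage(chunks, 0))   # only the empty chunks disappear
print(balance_usage(chunks, 10))  # sparse chunks are merged as well
```

In this example the usage=0 pass drops the three empty chunks and leaves four behind, while the usage=10 pass leaves only two: the nearly full one and one new chunk holding the 190 MiB that was scattered across the sparse ones.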

Recently the devs (mostly) fixed that, and btrfs will automatically
reclaim entirely empty chunks on its own now.  It still doesn't reclaim
partially empty chunks automatically; a manual rebalance must still be
used to combine multiple partially empty chunks into fewer full chunks;
but it does well enough to make the previous problem pretty rare -- we
don't see the hundreds of GiB of empty data chunks allocated any more,
like we used to.

Which fixed the one problem, but if my theory is correct, it exacerbated
the defrag issue, which I think was there before but seldom triggered so
it generally wasn't noticed.

What I believe is happening now compared to before, based on the rash of
reports we're seeing, is that before, space fragmentation in allocated
data chunks seldom became an issue, because people tended to accumulate
all these extra empty data chunks, leaving defrag all that unfragmented
empty space to rewrite the new extents into as it did the defrag.

But now, all those empty data chunks are reclaimed, leaving defrag only
the heavily space-fragmented partially used chunks.  So now we're getting
all these reports of defrag actually making the problem worse, not better!
I believe that this is in fact the root cause. Personally, I would love to be able to turn automatic chunk reclamation off without having to patch the kernel. Since it went in, not only does it (apparently) cause issues with defrag, but DISCARD/TRIM support is broken, and most of my (heavily rewritten) filesystems are running noticeably slower as well. I'm going to start a discussion about this in a separate thread, however, as it doesn't just affect defrag.

