On 2015-07-11 11:24, Duncan wrote:
> I'm not a coder, only a list regular and btrfs user, and I'm not sure
> on this, but there have been several reports of this nature on the
> list recently, and I have a theory. Maybe the devs can step in and
> either confirm or shoot it down.

While I am a coder, I'm not a BTRFS developer, so what I say below may
still be incorrect.
[...trimmed for brevity...]
> Of course during normal use, files get deleted as well, thereby
> clearing space in existing chunks. But this space will be fragmented,
> with a mix of unallocated extents and still-remaining files. The
> allocator will, I /believe/ (this is where people who can actually
> read the code come in), try to use up space in existing chunks before
> allocating additional space, possibly subject to some reasonable
> minimum extent size, below which btrfs will simply allocate another
> chunk.
AFAICT, this is in fact the case.
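That minimum-extent-size guess can be sketched as a toy allocation policy. This is purely illustrative Python, not the actual btrfs allocator; the 16 MiB threshold and 1 GiB chunk size are made-up assumptions:

```python
# Toy allocation policy (NOT btrfs code): prefer free extents in already
# allocated chunks, and fall back to allocating a new chunk only when
# even the largest free extent is below some minimum size. The threshold
# and chunk size here are illustrative assumptions, not real constants.

CHUNK_SIZE = 1024  # MiB, nominal data chunk size (assumption)

def allocate(write_size, free_extents, min_extent=16):
    """Return where the next extent of a write would land (sizes in MiB)."""
    best = max(free_extents, default=0)
    if best >= min_extent:
        # Reuse space inside an existing chunk, even if it only fits part
        # of the write; the remainder would repeat this decision.
        return ("existing", min(best, write_size))
    return ("new_chunk", min(CHUNK_SIZE, write_size))

print(allocate(256, [64, 8, 8]))  # largest hole is big enough: reuse it
print(allocate(256, [8, 8, 4]))   # all holes below threshold: new chunk
```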
> 1) Prioritize reduced fragmentation, at the expense of higher data
> chunk allocation. In the extreme, this would mean always allocating a
> new chunk and using it whenever the file (or the remainder of the file
> not yet defragged) is larger than the largest free extent in existing
> data chunks. The problem with this is that over time the number of
> partially used data chunks goes up, as new ones are allocated to
> defrag into while sub-1-GiB files that are already defragged stay
> where they are. Of course a balance can help here, by combining
> multiple partial chunks into fewer full chunks, but unless a balance
> is run...
>
> 2) Prioritize chunk utilization, at the expense of leaving some
> fragmentation despite massive amounts of unallocated space. This is
> what I've begun to suspect defrag does. With a bunch of free but
> fragmented space in existing chunks, defrag could actually increase
> fragmentation: the free space in existing chunks is so fragmented that
> a rewrite is forced to use more, smaller extents, because that's all
> there is free, until another chunk is allocated. As I mentioned above
> for normal file allocation, it's quite possible that there's some
> minimum extent size (greater than the bare minimum 4 KiB block size)
> at which the allocator will give up and allocate a new data chunk, but
> if so, perhaps that size needs to be bumped upward, as it seems a bit
> low today.

If I'm reading the code correctly, defrag does indeed try to avoid
allocating a new chunk if at all possible.
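The trade-off between the two strategies can be made concrete with a toy model. This is illustrative Python with an invented free-space layout, not btrfs code: rewriting a file into fragmented free space lands it in many small extents, where a fresh chunk would hold it in one:

```python
# Toy model (NOT btrfs code): how many extents a 256 MiB rewrite ends up
# in, depending on strategy. Sizes in MiB; the free-extent layout is an
# invented example of fragmented space inside allocated data chunks.

def extents_used(size, free_extents):
    """Greedy largest-hole-first placement; returns the extent sizes used."""
    used, remaining = [], size
    for hole in sorted(free_extents, reverse=True):
        if remaining == 0:
            break
        take = min(hole, remaining)
        used.append(take)
        remaining -= take
    assert remaining == 0, "would spill into a new chunk"
    return used

fragmented_free = [64, 48, 32, 32, 16, 16, 16, 16, 8, 8]  # sums to 256

# Strategy 2 (the suspected behaviour): fill holes in existing chunks.
in_place = extents_used(256, fragmented_free)
# Strategy 1: allocate a fresh 1 GiB chunk, one contiguous extent.
fresh = extents_used(256, [1024])

print(len(in_place), "extents vs", len(fresh))
```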
> Meanwhile, there are a number of exacerbating factors to consider as
> well.
>
> * Snapshots and other shared references lock extents in place. Defrag
> doesn't touch anything but the subvolume it's actually pointed at.
> Other subvolumes and shared-reference files will continue to keep the
> extents they reference locked in place. And while COW will rewrite
> blocks of a file, the old extent remains locked until all references
> to it are cleared -- the entire file (or at least all blocks that were
> in that extent) must be rewritten, and no snapshots or other
> references to it remain, before it can be freed.
>
> For a few kernel cycles btrfs had snapshot-aware defrag, but that
> implementation didn't scale well at all, so it was disabled until it
> could be rewritten, and that rewrite hasn't happened yet. So
> snapshot-aware defrag remains disabled, and defrag only works on the
> subvolume it's actually pointed at. As a result, if defrag rewrites a
> snapshotted file, it actually doubles the space that file takes, as it
> makes a new copy, breaking the reference link between it and the copy
> in the snapshot. Of course, with that space not freed up, this will,
> over time, tend to fragment the space that /is/ freed even more
> heavily.

To mitigate this, one can run offline data deduplication (duperemove is
the tool I'd suggest for this), although there are caveats to doing
that as well.
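The extent-pinning behaviour described above can be sketched with a toy reference-count model. This is illustrative Python with invented extent names and holders, not how btrfs actually tracks back-references:

```python
# Toy model (NOT btrfs code): an extent stays allocated while any
# subvolume or snapshot still references it, so a non-snapshot-aware
# defrag that rewrites the live copy leaves the old extents pinned by
# the snapshot, roughly doubling the space the file takes.

extents = {"old1": {"subvol", "snap"}, "old2": {"subvol", "snap"}}

def defrag_subvol(refs):
    """Rewrite the subvolume's data into a new extent; drop its old refs."""
    after = {}
    for name, holders in refs.items():
        remaining = holders - {"subvol"}
        if remaining:            # snapshot still pins the old extent
            after[name] = remaining
    after["new1"] = {"subvol"}   # the freshly written, defragged copy
    return after

after = defrag_subvol(extents)
# Old extents stay allocated alongside the new copy.
print(sorted(after), "->", len(after), "extents allocated")
```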
> * Chunk reclamation. This is the relatively new development that I
> think is triggering the surge in "defrag not defragging" reports we're
> seeing now. Until quite recently, btrfs could allocate new chunks, but
> it couldn't, on its own, deallocate empty ones. What tended to happen
> over time was that people would find all the filesystem space taken up
> by empty or mostly empty data chunks, and btrfs would start spitting
> ENOSPC errors when it needed to allocate new metadata chunks but
> couldn't, as all the space was tied up in empty data chunks. A balance
> could fix it, often relatively quickly with a -dusage=0 or -dusage=10
> filter or the like, but it was a manual process; btrfs wouldn't do it
> on its own.
>
> Recently the devs (mostly) fixed that, and btrfs will now
> automatically reclaim entirely empty chunks on its own. It still
> doesn't reclaim partially empty chunks automatically -- a manual
> balance must still be used to combine multiple partially empty chunks
> into fewer full chunks -- but it does well enough to make the previous
> problem pretty rare; we no longer see the hundreds of GiB of empty
> data chunks we used to. That fixed the one problem, but if my theory
> is correct, it exacerbated the defrag issue, which I think was there
> before but so seldom triggered that it generally wasn't noticed.
>
> What I believe is happening now, compared to before, based on the rash
> of reports we're seeing: before, space fragmentation in allocated data
> chunks seldom became an issue, because people tended to accumulate all
> these extra empty data chunks, leaving defrag all that unfragmented
> empty space to rewrite the new extents into. But now all those empty
> data chunks are reclaimed, leaving defrag only the heavily
> space-fragmented, partially used chunks. So now we're getting all
> these reports of defrag actually making the problem worse, not better!

I believe that this is in fact the root cause. Personally, I would love
to be able to turn this off without having to patch the kernel. Since
it went in, not only does it (apparently) cause issues with defrag, but
DISCARD/TRIM support is broken, and most of my (heavily rewritten)
filesystems are running noticeably slower as well. I'm going to start a
discussion about this in another thread, however, as it doesn't just
affect defrag.
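If the theory is right, the effect of reclaiming empty chunks on a defrag rewrite can be illustrated with a toy placement model. Again, illustrative Python with invented sizes, not btrfs code:

```python
# Toy model (NOT btrfs code): the same 500 MiB defrag rewrite, before
# and after automatic reclamation of empty data chunks. Sizes in MiB.

def extent_count(size, free_extents):
    """Greedy largest-hole-first placement; number of extents used."""
    count, remaining = 0, size
    for hole in sorted(free_extents, reverse=True):
        if remaining == 0:
            break
        count += 1
        remaining -= min(hole, remaining)
    assert remaining == 0, "would spill into a new chunk"
    return count

holes = [100, 80, 60, 60, 50, 50, 50, 50]  # fragmented partly-used chunks
empty = [1024]                             # a leftover, fully empty chunk

n_before = extent_count(500, empty + holes)  # old behaviour: empty chunk kept
n_after = extent_count(500, holes)           # new behaviour: empty chunk gone

print("before reclamation:", n_before, "extent(s); after:", n_after)
```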