On 2017-08-02 00:14, Duncan wrote:
Austin S. Hemmelgarn posted on Tue, 01 Aug 2017 10:47:30 -0400 as
excerpted:
I think I _might_ understand what's going on here. Is that test program
calling fallocate using the desired total size of the file, or just
trying to allocate the range beyond the end to extend the file? I've
seen issues with the first case on BTRFS before, and I'm starting to
think that it might actually be trying to allocate the exact amount of
space requested by fallocate, even if part of the range is already
allocated space.
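One rough way to test that hypothesis is to compare free space on the filesystem just before and just after a single fallocate call, once over the whole file and once over only the tail. The sketch below is a hypothetical diagnostic, not the test program under discussion. `free_bytes` and `measure_fallocate_cost` are made-up names. It also assumes the filesystem reflects the reservation in statvfs by the time fsync returns, which may not hold on every filesystem. Other activity on the same filesystem will skew the numbers, so treat them as approximate.

```python
import os


def free_bytes(path: str) -> int:
    """Space available to unprivileged users on the filesystem holding path."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize


def measure_fallocate_cost(path: str, offset: int, length: int) -> int:
    """Approximate drop in free space caused by one fallocate(offset, length).

    Hypothetical diagnostic: assumes nothing else on the filesystem is
    allocating or freeing space while it runs.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        os.fsync(fd)  # flush earlier writes so they don't land in the delta
        before = free_bytes(path)
        os.posix_fallocate(fd, offset, length)
        os.fsync(fd)
        after = free_bytes(path)
    finally:
        os.close(fd)
    return before - after
```

Comparing `measure_fallocate_cost(path, 0, total)` against `measure_fallocate_cost(path, current_size, total - current_size)` on otherwise identical files would show whether the first form is charged for the already-written prefix.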
If I've interpreted correctly (not being a dev, only a btrfs user,
sysadmin, and list regular) previous discussions I've seen on this list...
That's exactly what it's doing, and it's _intended_ behavior.
The reasoning is something like this: fallocate is supposed to pre-
allocate some space with the intent being that writes into that space
won't fail, because the space is already allocated.
For an existing file with some data already in it, ext4 and xfs do that
while counting the existing space toward the requested allocation.
But btrfs is copy-on-write, meaning it's going to have to write the new
data to a different location than the existing data, and it may well not
free up the existing allocation (if even a single 4k block of the
existing allocation remains unwritten, it will remain to hold down the
entire previous allocation, which isn't released until *none* of it is
still in use -- of course in normal usage "in use" can be due to old
snapshots or other reflinks to the same extent, as well, tho in these
test cases it's not).
So in order to provide the guarantee that writes to preallocated space
won't hit ENOSPC, btrfs can't count currently used space as part of the
fallocate.
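A toy model of the accounting described above may make the difference concrete. This is a simplification, not btrfs, ext4, or xfs code. It assumes every byte below the current file size is already-allocated data and everything past it is unallocated, with no holes, no shared extents, and no metadata overhead. The function name and the `cow` flag are hypothetical.

```python
def new_space_to_reserve(file_size: int, offset: int, length: int,
                         cow: bool) -> int:
    """Bytes that fallocate(offset, length) must newly reserve (toy model).

    Assumes [0, file_size) is fully allocated data and [file_size, ...)
    is unallocated.
    """
    end = offset + length
    # Part of the requested range that overlaps existing data.
    overlap = max(0, min(end, file_size) - offset)
    if cow:
        # COW worst case: overwriting the overlap writes new extents
        # elsewhere while the old ones may stay pinned, so reserve it all.
        return length
    # In-place overwrite: existing blocks are reused, so only the
    # non-overlapping part needs new space.
    return length - overlap
```

For a 1 GiB file and a request covering 0 to 1.5 GiB, the in-place model reserves 0.5 GiB while the COW worst case reserves the full 1.5 GiB. A request that starts at end of file costs the same under both models.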
The different behavior is entirely due to btrfs being COW, and thus a
choice has to be made: do we worst-case fallocate-reserve for writes
over currently used data that will have to be COWed elsewhere, possibly
without freeing the existing extents because there's still something
referencing them, or do we risk ENOSPCing on write to a previously
fallocated area?
The choice was to worst-case-reserve and take the ENOSPC risk at fallocate
time, so the write into that fallocated space could then proceed without
the ENOSPC risk that COW would otherwise imply.
Make sense, or is my understanding a horrible misunderstanding? =:^)
Your reasoning is sound, except for the fact that at least on older
kernels (not sure if this is still the case), BTRFS will still perform a
COW operation when updating a fallocate'ed region.
So if you're actually only appending, fallocate the /additional/ space,
not the /entire/ space, and you'll get what you need. But if you're
potentially overwriting what's there already, better fallocate the entire
space, which triggers the btrfs worst-case allocation behavior you see,
in order to guarantee it won't ENOSPC during the actual write.
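That advice can be sketched as a small helper. This is a minimal illustration that assumes the caller knows in advance whether existing data may be rewritten. `preallocate_to` and its `may_overwrite_existing` parameter are hypothetical names. `os.posix_fallocate` is Python's wrapper around posix_fallocate(3), which extends the file size when the range reaches past end of file.

```python
import os
from typing import Optional, Tuple


def preallocate_to(fd: int, target_size: int,
                   may_overwrite_existing: bool) -> Optional[Tuple[int, int]]:
    """Preallocate so the file can grow to target_size bytes.

    Returns the (offset, length) passed to posix_fallocate, or None if
    no call was needed.
    """
    current_size = os.fstat(fd).st_size
    if may_overwrite_existing:
        # Existing data may be rewritten; on a COW filesystem those
        # rewrites need fresh space, so request the entire range.
        # Per the discussion, this triggers btrfs's worst-case reservation.
        start = 0
    else:
        # Append-only: existing bytes are never rewritten, so request
        # only the additional space past the current end of file.
        start = current_size
    if target_size <= start:
        return None
    os.posix_fallocate(fd, start, target_size - start)
    return (start, target_size - start)
```

The helper only chooses which range to request. On a non-COW filesystem both branches should end up with essentially the same allocation; the choice matters when the filesystem has to reserve for rewrites.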
Of course the only time the behavior actually differs is with COW, but
then there's a BIG difference, but that BIG difference has a GOOD BIG
reason! =:^)
Tho that difference will certainly necessitate some relearning of the
/correct/ way to do it, for devs who were doing it the COW-worst-case way
all along, even if they didn't actually need to, because it didn't happen
to make a difference on what they happened to be testing on, which
happened not to be COW...
Reminds me of the way newer versions of gcc, and trying to build with
clang as well, tend to trigger relearning, because newer versions are
stricter in order to allow better optimization, and other
implementations are simply different in what they're strict on, /because/
they're a different implementation. Well, btrfs is stricter... because
it's a different implementation that /has/ to be stricter... due to COW.
Except that that strictness breaks userspace programs that are doing
perfectly reasonable things.