On 2017-08-02 21:10, Austin S. Hemmelgarn wrote:
> On 2017-08-02 13:52, Goffredo Baroncelli wrote:
>> Hi,
>>
[...]

>> consider the following scenario:
>>
>> a) create a 2GB file
>> b) fallocate -o 1GB -l 2GB
>> c) write from 1GB to 3GB
>>
>> after b), the expectation is that c) always succeeds [1]: i.e. there is 
>> enough space on the filesystem. Due to the COW nature of BTRFS, you cannot 
>> rely on the already allocated space because there could be a small time 
>> window where both the old and the new data exist on the disk.

> There is also an expectation based on pretty much every other FS in existence 
> that calling fallocate() on a range that is already in use is a (possibly 
> expensive) no-op, and by extension using fallocate() with an offset of 0 like 
> an ftruncate() call will succeed as long as the new size will fit.

The man page of fallocate doesn't guarantee that.

Unfortunately, in a COW filesystem the assumption that an allocated area may 
simply be overwritten is not true. 

Let me say it in other words: as a general rule, if you want to _write_ 
something in a COW filesystem, you need space. It doesn't matter whether you 
are *over-writing* existing data or *appending* to a file.
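
As a rough illustration of this rule, here is a toy model (not btrfs code; 
the function name new_space_needed and its arguments are made up for this 
example). It computes how much new data space a write needs on an 
overwrite-in-place filesystem versus a COW one, ignoring metadata, nocow 
files, and the space released once the old extents are freed:

```shell
# Hypothetical model: new data space (in MiB) needed to write LEN MiB at
# offset OFF into a file whose first ALLOC MiB are already allocated.
new_space_needed() {
    local fs_type=$1 alloc=$2 off=$3 len=$4
    if [ "$fs_type" = cow ]; then
        # COW: every written byte lands in a newly allocated extent,
        # whether it overwrites old data or appends.
        echo "$len"
    else
        # In-place: only the part of the write past the allocated range is new.
        local end=$((off + len))
        local start=$(( off > alloc ? off : alloc ))
        echo $(( end > start ? end - start : 0 ))
    fi
}

# The scenario quoted above: a 2GB file, write from 1GB to 3GB.
new_space_needed inplace 2048 1024 2048
new_space_needed cow     2048 1024 2048
```

In this model the in-place filesystem needs space only for the 1GB that 
extends the file, while the COW filesystem needs space for the whole 2GB 
being written.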


> 
> I've checked JFS, XFS, ext4, vfat, NTFS (via NTFS-3G, not the kernel driver), 
> NILFS2, OCFS2 (local mode only), F2FS, UFS, and HFS+ on Linux, UFS and HFS+ 
> on OS X, UFS and ZFS on FreeBSD, FFS (UFS with a different name) and LFS (log 
> structured) on NetBSD, and UFS and ZFS on Solaris, and VxFS on HP-UX, and 
> _all_ of them behave correctly here and succeed with the test I listed, while 
> BTRFS does not.  This isn't codified in POSIX, but it's also not something 
> that is listed as implementation defined, which in turn means that we should 
> be trying to match the other implementations.

[...]

> 
>>
>> My opinion is that in general this behavior is correct due to the COW nature 
>> of BTRFS.
>> The only exception that I can find is the "nocow" file. For these 
>> cases, taking into account the already allocated space would be better.
> There are other, saner ways to make that expectation hold though, and I'm not 
> even certain that it does as things are implemented (I believe we still CoW 
> unwritten extents when data is written to them, because I _have_ had writes 
> to fallocate'ed files fail on BTRFS before with -ENOSPC).
> 
> The ideal situation IMO is as follows:
> 
> 1. This particular case (using fallocate() with an offset of 0 to extend a 
> file that is already larger than half the remaining free space on the FS) 
> _should_ succeed.  

This description is not accurate. What happened is the following:
1) you have a file *with valid data*
2) you want to prepare an update of this file and want to be sure to have 
enough space

at this point fallocate has to guarantee:
a) you have your old data still available
b) you have allocated the space for the update

In terms of a COW filesystem, you need the space of a) + the space of b)
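
To put numbers on it, here is a small sketch applying this accounting to the 
test in [2] (5G filesystem, 1500MB file, fallocate to 4000MB). It is only 
arithmetic under simplifying assumptions: metadata overhead is ignored, the 
whole 5G is treated as usable for data, and the variable and function names 
are invented for the example. It does not claim to reproduce what the kernel 
reserves internally.

```shell
# Toy accounting for the test in [2]; all sizes in MiB.
# Assumption: metadata overhead is ignored and the whole 5G holds data.
fs_size=$((5 * 1024))   # 5G filesystem
old_data=1500           # a) data already in the file
requested=4000          # b) range passed to the second fallocate

# Succeeds (exit status 0) if the given number of MiB fits in the filesystem.
fits() { [ "$1" -le "$fs_size" ]; }

cow_need=$((old_data + requested))   # a) + b), as described above
reuse_need=$requested                # if the existing 1500 MiB were reused

if fits "$cow_need"; then echo "a)+b) = ${cow_need} MiB: fits"
else echo "a)+b) = ${cow_need} MiB: ENOSPC"; fi

if fits "$reuse_need"; then echo "reuse = ${reuse_need} MiB: fits"
else echo "reuse = ${reuse_need} MiB: ENOSPC"; fi
```

With a) + b) the request is 5500 MiB against 5120 MiB, so it cannot fit; if 
the already allocated 1500 MiB were counted as reusable, 4000 MiB would fit.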


> Short of very convoluted configurations, extending a file with fallocate will 
> not result in over-committing space on a CoW filesystem unless it would 
> extend the file by more than the remaining free space, and therefore barring 
> long external interactions, subsequent writes will also succeed.  Proof of 
> this for a general case is somewhat complicated, but in the very specific 
> case of the script I posted as a reproducer in the other thread about this 
> and the test case I gave in this thread, it's trivial to prove that the 
> writes will succeed.  Either way, the behavior of SnapRAID, while not optimal 
> in this case, is still a legitimate usage (I've seen programs do things like 
> that just to make sure the file isn't sparse).
> 
> 2. Conversion of unwritten extents to written ones should not require new 
> allocation.  Ideally, we need to be allocating not just space for the data, 
> but also reasonable space for the associated metadata when allocating an 
> unwritten extent, and there should be no CoW involved when they are written 
> to except for the small metadata updates required to account the new blocks.  
> Unless we're doing this, then we have edge cases where the above listed 
> expectation does not hold (also note that GlobalReserve does not count IMO, 
> it's supposed to be for temporary usage only and doesn't ever appear to be 
> particularly large).
> 
> 3. There should be some small amount of space reserved globally for not just 
> metadata, but data too, so that a 'full' filesystem can still update existing 
> files reliably.  I'm not sure that we're not doing this already, but AIUI, 
> GlobalReserve is metadata only.  If we do this, we don't have to worry _as 
> much_ about avoiding CoW when converting unwritten extents to regular ones.
>>
>> Comments are welcome.
>>
>> BR
>> G.Baroncelli
>>
>> [1] from man 2 fallocate
>> [...]
>>         After  a  successful call, subsequent writes into the range 
>> specified by offset and len are
>>         guaranteed not to fail because of lack of disk space.
>> [...]
>>
>>
>> [2]
>>
>> -- create a 5G btrfs filesystem
>>
>> # mkdir t1
>> # truncate --size 5G disk
>> # losetup /dev/loop0 disk
>> # mkfs.btrfs /dev/loop0
>> # mount /dev/loop0 t1
>>
>> -- test
>> -- create a 1500 MB file, then expand it to 4000MB
>> -- expected result: the file is 4000MB size
>> -- result: fail: the expansion fails
>>
>> # fallocate -l $((1024*1024*100*15))  file.bin
>> # fallocate -l $((1024*1024*100*40))  file.bin
>> fallocate: fallocate failed: No space left on device
>> # ls -lh file.bin
>> -rw-r--r-- 1 root root 1.5G Aug  2 19:09 file.bin
>>
>>
> 
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5