On 2014-12-08 09:47, Martin Steigerwald wrote:
Hi,

On Sunday, 7 December 2014, 21:32:01, Robert White wrote:
On 12/07/2014 07:40 AM, Martin Steigerwald wrote:
Well, what I bet would be possible is a kind of system call like
this:

I need to write 5 GB of data across 100 files to /opt/mynewshinysoftware;
can I do it, *and* give me a guarantee that I can.

So, a more flexible fallocate approach, since fallocate just allocates one
file and you would need to run it for every file you intend to create. The
challenge, though, would be to estimate the metadata allocation accurately
beforehand. A userspace approximation of the idea is sketched below.
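
A minimal sketch of that idea in userspace, assuming hypothetical file
names and sizes, and reserving data blocks per file with posix_fallocate()
(note it does nothing about the metadata problem raised above):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

struct planned_file { const char *path; off_t size; };

/* Reserve space for every planned file up front; on any failure,
 * remove everything created so far and report the error. */
static int reserve_all(const struct planned_file *files, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int fd = open(files[i].path, O_CREAT | O_WRONLY, 0644);
        /* posix_fallocate() returns an errno value, not -1 + errno */
        int err = (fd < 0) ? errno : posix_fallocate(fd, 0, files[i].size);
        if (fd >= 0)
            close(fd);
        if (err != 0) {
            for (size_t j = 0; j <= i; j++)
                unlink(files[j].path); /* ENOENT here is harmless */
            return err;
        }
    }
    return 0;
}

int main(void)
{
    /* Hypothetical payload standing in for "5 GB across 100 files". */
    struct planned_file files[] = {
        { "/opt/mynewshinysoftware/a.bin", 50L * 1024 * 1024 },
        { "/opt/mynewshinysoftware/b.bin", 50L * 1024 * 1024 },
    };
    int err = reserve_all(files, sizeof(files) / sizeof(files[0]));
    if (err != 0)
        fprintf(stderr, "reservation failed: %d\n", err);
    return err != 0;
}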

Or have a tar --fallocate -xf which, for every file in the archive, first
calls fallocate and only actually writes the file if that succeeded. But
due to the nature of tar archives, with their content listing spread
across the whole archive, this means it may have to read the tar archive
twice, so ZIP archives, which keep a central directory, might be better
suited for that. A sketch of the preallocation pass follows below.
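
A sketch of the first pass of such a two-pass extraction, using libarchive
to walk only the entry headers and preallocate each regular file before
any data is written (error handling trimmed, parent directories assumed
to exist already, and the second, normal extraction pass omitted):

#include <archive.h>
#include <archive_entry.h>
#include <fcntl.h>
#include <unistd.h>

/* Pass 1: read only the headers and fallocate every regular file's
 * full size.  If any reservation fails, extraction is abandoned
 * before a single data byte has been written. */
int preallocate_pass(const char *archive_path)
{
    struct archive *a = archive_read_new();
    struct archive_entry *entry;

    archive_read_support_format_tar(a);
    if (archive_read_open_filename(a, archive_path, 10240) != ARCHIVE_OK)
        return -1;
    while (archive_read_next_header(a, &entry) == ARCHIVE_OK) {
        if (archive_entry_filetype(entry) != AE_IFREG)
            continue;
        int fd = open(archive_entry_pathname(entry),
                      O_CREAT | O_WRONLY, 0644);
        int err = (fd < 0) ? -1
                : posix_fallocate(fd, 0, archive_entry_size(entry));
        if (fd >= 0)
            close(fd);
        if (err != 0) {         /* out of space (or other error): bail */
            archive_read_free(a);
            return -1;
        }
        /* entry data is skipped automatically by the next header read */
    }
    archive_read_free(a);
    return 0;   /* pass 2: reopen the archive and extract normally */
}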

What you suggest is Still Not Practical™ (the tar thing might be somewhat
workable if you were willing to analyze every file down to the byte level).

Compression _can_ make a file _bigger_ than its base size. BTRFS decides
whether or not to compress a file based on the results it gets when
trying to compress the first N bytes. (I do not know the value of N.) But
it is _easy_ to have a file where the first N bytes compress well but
the bytes after N take up more space than their byte count. So to
fallocate() the right size in blocks you'd have to compress the input
yourself, determine what BTRFS _would_ _do_, and then allocate that much
space instead of the file size.
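
A toy illustration of that kind of sampling heuristic, using zlib; the
sample size and the "did it shrink?" test are made-up stand-ins, not the
actual BTRFS logic:

#include <stdlib.h>
#include <zlib.h>

/* Compress the first n sample bytes and call the file compressible
 * only if the sample actually shrank.  As noted above, the bytes
 * after the sample may still expand, so this proves nothing about
 * the file's final on-disk size. */
static int looks_compressible(const unsigned char *sample, uLong n)
{
    uLongf out_len = compressBound(n);
    unsigned char *out = malloc(out_len);
    if (out == NULL)
        return 0;
    int shrank = compress2(out, &out_len, sample, n, Z_BEST_SPEED) == Z_OK
                 && out_len < n;
    free(out);
    return shrank;
}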

And even then, if you hadn't yet created all the names and directories,
you might find that the RBtree had to expand (allocate another tree node)
one or more times to accommodate the actual files. Lather, rinse, repeat
for any checksum trees, for anything hitting a flush barrier because of
commit= or sync() events, and for other writers perturbing your results;
after all, this only matters when the filesystem is nearly full, and
nearly full filesystems may not be quiescent at all.

So while the core problem isn't insoluble, in real life it is _not_
_worth_ _solving_.

On a nearly empty filesystem, it's going to fit.

On a reasonably empty filesystem, it's going to fit.

On a nearly full filesystem, it may or may not fit.

On a filesystem that is so close to full that you have reason to doubt
it will fit, you are going to have a very bad time even if it fits.

If you did manage to invent and implement an fallocate algorithm that
could make this promise and make it stick, then some other running
program is what's going to crash when you use up that last byte anyway.

Almost full filesystems are their own reward.

So you are basically saying that BTRFS with compression does not meet the
fallocate guarantee. Now that's interesting, because it basically violates
the documentation for the system call:

DESCRIPTION
        The function posix_fallocate() ensures that disk space is
        allocated for the file referred to by the descriptor fd for the
        bytes in the range starting at offset and continuing for len
        bytes. After a successful call to posix_fallocate(), subsequent
        writes to bytes in the specified range are guaranteed not to
        fail because of lack of disk space.

So in order to be standards compliant there, BTRFS would need to write
fallocated files uncompressed… wow, this is getting complex.
The other option would be to allocate based on the worst-case size
increase for the compression algorithm (which works out to about 5% IIRC
for zlib, and a bit more for lzo), and then possibly discard the unwritten
extents at some later point. A worked check of zlib's own published bound
is below.
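
For zlib specifically, the library exposes its formal worst-case bound for
a single compress() call via compressBound(); a quick check for a 128 KiB
chunk (a size assumed here purely for illustration) prints the overhead it
guarantees:

#include <stdio.h>
#include <zlib.h>

int main(void)
{
    /* compressBound() reports the worst case zlib guarantees for
     * one compress() call on an input of this size. */
    uLong chunk = 128UL * 1024;
    uLong bound = compressBound(chunk);
    printf("%lu bytes in -> at most %lu bytes out (+%lu bytes, %.3f%%)\n",
           chunk, bound, bound - chunk,
           100.0 * (double)(bound - chunk) / (double)chunk);
    return 0;
}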

