On Fri, Jan 09, 2015 at 04:41:03PM +0100, David Sterba wrote:
> On Thu, Jan 08, 2015 at 01:36:21PM -0500, Zygo Blaxell wrote:
> > Hmmm...it seems the handwaving about tail-packing that I was previously
> > ignoring is important after all.
> > 
> > A few quick tests with filefrag show that btrfs isn't doing full
> > tail-packing, only small file allocation (i.e. files smaller than 4096
> > bytes get stored inline, and nothing else does, not even sparse files
> > with a single 1-byte extent at offset != 0).  Thus the inline storage
> > avoids fragmentation only to the minimum extent possible.
> 
> That's right, btrfs does not do the reiserfs-style tail packing, and
> IMHO never will. That brings more code complexity than it's
> worth in the end.

If the file has been fallocated past EOF, it may make sense to do the
extra work of maintaining a tail fragment in metadata until it's bigger
than a block, and therefore large enough to write to the fallocated
extent.  At least in that case the application has explicitly asked
the filesystem for more optimization than in the general append case.
Otherwise, what are fallocations past EOF for?

If the application always appends in block-sized (4K) units, everything
is fine, but that requirement might not work for journald, and doesn't
work for rsyslog, mboxes, and many other long-running small-write use
cases that append in non-block-sized units.

On the other hand...it could be easier to handle such cases with a special
case of autodefrag--one that focuses on appends, so it could be enabled
by default earlier than autodefrag's other, more problematic use cases.
It may even be faster to defragment in small batches (coalescing a few
hundred blocks at a time near the end of file) than to do tail-packing
on every append, especially if metadata blocks have much more overhead
than data blocks (e.g. dup metadata with single data on spinning rust).
The fallocate would be wasted in this case, but the number of fragments
in the final file would be reasonably sane.

> > > > This would work on ext4, xfs, and others, and provide the same benefit
> > > > (or even better) without filesystem-specific code.  journald would
> > > > preallocate a contiguous chunk past the end of the file for appends,
> > > > and
> > > 
> > > That's precisely what we do. But journald's write pattern is not
> > > purely appending to files, it's "append something to the end, then
> > > link it up in the beginning". And for the "append" part we are
> > > fine with fallocate(). It's the "link up" part that completely fucks
> > > up fragmentation so far.
> > 
> > Wrong theory but same result.  The writes at the beginning just keep
> > replacing a single extent over and over, which has a worst-case effect
> > of adding a single fragment to the beginning of a file that would not
> > otherwise be fragmented.  The appends are causing fragmentation all
> > by themselves.  :-P
> 
> OTOH, the appending write and the header rewrite happen at roughly same
> time so the actual block allocations may end up close to each other as
> well. But yes, one cannot rely on that.

The header rewrite is close to the last append, but that's not really
useful.  There will be one header near one appending write, but there are
also thousands of other appending writes separated in time and space on
the disk, even after fallocate preallocated contiguous space for the file.
