On Fri, Jan 09, 2015 at 04:41:03PM +0100, David Sterba wrote:
> On Thu, Jan 08, 2015 at 01:36:21PM -0500, Zygo Blaxell wrote:
> > Hmmm...it seems the handwaving about tail-packing that I was
> > previously ignoring is important after all.
> > 
> > A few quick tests with filefrag show that btrfs isn't doing full
> > tail-packing, only small-file allocation (i.e. files smaller than
> > 4096 bytes get stored inline, and nothing else does, not even sparse
> > files with a single 1-byte extent at offset != 0).  Thus the inline
> > storage avoids fragmentation only to the minimum extent possible.
> 
> That's right, btrfs does not do the reiserfs-style tail packing, and
> IMHO never will.  It brings more code complexity than it's worth in
> the end.
If the file has been fallocated past EOF, it may make sense to do the
extra work of maintaining a tail fragment in metadata until it grows
bigger than a block, and is therefore large enough to be written out
to the fallocated extent.  At least in that case the application has
explicitly asked the filesystem for more optimization than in the
general append case--otherwise, what are fallocations past EOF for?
If the application always appends in 4K blocks everything is fine, but
that requirement might not work for journald, and doesn't work for
rsyslog, mboxes, and many other long-running small-write use cases
that append in non-block-sized units.

On the other hand...it could be easier to handle such cases with a
special case of autodefrag--one that focuses on appends, so it can be
enabled by default earlier than the other problematic autodefrag use
cases.  It may even be faster to defragment in small batches
(coalescing a few hundred blocks at a time near the end of the file)
than to do tail-packing on every append, especially if metadata blocks
have much more overhead than data blocks (e.g. dup metadata with
single data on spinning rust).  The fallocated space would be wasted
in this case, but the number of fragments in the final file would be
reasonably sane.

> > > > This would work on ext4, xfs, and others, and provide the same
> > > > benefit (or even better) without filesystem-specific code.
> > > > journald would preallocate a contiguous chunk past the end of
> > > > the file for appends, and
> > > 
> > > That's precisely what we do.  But journald's write pattern is not
> > > purely appending to files, it's "append something to the end, then
> > > link it up in the beginning".  And for the "append" part we are
> > > fine with fallocate().  It's the "link up" part that completely
> > > fucks up fragmentation so far.
> > 
> > Wrong theory but same result.
> > The writes at the beginning just keep replacing a single extent
> > over and over, which has a worst-case effect of adding a single
> > fragment to the beginning of a file that would not otherwise be
> > fragmented.  The appends are causing fragmentation all by
> > themselves.  :-P
> 
> OTOH, the appending write and the header rewrite happen at roughly
> the same time, so the actual block allocations may end up close to
> each other as well.  But yes, one cannot rely on that.

The header rewrite is close to the last append, but that's not really
useful.  There will be one header near one appending write, but there
are also thousands of other appending writes separated in time and
space on the disk, even after fallocate has preallocated contiguous
space for the file.