On Thu, Jan 08, 2015 at 05:53:21PM +0100, Lennart Poettering wrote:
> On Thu, 08.01.15 10:56, Zygo Blaxell (ce3g8...@umail.furryterror.org) wrote:
> 
> > On Wed, Jan 07, 2015 at 06:43:15PM +0100, Lennart Poettering wrote:
> > > Heya!
> > > 
> > > Currently, systemd-journald's disk access patterns (appending to the
> > > end of files, then updating a few pointers in the front) result in
> > > awfully fragmented journal files on btrfs, which has a pretty
> > > negative effect on performance when accessing them.
> > > 
> > > Now, to improve things a bit, yesterday I made a change to journald
> > > to issue the btrfs defrag ioctl when a journal file is rotated,
> > > i.e. when we know that no further writes will ever be done on the
> > > file.
> > > 
> > > However, I wonder now if I should go one step further even, and use
> > > the equivalent of "chattr +C" (i.e. nocow) on all journal files. I am
> > > wondering what price I would precisely have to pay for
> > > that. Judging by this earlier thread:
> > > 
> > >         http://www.spinics.net/lists/linux-btrfs/msg33134.html
> > > 
> > > it's mostly about data integrity, which is something I can live with,
> > > given the conservative write patterns of journald, and the fact that
> > > we do our own checksumming and careful data validation. I mean, if
> > > btrfs in this mode provides no worse data integrity semantics than
> > > ext4 I am fully fine with losing this feature for these files.
> > 
> > This sounds to me like a job for fallocate with FALLOC_FL_KEEP_SIZE.
> 
> We already use fallocate(), but this is not enough on cow file
> systems. With fallocate() you can certainly improve fragmentation when
> appending things to a file. But on a COW file system this will help
> little if we change things in the beginning of the file, since COW
> means that it will then make a copy of those blocks and alter the
> copy, but leave the original version unmodified. And if we do that all
> the time the files get heavily fragmented, even though all the blocks
> we modify have been fallocate()d initially...

Hmmm...it seems the handwaving about tail-packing that I was previously
ignoring is important after all.

A few quick tests with filefrag show that btrfs isn't doing full
tail-packing, only small file allocation (i.e. files smaller than 4096
bytes get stored inline, and nothing else does, not even sparse files
with a single 1-byte extent at offset != 0).  So inline storage avoids
fragmentation only for files small enough to be stored inline.
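
A minimal sketch of how test files like the ones described above could be
created for inspection. The function name, file names and sizes are
hypothetical choices for illustration, not the exact files used in the
tests; the idea is to run `filefrag -v` on each file afterwards on a btrfs
mount.

```python
import os


def make_test_files(directory):
    """Create sample files to inspect with `filefrag -v` (names are illustrative)."""
    paths = {}

    # A file smaller than 4096 bytes, the case reported above as stored inline.
    paths["small"] = os.path.join(directory, "small")
    with open(paths["small"], "wb") as f:
        f.write(b"x" * 1000)

    # A file of exactly one 4096-byte block, just past the small-file case.
    paths["block"] = os.path.join(directory, "block")
    with open(paths["block"], "wb") as f:
        f.write(b"x" * 4096)

    # A sparse file whose only data is a single byte at a non-zero offset.
    paths["sparse"] = os.path.join(directory, "sparse")
    with open(paths["sparse"], "wb") as f:
        f.seek(1 << 20)
        f.write(b"x")

    return paths
```

After calling `make_test_files()` on a directory inside a btrfs filesystem
and running `sync`, the extent list of each file can be compared.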

Short appends to the end of the file effectively become modifications
of the last block of the file.  That triggers CoW on the append, and if
we're doing lots of tiny writes the file becomes extremely fragmented
(exactly the worst case of one fragment per block).  A mix of big and
small appends seems to use fallocated space for those writes that cover
complete blocks, which is arguably worse than not fallocating at all.
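
The effect of tiny appends can be illustrated with a toy model. This is
not btrfs code and makes loud simplifying assumptions: every append is
committed on its own, every block an append touches (including a partly
filled last block) is written to a fresh location, and locations chosen
by different appends are never adjacent. The function name and the 4096
byte block size are assumptions of the sketch.

```python
BLOCK = 4096  # block size assumed by this toy model


def simulate_appends(sizes, block=BLOCK):
    """Count fragments after a series of appends under the toy CoW model."""
    # logical block index -> (append number, position within that append's run)
    placement = {}
    length = 0
    for n, size in enumerate(sizes):
        if size <= 0:
            continue
        first = length // block
        last = (length + size - 1) // block
        # Every touched block is (re)written into this append's fresh run,
        # so a partly filled last block moves each time it grows.
        for i, logical in enumerate(range(first, last + 1)):
            placement[logical] = (n, i)
        length += size

    # Two neighbouring logical blocks share a fragment only if the same
    # append wrote them at consecutive positions.
    fragments = 0
    prev = None
    for logical in sorted(placement):
        cur = placement[logical]
        if prev is None or cur[0] != prev[0] or cur[1] != prev[1] + 1:
            fragments += 1
        prev = cur
    return fragments
```

Under these assumptions, 400 appends of 100 bytes each produce a 10-block
file in 10 fragments, while a single 40000-byte append produces the same
10 blocks in one fragment.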

So fallocate will not help until btrfs learns to do tail-packing, or
finds some other way to avoid this problem.
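
For reference, the kind of preallocation under discussion can be sketched
as below. This is only an illustration of the fallocate() call with
FALLOC_FL_KEEP_SIZE, not journald's actual code; it assumes 64-bit Linux
with a libc that exports fallocate(), and the helper name is hypothetical.

```python
import ctypes
import os

FALLOC_FL_KEEP_SIZE = 0x01  # value from <linux/falloc.h>

# Assumption: the running process already has libc loaded (true on Linux).
_libc = ctypes.CDLL(None, use_errno=True)
_libc.fallocate.argtypes = [
    ctypes.c_int,    # fd
    ctypes.c_int,    # mode
    ctypes.c_int64,  # offset (off_t, assumed 64-bit)
    ctypes.c_int64,  # len (off_t, assumed 64-bit)
]
_libc.fallocate.restype = ctypes.c_int


def preallocate_keep_size(fd, offset, length):
    """Reserve space for [offset, offset + length) without changing st_size."""
    if _libc.fallocate(fd, FALLOC_FL_KEEP_SIZE, offset, length) != 0:
        err = ctypes.get_errno()
        # Filesystems without fallocate support report an error here.
        raise OSError(err, os.strerror(err))
```

A caller would typically reserve a chunk past the current end of file
before appending. The reported file size stays the same; only the
allocation changes.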

> > This would work on ext4, xfs, and others, and provide the same benefit
> > (or even better) without filesystem-specific code.  journald would
> > preallocate a contiguous chunk past the end of the file for appends,
> > and
> 
> That's precisely what we do. But journald's write pattern is not
> purely appending to files, it's "append something to the end, then
> link it up in the beginning". And for the "append" part we are
> fine with fallocate(). It's the "link up" part that completely fucks
> up fragmentation so far.

Wrong theory but same result.  The writes at the beginning just keep
replacing a single extent over and over, which has a worst-case effect
of adding a single fragment to the beginning of a file that would not
otherwise be fragmented.  The appends are causing fragmentation all
by themselves.  :-P

> Lennart
> 
> -- 
> Lennart Poettering, Red Hat
