Re: is BTRFS_IOC_DEFRAG behavior optimal?

Chris Murphy Wed, 10 Feb 2021 19:41:11 -0800

On Wed, Feb 10, 2021 at 8:13 PM Zygo Blaxell
<[email protected]> wrote:
>
> Sorry, I busted my mail client.  That was from me.  :-P
>
> On Wed, Feb 10, 2021 at 10:08:37PM -0500, [email protected] wrote:
> > On Wed, Feb 10, 2021 at 08:14:09PM +0100, Goffredo Baroncelli wrote:
> > > Hi Chris,
> > >
> > > it seems that systemd-journald is more smart/complex than I thought:
> > >
> > > 1) systemd-journald set the "live" journal as NOCOW; *when* (see below) it
> > > closes the files, it mark again these as COW then defrag [1]
> > >
> > > 2) looking at the code, I suspect that systemd-journald closes the
> > > file asynchronously [2]. This means that looking at the "live" journal
> > > is not sufficient. In fact:
> > >
> > > /var/log/journal/e84907d099904117b355a99c98378dca$ sudo lsattr $(ls -rt *)
> > > [...]
> > > --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd4f-0005baed61106a18.journal
> > > --------------------- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000bd64-0005baed659feff4.journal
> > > --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd67-0005baed65a0901f.journal
> > > ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cc63-0005bafed4f12f0a.journal
> > > ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cc85-0005baff0ce27e49.journal
> > > ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cd38-0005baffe9080b4d.journal
> > > ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cd3b-0005baffe908f244.journal
> > > ---------------C----- user-1000.journal
> > > ---------------C----- system.journal
> > >
> > > The output above means that the last 6 files are "pending" for a
> > > de-fragmentation. When these will be "closed", the NOCOW flag will be
> > > removed and a defragmentation will start.
> >
> > Wait what?
> >
> > > Now my journals have few extents (2 or 3). But I saw cases where the
> > > extents of the more recent files are hundreds, but after a few
> > > "journalctl --rotate" the older files become less fragmented.
> > >
> > > [1] https://github.com/systemd/systemd/blob/fee6441601c979165ebcbb35472036439f8dad5f/src/libsystemd/sd-journal/journal-file.c#L383
> >
> > That line doesn't work, and systemd ignores the error.
> >
> > The NOCOW flag cannot be set or cleared unless the file is empty.
> > This is checked in btrfs_ioctl_setflags.
> >
> > This is not something that can be changed easily--if the NOCOW bit is
> > cleared on a non-empty file, btrfs data read code will expect csums
> > that aren't present on disk because they were written while the file was
> > NODATASUM, and the reads will fail pretty badly.  The entire file would
> > have to have csums added or removed at the same time as the flag change
> > (or all nodatacow file reads take a performance hit looking for csums
> > that may or may not be present).
> >
> > At file close, systemd should copy the data to a new file with no
> > special attributes and discard or recycle the old inode.  This copy
> > will be mostly contiguous and have desirable properties like csums and
> > compression, and will have iops equivalent to btrfs fi defrag.
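
For illustration, the copy-at-close approach suggested above might look
roughly like the Python sketch below. This is not what systemd does; the
function name and the temporary-file prefix are hypothetical. It relies on
a plain read/write copy (not a reflink) so that the data lands in newly
allocated extents, and it assumes the containing directory is not itself
marked NOCOW, since on btrfs new files inherit that attribute from a
directory that carries it.

```python
import os
import shutil
import tempfile


def rewrite_into_fresh_inode(path):
    """Copy `path` into a new file in the same directory, then swap it in.

    Hypothetical helper: the new file is created with no special
    attributes of its own, so it only ends up COW if the directory does
    not pass the NOCOW attribute on to new files.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".rewrite-")
    try:
        with os.fdopen(fd, "wb") as dst, open(path, "rb") as src:
            # Plain buffered copy: the data is written out again rather
            # than sharing the old, fragmented extents.
            shutil.copyfileobj(src, dst, length=1024 * 1024)
            dst.flush()
            os.fsync(dst.fileno())
        shutil.copymode(path, tmp_path)
        # Atomic rename; the old inode is released once nothing holds it open.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

The cost is one full rewrite of the file, which is the same order of I/O
as a defrag pass.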


Journals implement their own checksumming. Yeah, if there's
corruption, Btrfs raid can't do a transparent fixup. But the whole
journal isn't lost, just the affected record. *shrug* I think if (a)
nodatacow and/or (b) SSD, just leave it alone. Why add more writes?

In particular, in the nodatacow case I'm consistently seeing the file
made up of multiples of 8MB contiguous blocks; even on HDD the
seek latency here can't be worth defragging the file.

I think defrag makes sense for (a) datacow journals, i.e. the default
nodatacow is inhibited, on (b) HDD. In that case the fragmentation is
quite considerable, hundreds to thousands of extents. It's
sufficiently bad that it'd probably be better if they were
defragmented automatically, with a trigger that tests for the number of
non-contiguous small blocks and somehow cheaply estimates the latency of
reading all of them. Since the files are interleaved, doing something
like "systemctl status dbus" might actually read many blocks even if
the result isn't a whole heck of a lot of visible data.
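
A rough sketch of what such a trigger could look like, in Python. All
names and thresholds here are assumptions for illustration (1 MiB as the
"small" cutoff, 10 ms per seek, a 500 ms budget), and how the extent list
is obtained (e.g. from FIEMAP) is left out. It counts small extents that
don't start where the previous one ended and treats each as one seek.

```python
from typing import Iterable, Tuple

# (physical_offset_bytes, length_bytes), listed in logical file order
Extent = Tuple[int, int]

MIB = 1024 * 1024


def estimate_seeks(extents: Iterable[Extent],
                   small_extent_bytes: int = MIB) -> int:
    """Count small extents that are not contiguous with their predecessor."""
    seeks = 0
    prev_end = None
    for physical, length in extents:
        discontiguous = prev_end is not None and physical != prev_end
        if discontiguous and length < small_extent_bytes:
            seeks += 1
        prev_end = physical + length
    return seeks


def should_defrag(extents: Iterable[Extent],
                  seek_ms: float = 10.0,
                  budget_ms: float = 500.0,
                  small_extent_bytes: int = MIB) -> bool:
    """True if the estimated seek time exceeds the (assumed) latency budget."""
    return estimate_seeks(extents, small_extent_bytes) * seek_ms > budget_ms
```

With these numbers a file made of scattered 8MB extents never triggers,
while one made of a thousand scattered 4K extents does.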

But on SSD, cow or nocow, and HDD nocow - I think just leave them alone.

-- 
Chris Murphy
