Sorry, I busted my mail client.  That was from me.  :-P

On Wed, Feb 10, 2021 at 10:08:37PM -0500, kreij...@inwind.it wrote:
> On Wed, Feb 10, 2021 at 08:14:09PM +0100, Goffredo Baroncelli wrote:
> > Hi Chris,
> > 
> > it seems that systemd-journald is smarter/more complex than I thought:
> > 
> > 1) systemd-journald sets the "live" journal as NOCOW; *when* (see below) it
> > closes the files, it marks them as COW again and then defragments them [1]
> > 
> > 2) looking at the code, I suspect that systemd-journald closes the
> > file asynchronously [2]. This means that looking at the "live" journal
> > is not sufficient. In fact:
> > 
> > /var/log/journal/e84907d099904117b355a99c98378dca$ sudo lsattr $(ls -rt *)
> > [...]
> > --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd4f-0005baed61106a18.journal
> > --------------------- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000bd64-0005baed659feff4.journal
> > --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd67-0005baed65a0901f.journal
> > ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cc63-0005bafed4f12f0a.journal
> > ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cc85-0005baff0ce27e49.journal
> > ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cd38-0005baffe9080b4d.journal
> > ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cd3b-0005baffe908f244.journal
> > ---------------C----- user-1000.journal
> > ---------------C----- system.journal
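
(Inline note: to pick out the files that still carry the C (NOCOW)
attribute from a listing like the one above, something along these lines
could be used. This is only a sketch. It assumes the usual lsattr layout
of one "flags filename" pair per line; the function name nocow_files is
made up for illustration.)

```python
from typing import List


def nocow_files(lsattr_output: str) -> List[str]:
    """Return the names of files whose lsattr flags include 'C' (NOCOW)."""
    result = []
    for line in lsattr_output.splitlines():
        # Split into the flags column and the rest of the line (the name).
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        flags, name = parts
        # Skip lines whose first column is not a flags field,
        # e.g. error messages printed by lsattr.
        if not all(c == "-" or c.isalpha() for c in flags):
            continue
        if "C" in flags:
            result.append(name.strip())
    return result
```

Feeding it the output of "lsattr *" in the journal directory would list
the journals that are still NOCOW.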
> > 
> > The output above means that the last 6 files are "pending"
> > defragmentation. When they are "closed", the NOCOW flag will be removed
> > and a defragmentation will start.
> 
> Wait what?
> 
> > Now my journals have few extents (2 or 3). But I have seen cases where the
> > more recent files have hundreds of extents, and after a few
> > "journalctl --rotate" runs the older files become less fragmented.
> > 
> > [1] 
> > https://github.com/systemd/systemd/blob/fee6441601c979165ebcbb35472036439f8dad5f/src/libsystemd/sd-journal/journal-file.c#L383
> 
> That line doesn't work, and systemd ignores the error.
> 
> The NOCOW flag cannot be set or cleared unless the file is empty.
> This is checked in btrfs_ioctl_setflags.
> 
> This is not something that can be changed easily--if the NOCOW bit is
> cleared on a non-empty file, btrfs data read code will expect csums
> that aren't present on disk because they were written while the file was
> NODATASUM, and the reads will fail pretty badly.  The entire file would
> have to have csums added or removed at the same time as the flag change
> (or all nodatacow file reads take a performance hit looking for csums
> that may or may not be present).
> 
> At file close, systemd should copy the data to a new file with no
> special attributes and discard or recycle the old inode.  This copy
> will be mostly contiguous and have desirable properties like csums and
> compression, and will have iops equivalent to btrfs fi defrag.
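
(Inline note: a minimal sketch of the copy-on-close idea described above,
not what systemd does. It assumes the directory itself is not marked
NOCOW, since new files inherit that attribute from their directory. The
name archive_by_copy is hypothetical. A plain read/write loop is used on
purpose so the data is actually rewritten rather than shared with the old
file.)

```python
import os
import shutil
import tempfile


def archive_by_copy(src_path: str) -> None:
    """Replace src_path with a freshly written copy of its contents.

    The copy is a new inode created with no special attributes, so the
    filesystem is free to apply csums and compression to it.
    """
    directory = os.path.dirname(os.path.abspath(src_path))
    # Create the new file in the same directory so the final rename
    # stays within one filesystem and is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".archive-")
    try:
        with os.fdopen(fd, "wb") as dst, open(src_path, "rb") as src:
            # Plain buffered copy in 1 MiB chunks.
            shutil.copyfileobj(src, dst, length=1024 * 1024)
            dst.flush()
            os.fsync(dst.fileno())
        # Keep the original permission bits on the new file.
        shutil.copymode(src_path, tmp_path)
        # Atomically swap the new inode in; the old inode is discarded
        # once its last open file descriptor is closed.
        os.replace(tmp_path, src_path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Calling archive_by_copy() on an archived journal after it has been closed
would leave a file with the same name and contents but a new inode.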
> 
> > [2] 
> > https://github.com/systemd/systemd/blob/fee6441601c979165ebcbb35472036439f8dad5f/src/libsystemd/sd-journal/journal-file.c#L3687
> > 
> > On 2/10/21 7:37 AM, Chris Murphy wrote:
> > > This is an active (but idle) system.journal file. That is, it's open
> > > but not being written to. I did a sync right before this:
> > > 
> > > https://pastebin.com/jHh5tfpe
> > > 
> > > And then: btrfs fi defrag -l 8M system.journal
> > > 
> > > https://pastebin.com/Kq1GjJuh
> > > 
> > > Looks like most of it was a no op. So it seems btrfs in this case is
> > > not confused by so many small extent items, it knows they are
> > > contiguous?
> > > 
> > > It doesn't answer the question what the "too small" threshold is for
> > > BTRFS_IOC_DEFRAG, which is what sd-journald is using, though.
> > > 
> > > Another sync, and then, 'journalctl --rotate' and the resulting
> > > archived file is now:
> > > 
> > > https://pastebin.com/aqac0dRj
> > > 
> > > These are not the same results between the two ioctls for the same
> > > file, and not the same result as what you get with -l 32M (which I do
> > > get if I use the default 32M). The BTRFS_IOC_DEFRAG interleaved result
> > > is peculiar, but I don't think we can say it's ineffective, it might
> > > be an intentional no op either because it's nodatacow or it sees that
> > > these many extents are mostly contiguous and not worth defragmenting
> > > (which would be good for keeping write amplification down).
> > > 
> > > So I don't know, maybe it's not wrong.
> > > 
> > > --
> > > Chris Murphy
> > > 
> > 
> > 
> > -- 
> > gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> > Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
