On Wed, Feb 10, 2021 at 8:13 PM Zygo Blaxell <ce3g8...@umail.furryterror.org> wrote:
>
> Sorry, I busted my mail client. That was from me. :-P
>
> On Wed, Feb 10, 2021 at 10:08:37PM -0500, kreij...@inwind.it wrote:
> > On Wed, Feb 10, 2021 at 08:14:09PM +0100, Goffredo Baroncelli wrote:
> > > Hi Chris,
> > >
> > > it seems that systemd-journald is more smart/complex than I thought:
> > >
> > > 1) systemd-journald set the "live" journal as NOCOW; *when* (see below) it
> > > closes the files, it mark again these as COW then defrag [1]
> > >
> > > 2) looking at the code, I suspect that systemd-journald closes the
> > > file asynchronously [2]. This means that looking at the "live" journal
> > > is not sufficient. In fact:
> > >
> > > /var/log/journal/e84907d099904117b355a99c98378dca$ sudo lsattr $(ls -rt *)
> > > [...]
> > > --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd4f-0005baed61106a18.journal
> > > --------------------- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000bd64-0005baed659feff4.journal
> > > --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd67-0005baed65a0901f.journal
> > > ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cc63-0005bafed4f12f0a.journal
> > > ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cc85-0005baff0ce27e49.journal
> > > ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cd38-0005baffe9080b4d.journal
> > > ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cd3b-0005baffe908f244.journal
> > > ---------------C----- user-1000.journal
> > > ---------------C----- system.journal
> > >
> > > The output above means that the last 6 files are "pending" for a
> > > de-fragmentation. When these will be "closed", the NOCOW flag will be
> > > removed and a defragmentation will start.
> >
> > Wait what?
> >
> > > Now my journals have few (2 or 3 extents). But I saw cases where the
> > > extents of the more recent files are hundreds, but after few
> > > "journalctl --rotate" the older files become less fragmented.
> > >
> > > [1]
> > > https://github.com/systemd/systemd/blob/fee6441601c979165ebcbb35472036439f8dad5f/src/libsystemd/sd-journal/journal-file.c#L383
> >
> > That line doesn't work, and systemd ignores the error.
> >
> > The NOCOW flag cannot be set or cleared unless the file is empty.
> > This is checked in btrfs_ioctl_setflags.
> >
> > This is not something that can be changed easily--if the NOCOW bit is
> > cleared on a non-empty file, btrfs data read code will expect csums
> > that aren't present on disk because they were written while the file was
> > NODATASUM, and the reads will fail pretty badly. The entire file would
> > have to have csums added or removed at the same time as the flag change
> > (or all nodatacow file reads take a performance hit looking for csums
> > that may or may not be present).
> >
> > At file close, the systemd should copy the data to a new file with no
> > special attributes and discard or recycle the old inode. This copy
> > will be mostly contiguous and have desirable properties like csums and
> > compression, and will have iops equivalent to btrfs fi defrag.
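
For illustration, the copy-at-close idea in the quoted text could be sketched roughly as below. This is only a sketch, not what systemd does: the function name copy_to_fresh_inode, the temp-file prefix, and the chunk size are all made up for the example. It uses a plain read/write loop rather than a reflink or clone, so the new file gets its own freshly written extents.

```python
import os
import shutil
import tempfile


def copy_to_fresh_inode(path, chunk_size=1 << 20):
    """Copy `path` into a brand-new file, then rename it over the original.

    Hypothetical helper for illustration only. Caveat: on btrfs a new file
    inherits NOCOW from a directory that carries the C attribute, so this
    sketch assumes the containing directory is not marked +C.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the replacement in the same directory so the rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".rewrite-")
    try:
        with os.fdopen(fd, "wb") as dst, open(path, "rb") as src:
            while True:
                chunk = src.read(chunk_size)
                if not chunk:
                    break
                dst.write(chunk)
            # Make sure the data is on disk before the old inode goes away.
            dst.flush()
            os.fsync(dst.fileno())
        # mkstemp creates the file with mode 0600; carry over the old mode.
        shutil.copymode(path, tmp_path)
        # Atomically swap the new inode in; the old one is discarded.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Ownership, ACLs and xattrs are not carried over here; a real implementation would have to decide which of those to preserve.
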
Journals implement their own checksumming. Yeah, if there's
corruption, Btrfs raid can't do a transparent fixup. But the whole
journal isn't lost, just the affected record. *shrug* I think if (a)
nodatacow and/or (b) SSD, just leave it alone. Why add more writes?

In particular, in the nodatacow case I'm consistently seeing the file
made up of multiples of 8MB contiguous blocks; even on HDD, the seek
latency here can't be worth defragging the file.

I think defrag makes sense for (a) datacow journals, i.e. the default
nodatacow is inhibited, and (b) HDD. In that case the fragmentation is
quite considerable, hundreds to thousands of extents. It's
sufficiently bad that it'd probably be better if they were
defragmented automatically, with a trigger that tests for the number
of non-contiguous small blocks and somehow cheaply estimates the
latency of reading all of them. Since the files are interleaved, doing
something like "systemctl status dbus" might actually read many blocks
even if the result isn't a whole heck of a lot of visible data.

But on SSD, cow or nocow, and HDD nocow - I think just leave them alone.

--
Chris Murphy
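
The policy above could be written down roughly as follows. This is a sketch under stated assumptions, not an existing systemd or btrfs interface: the function name should_defragment and both thresholds (128 KiB for a "small" extent, 100 small extents as the trigger) are placeholders picked for illustration, and counting small extents is only a crude stand-in for a real read-latency estimate.

```python
def should_defragment(is_nocow, is_rotational, extent_sizes,
                      small_extent_bytes=128 * 1024, max_small_extents=100):
    """Return True only for heavily fragmented datacow journals on HDD.

    All parameter names and default thresholds are hypothetical.
    `extent_sizes` is a list of extent lengths in bytes.
    """
    # SSD (cow or nocow) and HDD nocow: leave the file alone.
    if is_nocow or not is_rotational:
        return False
    # HDD + datacow: count the small extents as a cheap fragmentation proxy.
    small = sum(1 for size in extent_sizes if size < small_extent_bytes)
    return small > max_small_extents
```

For example, a datacow journal on HDD made of 500 extents of 4 KiB would qualify, while a nocow journal made of a few 8MB extents would not, whatever the device.
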