Sorry, I busted my mail client.  That was from me.  :-P

On Wed, Feb 10, 2021 at 10:08:37PM -0500, kreij...@inwind.it wrote:
> On Wed, Feb 10, 2021 at 08:14:09PM +0100, Goffredo Baroncelli wrote:
> > Hi Chris,
> >
> > it seems that systemd-journald is smarter/more complex than I thought:
> >
> > 1) systemd-journald sets the "live" journal as NOCOW; *when* (see below) it
> > closes the files, it marks them as COW again and then defrags them [1]
> >
> > 2) looking at the code, I suspect that systemd-journald closes the
> > file asynchronously [2]. This means that looking at the "live" journal
> > is not sufficient. In fact:
> >
> > /var/log/journal/e84907d099904117b355a99c98378dca$ sudo lsattr $(ls -rt *)
> > [...]
> > --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd4f-0005baed61106a18.journal
> > --------------------- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000bd64-0005baed659feff4.journal
> > --------------------- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000bd67-0005baed65a0901f.journal
> > ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cc63-0005bafed4f12f0a.journal
> > ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cc85-0005baff0ce27e49.journal
> > ---------------C----- system@3f2405cf9bcf42f0abe6de5bc702e394-000000000000cd38-0005baffe9080b4d.journal
> > ---------------C----- user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-000000000000cd3b-0005baffe908f244.journal
> > ---------------C----- user-1000.journal
> > ---------------C----- system.journal
> >
> > The output above means that the last 6 files are "pending" for
> > defragmentation. When these are "closed", the NOCOW flag will be
> > removed and a defragmentation will start.
>
> Wait what?
>
> > Now my journals have few extents (2 or 3). But I saw cases where the
> > more recent files have hundreds of extents, and after a few
> > "journalctl --rotate" the older files become less fragmented.
> >
> > [1]
> > https://github.com/systemd/systemd/blob/fee6441601c979165ebcbb35472036439f8dad5f/src/libsystemd/sd-journal/journal-file.c#L383
>
> That line doesn't work, and systemd ignores the error.
>
> The NOCOW flag cannot be set or cleared unless the file is empty.
> This is checked in btrfs_ioctl_setflags.
>
> This is not something that can be changed easily--if the NOCOW bit is
> cleared on a non-empty file, btrfs data read code will expect csums
> that aren't present on disk because they were written while the file was
> NODATASUM, and the reads will fail pretty badly.  The entire file would
> have to have csums added or removed at the same time as the flag change
> (or all nodatacow file reads take a performance hit looking for csums
> that may or may not be present).
>
> At file close, systemd should copy the data to a new file with no
> special attributes and discard or recycle the old inode.  This copy
> will be mostly contiguous and have desirable properties like csums and
> compression, and will have iops equivalent to btrfs fi defrag.
>
> > [2]
> > https://github.com/systemd/systemd/blob/fee6441601c979165ebcbb35472036439f8dad5f/src/libsystemd/sd-journal/journal-file.c#L3687
> >
> > On 2/10/21 7:37 AM, Chris Murphy wrote:
> > > This is an active (but idle) system.journal file. That is, it's open
> > > but not being written to. I did a sync right before this:
> > >
> > > https://pastebin.com/jHh5tfpe
> > >
> > > And then: btrfs fi defrag -l 8M system.journal
> > >
> > > https://pastebin.com/Kq1GjJuh
> > >
> > > Looks like most of it was a no op. So it seems btrfs in this case is
> > > not confused by so many small extent items, it knows they are
> > > contiguous?
> > >
> > > It doesn't answer the question what the "too small" threshold is for
> > > BTRFS_IOC_DEFRAG, which is what sd-journald is using, though.
> > >
> > > Another sync, and then, 'journalctl --rotate' and the resulting
> > > archived file is now:
> > >
> > > https://pastebin.com/aqac0dRj
> > >
> > > These are not the same results between the two ioctls for the same
> > > file, and not the same result as what you get with -l 32M (which I do
> > > get if I use the default 32M). The BTRFS_IOC_DEFRAG interleaved result
> > > is peculiar, but I don't think we can say it's ineffective, it might
> > > be an intentional no op either because it's nodatacow or it sees that
> > > these many extents are mostly contiguous and not worth defragmenting
> > > (which would be good for keeping write amplification down).
> > >
> > > So I don't know, maybe it's not wrong.
> > >
> > > --
> > > Chris Murphy
> > >
> >
> >
> > --
> > gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> > Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
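
For illustration only, here is a minimal sketch of the "copy the data to a new file at close" idea from the reply above. It is not what systemd does, and it is written in Python rather than C purely for brevity. The function name rewrite_as_new_file and the ".archive-" temp prefix are made up for this example. It assumes the journal is already closed by its writer, and that the containing directory does not itself carry the C attribute (on btrfs, new files inherit NOCOW from a +C directory, which would defeat the purpose).

```python
import os
import shutil
import tempfile


def rewrite_as_new_file(path: str) -> None:
    """Replace `path` with a freshly written copy that lives in a new inode.

    Hypothetical helper: the data is copied with a plain read/write loop, so
    the filesystem allocates new extents for it, and the new file carries
    only whatever attributes it inherits from its directory.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the replacement in the same directory so the final rename
    # stays within one filesystem and is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".archive-")
    try:
        with os.fdopen(fd, "wb") as dst, open(path, "rb") as src:
            shutil.copyfileobj(src, dst, length=1024 * 1024)
            dst.flush()
            os.fsync(dst.fileno())  # make sure the copy is on disk first
        shutil.copymode(path, tmp_path)  # keep the original permission bits
        os.replace(tmp_path, path)  # old inode is dropped once unreferenced
    except BaseException:
        # Do not leave a half-written temporary file behind.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    import sys

    for journal in sys.argv[1:]:
        rewrite_as_new_file(journal)
```

Run against an archived journal, this trades one sequential read and one sequential write for a new file; whether the result ends up with csums and compression depends on the mount options and directory attributes, which this sketch does not check.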