On Tue, Feb 9, 2021 at 12:45 PM Goffredo Baroncelli <kreij...@inwind.it> wrote:
>
> On 2/9/21 8:01 PM, Chris Murphy wrote:
> > On Tue, Feb 9, 2021 at 11:13 AM Goffredo Baroncelli <kreij...@inwind.it> wrote:
> >>
> >> On 2/9/21 1:42 AM, Chris Murphy wrote:
> >>> Perhaps. Attach strace to journald before --rotate, and then --rotate
> >>>
> >>> https://pastebin.com/UGihfCG9
> >>
> >> I looked to this strace.
> >>
> >> in line 115: it is called a ioctl(<BTRFS-DEFRAG>)
> >> in line 123: it is called a ioctl(<BTRFS-DEFRAG>)
> >>
> >> However the two descriptors for which the defrag is invoked are never
> >> sync-ed before.
> >>
> >> I was expecting is to see a sync (flush the data on the platters) and then a
> >> ioctl(<BTRFS-defrag>. This doesn't seems to be looking from the strace.
> >>
> >> I wrote a script (see below) which basically:
> >> - create a fragmented file
> >> - run filefrag on it
> >> - optionally sync the file <-----
> >> - run btrfs fi defrag on it
> >> - run filefrag on it
> >>
> >> If I don't perform the sync, the defrag is ineffective. But if I sync the
> >> file BEFORE doing the defrag, I got only one extent.
> >> Now my hypothesis is: the journal log files are bad de-fragmented because these
> >> are not sync-ed before.
> >> This could be tested quite easily putting an fsync() before the
> >> ioctl(<BTRFS_DEFRAG>).
> >>
> >> Any thought ?
> >
> > No idea. If it's a full sync then it could be expensive on either
> > slower devices or heavier workloads. On the one hand, there's no point
> > of doing an ineffective defrag so maybe the defrag ioctl should just
> > do the sync first? On the other hand, this would effectively make the
> > defrag ioctl a full file system sync which might be unexpected. It's a
> > set of tradeoffs and I don't know what the expectation is.
> >
> > What about fdatasync() on the journal file rather than a full sync?
>
> I tried a fsync(2) call, and the results is the same.
> Only after reading your reply I realized that I used a sync(2), when
> I meant to use fsync(2).
>
> I update my python test code

OK, fsync should be the least costly of the three.

The three unique things about systemd-journald that might be factors:

* nodatacow file
* fallocated file in 8MB increments multiple times up to 128M
* BTRFS_IOC_DEFRAG, whereas btrfs-progs uses BTRFS_IOC_DEFRAG_RANGE

So maybe it's all explained by the lack of fsync; I'm not sure. But the
commit that added this doesn't show any form of sync.

https://github.com/systemd/systemd/commit/f27a386430cc7a27ebd06899d93310fb3bd4cee7

-- 
Chris Murphy
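
For anyone who wants to try the hypothesis from this thread on a single file, here is a minimal Python sketch of "fsync, then BTRFS_IOC_DEFRAG". It is not what journald does today and not the test script mentioned above. The helper names (`_iow`, `fsync_then_defrag`) are made up for illustration, and the request number is computed assuming the generic Linux ioctl encoding used on x86-64 and arm64 (some architectures encode the direction bits differently) and a 4096-byte `struct btrfs_ioctl_vol_args`.

```python
import fcntl
import os


def _iow(ioc_type, nr, size):
    """Build a Linux _IOW() request number.

    Assumption: generic asm ioctl layout (x86-64, arm64, ...), where the
    write direction bit is 1 and sits at bit 30.
    """
    ioc_write = 1
    return (ioc_write << 30) | (size << 16) | (ioc_type << 8) | nr


BTRFS_IOCTL_MAGIC = 0x94
# Assumption: struct btrfs_ioctl_vol_args is an __s64 fd plus a
# 4088-byte name buffer, i.e. 4096 bytes in total.
BTRFS_IOC_DEFRAG = _iow(BTRFS_IOCTL_MAGIC, 2, 4096)


def fsync_then_defrag(fd, ioctl=fcntl.ioctl):
    """Flush one file's dirty data, then ask btrfs to defragment it.

    `ioctl` is injectable only so the call can be exercised without a
    btrfs filesystem; by default the real fcntl.ioctl is used.
    """
    # fsync(2) touches only this file, unlike sync(2) which flushes
    # everything.
    os.fsync(fd)
    # journald passes NULL as the argument for the plain (non-range)
    # defrag ioctl; 0 is the equivalent here.
    ioctl(fd, BTRFS_IOC_DEFRAG, 0)
```

To compare, open a fragmented file on btrfs with `open(path, "r+b")`, run filefrag on it, call `fsync_then_defrag(f.fileno())`, and run filefrag again. Whether the fsync is really what makes the difference for journal files is still only the hypothesis under discussion.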