On Wed, Feb 10, 2021 at 11:12 PM Zygo Blaxell <ce3g8...@umail.furryterror.org> wrote:
>
> If we want the data compressed (and who doesn't? journal data compresses
> 8:1 with btrfs zstd) then we'll always need to make a copy at close.
> Because systemd used prealloc, the copy is necessarily to a new inode,
> as there's no way to re-enable compression on an inode once prealloc
> is used (this has deep disk-format reasons, but not as deep as the
> nodatacow ones).

Pretty sure sd-journald still fallocates when the journals are datacow
(made so by touching /etc/tmpfiles.d/journal-nocow.conf). And I know for
sure those datacow files do compress on rotation.

Preallocated datacow might not be so bad if it weren't for that one damn
header or indexing block, whatever the proper term is, that sd-journald
hammers every time it fsyncs. I don't know if I want to know what it
means to snapshot a datacow file that's prealloc. But in theory, if the
same blocks weren't all being hammered, a preallocated file shouldn't
fragment like hell when each prealloc block gets just one write.

> If we don't care about compression or datasums, then keep the file
> nodatacow and do nothing at close. The defrag isn't needed and the
> FS_NOCOW_FL flag change doesn't work.

Agreed.

> It makes sense for SSD too. It's 4K extents, so the metadata and small-IO
> overheads will be non-trivial even on SSD. Deleting or truncating datacow
> journal files will put a lot of tiny free space holes into the filesystem.
> It will flood the next commit with delayed refs and push up latency.

I haven't seen meaningful latency on a single journal file, datacow and
heavily fragmented, on SSD. But to test on more than one file at a time
I need to revert the defrag commits, build systemd, and let a bunch of
journals accumulate somehow. If I dump too much data artificially to try
to mimic aging, I know I will get nowhere near as many of those 4KiB
extents. So I dunno.

>
> > In that case the fragmentation is quite considerable, hundreds to
> > thousands of extents. It's sufficiently bad that it'd probably be
> > better if they were defragmented automatically, with a trigger that
> > tests for the number of non-contiguous small blocks and somehow
> > cheaply estimates the latency of reading all of them.
>
> Yeah it would be nice if autodefrag could be made to not suck.

It triggers on inserts, not appends, so it doesn't do anything for the
sd-journald case. I would think the active journals are the ones more
likely to get searched for recent events than the archived ones, so in
the datacow case you only get relief once a journal is rotated.

It'd be nice to find a decent, not necessarily perfect, way for them to
not get so fragmented in the first place. Or just defrag once a file has
16M of non-contiguous extents. Estimating extents, though, is another
issue, especially with compression enabled.

--
Chris Murphy
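
P.S. Since I keep hand-waving about an extent-count trigger, here's a
rough, untested sketch of what I mean. The threshold is arbitrary, and
it just leans on FIEMAP plus BTRFS_IOC_DEFRAG_RANGE from the kernel uapi
headers; it's not anything journald actually does today.

/*
 * Untested sketch: count a file's extents with FIEMAP and ask btrfs to
 * defrag it once the count crosses a made-up threshold.  Note the
 * estimation problem mentioned above: with compression enabled, each
 * compressed extent (up to ~128K of data) is reported separately, so
 * the count runs high even for files laid out reasonably well.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>
#include <linux/fiemap.h>
#include <linux/btrfs.h>

#define EXTENT_THRESHOLD 256    /* arbitrary trigger, not a recommendation */

static int count_extents(int fd, unsigned int *count)
{
        struct fiemap fm;

        memset(&fm, 0, sizeof(fm));
        fm.fm_start = 0;
        fm.fm_length = FIEMAP_MAX_OFFSET;
        fm.fm_flags = FIEMAP_FLAG_SYNC;
        fm.fm_extent_count = 0;  /* 0 = just report how many extents exist */

        if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0)
                return -1;

        *count = fm.fm_mapped_extents;
        return 0;
}

int main(int argc, char **argv)
{
        struct btrfs_ioctl_defrag_range_args args;
        unsigned int n;
        int fd;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }

        /* open read-write, since we may ask for a defrag */
        fd = open(argv[1], O_RDWR);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        if (count_extents(fd, &n) < 0) {
                perror("FIEMAP");
                close(fd);
                return 1;
        }
        printf("%s: %u extents\n", argv[1], n);

        if (n > EXTENT_THRESHOLD) {
                memset(&args, 0, sizeof(args));
                args.len = (__u64)-1;    /* whole file */
                args.extent_thresh = 0;  /* 0 = kernel's default target size */
                if (ioctl(fd, BTRFS_IOC_DEFRAG_RANGE, &args) < 0)
                        perror("BTRFS_IOC_DEFRAG_RANGE");
                else
                        printf("defrag requested\n");
        }

        close(fd);
        return 0;
}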