On Tue, Jul 15, 2025 at 09:40:42PM +0100, Matthew Wilcox wrote:
> I've started looking at how the page cache can help filesystems handle
> compressed data better. Feedback would be appreciated! I'll probably
> say a few things which are obvious to anyone who knows how compressed
> files work, but I'm trying to be explicit about my assumptions.
>
> First, I believe that all filesystems work by compressing fixed-size
> plaintext into variable-sized compressed blocks. This would be a good
> point to stop reading and tell me about counterexamples.
As far as I know, btrfs with zstd does not use fixed-size plaintext. I am
going off the btrfs logic itself, not the zstd internals, which I am sadly
ignorant of. We are using the streaming interface, for whatever that is
worth.

Through the following call path, the len is piped from the async_chunk
through to zstd via the slightly weirdly named total_out parameter:

compress_file_range()
  btrfs_compress_folios()
    compression_compress_pages()
      zstd_compress_folios()
        zstd_get_btrfs_parameters()  // passes len
        zstd_init_cstream()          // passes len
        for-each-folio:
          zstd_compress_stream()     // last folio is truncated if short

# bpftrace to check the size at the zstd callsite
$ sudo bpftrace -e 'fentry:zstd_init_cstream {printf("%llu\n", args.pledged_src_size);}'
Attaching 1 probe...
76800

# different terminal, write a compressed extent with a weird source size
$ sudo dd if=/dev/zero of=/mnt/lol/foo bs=75k count=1

We do operate in terms of folios when calling zstd_compress_stream(), so
that can be thought of as a fixed-size plaintext block, but even so, we
pass in a short block for the last one:

$ sudo bpftrace -e 'fentry:zstd_compress_stream {printf("%llu\n", args.input->size);}'
Attaching 1 probe...
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
3072

> From what I've been reading in all your filesystems is that you want to
> allocate extra pages in the page cache in order to store the excess data
> retrieved along with the page that you're actually trying to read. That's
> because compressing in larger chunks leads to better compression.
>
> There's some discrepancy between filesystems whether you need scratch
> space for decompression. Some filesystems read the compressed data into
> the pagecache and decompress in-place, while other filesystems read the
> compressed data into scratch pages and decompress into the page cache.
>
> There also seems to be some discrepancy between filesystems whether the
> decompression involves vmap() of all the memory allocated or whether the
> decompression routines can handle doing kmap_local() on individual pages.
>
> So, my proposal is that filesystems tell the page cache that their minimum
> folio size is the compression block size. That seems to be around 64k,

btrfs has a max uncompressed extent size of 128K, for what it's worth. In
practice, many compressed files are composed of a large number of compressed
extents, each representing a 128K plaintext extent. Not sure if that is
exactly the constant you are concerned with here, or if it refutes your idea
in any way; I just figured I would mention it as well.

> so not an unreasonable minimum allocation size. That removes all the
> extra code in filesystems to allocate extra memory in the page cache.
> It means we don't attempt to track dirtiness at a sub-folio granularity
> (there's no point, we have to write back the entire compressed block
> at once). We also get a single virtually contiguous block ... if you're
> willing to ditch HIGHMEM support. Or there's a proposal to introduce a
> vmap_file() which would give us a virtually contiguous chunk of memory
> (and could be trivially turned into a noop for the case of trying to
> vmap a single large folio).
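
For anyone who wants to poke at the same pattern from userspace, here is a
minimal sketch using plain libzstd rather than the kernel's zstd_* wrappers
(the buffer sizes, variable names and compression level are my own
illustration, not anything btrfs does): it pledges the full 75 KiB source
size up front and then feeds 4096-byte chunks with a short 3072-byte final
chunk, mirroring the bpftrace output above.

/*
 * Userspace sketch (plain libzstd, not the kernel zstd wrappers) of the
 * streaming pattern traced above: pledge the total source size, then feed
 * fixed 4 KiB chunks with a short final chunk (75 KiB total).
 *
 * Build: cc -o stream-demo stream-demo.c -lzstd
 */
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

int main(void)
{
	const size_t src_len = 75 * 1024;	/* matches the dd bs=75k above */
	const size_t chunk = 4096;		/* one "folio" worth of plaintext */
	unsigned char *src = calloc(1, src_len);	/* zeroes, like /dev/zero */
	size_t dst_cap = ZSTD_compressBound(src_len);
	unsigned char *dst = malloc(dst_cap);
	ZSTD_CCtx *cctx = ZSTD_createCCtx();

	if (!src || !dst || !cctx)
		return 1;

	ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 3);
	/* analogous to the pledged_src_size passed to zstd_init_cstream() */
	ZSTD_CCtx_setPledgedSrcSize(cctx, src_len);

	ZSTD_outBuffer out = { .dst = dst, .size = dst_cap, .pos = 0 };
	for (size_t off = 0; off < src_len; off += chunk) {
		/* the last chunk is 3072 bytes, like the truncated final folio */
		size_t this_len = src_len - off < chunk ? src_len - off : chunk;
		ZSTD_inBuffer in = { .src = src + off, .size = this_len, .pos = 0 };
		ZSTD_EndDirective mode =
			(off + this_len == src_len) ? ZSTD_e_end : ZSTD_e_continue;
		size_t ret;

		do {
			ret = ZSTD_compressStream2(cctx, &out, &in, mode);
			if (ZSTD_isError(ret)) {
				fprintf(stderr, "zstd: %s\n", ZSTD_getErrorName(ret));
				return 1;
			}
			/* out never fills: dst_cap is ZSTD_compressBound(src_len) */
		} while (in.pos < in.size || (mode == ZSTD_e_end && ret != 0));

		printf("fed %zu bytes\n", this_len);
	}
	printf("compressed %zu -> %zu bytes\n", src_len, out.pos);

	ZSTD_freeCCtx(cctx);
	free(src);
	free(dst);
	return 0;
}

Pledging the source size presumably matters for the same reason btrfs plumbs
len down to zstd_init_cstream(): zstd can size its window and record the
content size in the frame header when it knows the total up front.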