Hi Matthew,
On 2025/7/16 04:40, Matthew Wilcox wrote:
> I've started looking at how the page cache can help filesystems handle
> compressed data better. Feedback would be appreciated! I'll probably
> say a few things which are obvious to anyone who knows how compressed
> files work, but I'm trying to be explicit about my assumptions.
>
> First, I believe that all filesystems work by compressing fixed-size
> plaintext into variable-sized compressed blocks. This would be a good
> point to stop reading and tell me about counterexamples.

At least a typical EROFS configuration works the other way around: it
compresses variable-sized plaintext (at least one block, e.g. 4KiB, but
also 4KiB+1, 4KiB+2, ...) into fixed-sized compressed blocks for
efficient I/O. That is really useful for small compression granularities
(e.g. 4KiB, 8KiB), because use cases like Android are usually under
memory pressure, so large compression granularities are almost
unacceptable in low-memory scenarios; see:
https://erofs.docs.kernel.org/en/latest/design.html
EROFS currently works pretty well on these devices and has been
successfully deployed on billions of real devices.
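Just to make the contrast concrete, here is a rough sketch of the two
extent models as I understand them; the struct names and fields below
are made up purely for illustration and are not real on-disk formats:

#include <linux/types.h>

/* Model 1: fixed-size plaintext chunk -> variable-sized compressed run */
struct fixed_plain_extent {
	u64 plain_off;		/* always a multiple of the chunk size (e.g. 64KiB) */
	u64 compressed_off;	/* byte offset of the compressed run */
	u32 compressed_len;	/* varies from chunk to chunk */
};

/* Model 2 (EROFS-like): variable-sized plaintext -> fixed-sized blocks */
struct fixed_compressed_extent {
	u64 plain_off;		/* logical offset of the uncompressed data */
	u32 plain_len;		/* >= one block: 4KiB, 4KiB+1, 4KiB+2, ... */
	u64 compressed_blkaddr;	/* block-aligned, read as whole blocks */
	u32 compressed_blks;	/* fixed number of fs blocks (e.g. one 4KiB block) */
};

In model 2 the compressed I/O is always whole, block-aligned blocks,
which is what keeps small compression granularities efficient to read.
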
> From what I've been reading of all your filesystems, you want to
> allocate extra pages in the page cache in order to store the excess data
> retrieved along with the page that you're actually trying to read. That's
> because compressing in larger chunks leads to better compression.
>
> There's some discrepancy between filesystems whether you need scratch
> space for decompression. Some filesystems read the compressed data into
> the pagecache and decompress in-place, while other filesystems read the
> compressed data into scratch pages and decompress into the page cache.
>
> There also seems to be some discrepancy between filesystems whether the
> decompression involves vmap() of all the memory allocated or whether the
> decompression routines can handle doing kmap_local() on individual pages.
>
> So, my proposal is that filesystems tell the page cache that their minimum
> folio size is the compression block size. That seems to be around 64k,
> so not an unreasonable minimum allocation size. That removes all the
> extra code in filesystems to allocate extra memory in the page cache.
>
> It means we don't attempt to track dirtiness at a sub-folio granularity
> (there's no point, we have to write back the entire compressed block
> at once). We also get a single virtually contiguous block ... if you're
> willing to ditch HIGHMEM support. Or there's a proposal to introduce a
> vmap_file() which would give us a virtually contiguous chunk of memory
> (and could be trivially turned into a noop for the case of trying to
> vmap a single large folio).

I don't see how this will work for EROFS, because EROFS always supports
variable uncompressed extent lengths, and a fixed minimum folio size
would break typical EROFS use cases and the on-disk formats.

The other thing is that large-order folios (physically contiguous) will
cause "increase the latency on UX task with filemap_fault()" because of
high-order direct reclaim; see:
https://android-review.googlesource.com/c/kernel/common/+/3692333
So EROFS will not set a minimum folio order and will keep supporting
order-0 folios.
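
To be concrete about what "setting min-order" means here (just a sketch;
I'm assuming the mapping_set_folio_min_order() helper from the recent
minimum folio order work, the exact name may differ by kernel version):

#include <linux/pagemap.h>

static void example_setup_mapping(struct address_space *mapping, bool min64k)
{
	if (min64k) {
		/*
		 * Force every folio in this mapping to be at least 64KiB
		 * (order 4 with 4KiB pages), so every page cache allocation
		 * for it is a high-order allocation.
		 */
		mapping_set_folio_min_order(mapping, 4);
	} else {
		/*
		 * Alternative: allow large folios opportunistically but keep
		 * order-0 as the minimum, so no high-order allocation is
		 * ever required on the fault path.
		 */
		mapping_set_large_folios(mapping);
	}
}
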
I don't think EROFS will use this new approach; the vmap() interface is
always what we rely on (see the rough sketch below).
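
The pattern I mean is roughly the following (a simplified sketch, not the
real z_erofs code; my_decompress() stands for whatever LZ4/LZMA/... routine
is in use, and error handling is trimmed):

#include <linux/vmalloc.h>
#include <linux/mm.h>

static int sketch_decompress(struct page **out_pages, unsigned int nr_out,
			     const void *in, unsigned int inlen,
			     unsigned int outlen)
{
	void *out;
	int err;

	/* Map the (possibly discontiguous) destination pages contiguously. */
	out = vmap(out_pages, nr_out, VM_MAP, PAGE_KERNEL);
	if (!out)
		return -ENOMEM;

	err = my_decompress(in, inlen, out, outlen);	/* hypothetical helper */

	vunmap(out);
	return err;
}

With order-0 folios the destination pages don't need to be physically
contiguous at all; vmap() still gives the decompressor one linear buffer.
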
Thanks,
Gao Xiang