On Wed, Jul 16, 2025 at 8:28 AM Gao Xiang <hsiang...@linux.alibaba.com> wrote:
>
>
> On 2025/7/16 07:32, Gao Xiang wrote:
> > Hi Matthew,
> >
> > On 2025/7/16 04:40, Matthew Wilcox wrote:
> >> I've started looking at how the page cache can help filesystems handle
> >> compressed data better. Feedback would be appreciated! I'll probably
> >> say a few things which are obvious to anyone who knows how compressed
> >> files work, but I'm trying to be explicit about my assumptions.
> >>
> >> First, I believe that all filesystems work by compressing fixed-size
> >> plaintext into variable-sized compressed blocks. This would be a good
> >> point to stop reading and tell me about counterexamples.
> >
> > At least the typical EROFS compresses variable-sized plaintext (at least
> > one block, e.g. 4k, but also 4k+1, 4k+2, ...) into fixed-sized compressed
> > blocks for efficient I/Os, which is really useful for small compression
> > granularity (e.g. 4KiB, 8KiB) because use cases like Android are usually
> > under memory pressure, so large compression granularity is almost
> > unacceptable in low-memory scenarios, see:
> > https://erofs.docs.kernel.org/en/latest/design.html
> >
> > Currently EROFS works pretty well on these devices and has been
> > successfully deployed in billions of real devices.
> >
> >> From what I've been reading, all your filesystems want to allocate
> >> extra pages in the page cache in order to store the excess data
> >> retrieved along with the page that you're actually trying to read.
> >> That's because compressing in larger chunks leads to better compression.
> >>
> >> There's some discrepancy between filesystems as to whether you need
> >> scratch space for decompression. Some filesystems read the compressed
> >> data into the page cache and decompress in place, while other filesystems
> >> read the compressed data into scratch pages and decompress into the
> >> page cache.
> >>
> >> There also seems to be some discrepancy between filesystems as to whether
> >> the decompression involves vmap() of all the memory allocated or whether
> >> the decompression routines can handle doing kmap_local() on individual
> >> pages.
> >>
> >> So, my proposal is that filesystems tell the page cache that their minimum
> >> folio size is the compression block size. That seems to be around 64k,
> >> so not an unreasonable minimum allocation size. That removes all the
> >> extra code in filesystems to allocate extra memory in the page cache.
> >> It means we don't attempt to track dirtiness at a sub-folio granularity
> >> (there's no point, we have to write back the entire compressed block
> >> at once). We also get a single virtually contiguous block ... if you're
> >> willing to ditch HIGHMEM support. Or there's a proposal to introduce a
> >> vmap_file() which would give us a virtually contiguous chunk of memory
> >> (and could be trivially turned into a noop for the case of trying to
> >> vmap a single large folio).
> >
> > I don't see how this will work for EROFS, because EROFS always supports
> > variable uncompressed extent lengths, and this would break typical
> > EROFS use cases and on-disk formats.
> >
> > Another thing is that large-order (physically consecutive) folios will
> > cause the "increase the latency on UX task with filemap_fault()" problem
> > because of high-order direct reclaim, see:
> > https://android-review.googlesource.com/c/kernel/common/+/3692333
> > so EROFS will not set a min-order and will always support order-0 folios.
> >
> > I think EROFS will not use this new approach; the vmap() interface will
> > always be what we rely on.
>
> ... high-order folios can cause side effects on embedded devices
> like routers and IoT devices, which still have MiBs of memory (and I
> believe this won't change due to their use cases) but have also been
> using the Linux kernel for quite a long time. In short, I don't think
> enabling large folios for those devices is very useful, let alone
> limiting the minimum folio order for them (it would make the filesystem
> no longer suitable for those users, which is something I never want to
> do). And I believe this is different from the current LBS support,
> which matches hardware characteristics or the LBS atomic write
> requirement.
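
To make the proposal above concrete, it amounts to something like this
at inode-init time. This is only a rough sketch, assuming the
mapping_set_folio_min_order() helper that the LBS work added to
include/linux/pagemap.h; FS_COMPRESS_BLOCK_SIZE is a made-up constant
standing in for a filesystem's compression granularity:

	#include <linux/fs.h>
	#include <linux/log2.h>
	#include <linux/pagemap.h>
	#include <linux/sizes.h>

	/* made-up constant for illustration: 64KiB compression blocks */
	#define FS_COMPRESS_BLOCK_SIZE	SZ_64K

	/*
	 * Sketch only, not from any real filesystem: pin the minimum
	 * page-cache folio size to the compression block size so that
	 * reads and writebacks always cover whole compressed blocks.
	 */
	static void fs_setup_compressed_inode(struct inode *inode)
	{
		/* e.g. 64KiB blocks with 4KiB pages give order 4 */
		unsigned int min_order =
			ilog2(FS_COMPRESS_BLOCK_SIZE / PAGE_SIZE);

		mapping_set_folio_min_order(inode->i_mapping, min_order);
	}

With the minimum order pinned like this, readahead and writeback
naturally operate on whole compression blocks, which is what removes
the per-filesystem over-allocation code.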
Given the difficulty of allocating large folios, it's always a good idea
to have order-0 as a fallback.

While I agree with your point, I have a slightly different perspective:
enabling large folios for those devices might still be beneficial, but
the maximum order should remain small. I'm referring to "small" large
folios. Even with those, allocation can be difficult, especially since
so many other allocations (which aren't large folios) can cause
fragmentation, so having order-0 as a fallback remains important (see
the sketch below).

It seems we're missing a mechanism to enable "small" large folios for
files. For anon large folios we do have the sysfs knobs, though they
don't seem to be universally appreciated. :-)
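
To illustrate the fallback I mean, something like the order-stepping
allocation below. Again just a sketch under my own assumptions: it uses
the existing filemap_alloc_folio() and mapping_gfp_mask() helpers, but
the function itself is hypothetical and not in the tree:

	#include <linux/gfp.h>
	#include <linux/pagemap.h>

	/*
	 * Hypothetical helper: try a "small" large folio first and step
	 * the order down to 0 when memory is fragmented, so order-0
	 * always remains the fallback.
	 */
	static struct folio *fs_alloc_folio_fallback(struct address_space *mapping,
						     unsigned int max_order)
	{
		gfp_t gfp = mapping_gfp_mask(mapping);
		unsigned int order = max_order;

		for (;;) {
			struct folio *folio;

			/* don't try hard for high orders; only order-0 may block */
			folio = filemap_alloc_folio(order ? gfp | __GFP_NORETRY |
						    __GFP_NOWARN : gfp, order);
			if (folio || !order)
				return folio;
			order--;
		}
	}

Whether such stepping belongs in each filesystem or in the common
readahead code is of course up for discussion.

Thanks
Barry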