On 2025/7/16 10:46, Gao Xiang wrote:
...
There's some discrepancy between filesystems as to whether you need
scratch space for decompression. Some filesystems read the compressed
data into the page cache and decompress in-place, while others read the
compressed data into scratch pages and decompress into the page cache.
Btrfs goes the scratch-pages way. Decompression in-place looks a
little tricky to me: e.g. what if there is only one compressed page,
and it decompresses to 4 pages?
Decompression in-place mainly optimizes full decompression (so that CPU
cache lines won't be polluted by temporary buffers either); in fact,
EROFS supports a hybrid way.
Won't the plaintext overwrite the compressed data halfway?
Personally I'm very familiar with the internals of the LZ4, LZMA, and
DEFLATE algorithms, and I also have experience building LZMA and
DEFLATE compressors.
It's totally workable for LZ4: in short, the compressed data is read
into the end of the decompressed buffer, and a proper safety margin
makes this almost always succeed.
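
As a concrete illustration, here is a minimal userspace sketch of that
scheme using liblz4's documented in-place margin macros (lz4.h, v1.9.0+):
the compressed data is loaded at the tail of a slightly oversized output
buffer, and decompression writes forward from the front, never catching
up with the unread compressed bytes.

	#define LZ4_STATIC_LINKING_ONLY	/* exposes the in-place helper macros */
	#include <lz4.h>
	#include <stdlib.h>
	#include <string.h>

	/*
	 * Decompress csize bytes of LZ4 data into its dsize-byte plaintext
	 * using a single buffer that is only the in-place margin larger
	 * than the plaintext itself.
	 */
	static char *lz4_decompress_in_place(const char *csrc, int csize,
					     int dsize)
	{
		size_t bufsize = LZ4_DECOMPRESS_INPLACE_BUFFER_SIZE(dsize);
		char *buf = malloc(bufsize);
		char *cdata;

		if (!buf)
			return NULL;

		/* Load the compressed data at the tail of the buffer... */
		cdata = buf + bufsize - csize;
		memcpy(cdata, csrc, csize);

		/*
		 * ...and decompress towards it: the margin guarantees the
		 * write pointer never overtakes the not-yet-consumed
		 * compressed bytes.
		 */
		if (LZ4_decompress_safe(cdata, buf, csize, dsize) != dsize) {
			free(buf);
			return NULL;
		}
		return buf;	/* first dsize bytes are the plaintext */
	}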
I guess that's why btrfs cannot go that way. Due to data COW, it's
entirely possible to hit a case where we only want to read out one
single plaintext block from a compressed data extent (whose compressed
size can even be larger than one block), e.g. one 4K block out of a
128K extent that compressed down to 16K.
In that case such in-place decompression will definitely not work.
[...]
All the decompression/compression routines support swapping the input/
output buffer when one of them is full, so kmap_local() is completely
feasible.
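
As a userspace sketch of that property with zlib: the inflate() loop
below hands the decompressor one page-sized output buffer at a time and
swaps it whenever it fills, which is exactly what lets a kernel
implementation kmap_local() one page at a time (emit() is a hypothetical
callback standing in for "unmap the full page, map the next one").

	#include <zlib.h>

	#define CHUNK 4096	/* stand-in for one kmap_local()'d page */

	static int inflate_by_pages(const unsigned char *src, uInt srclen,
				    void (*emit)(const unsigned char *buf,
						 unsigned long len))
	{
		unsigned char page[CHUNK];
		z_stream zs = { 0 };
		int ret;

		ret = inflateInit(&zs);
		if (ret != Z_OK)
			return ret;

		zs.next_in  = (z_const Bytef *)src;
		zs.avail_in = srclen;
		do {
			/* Swap in a fresh output buffer... */
			zs.next_out  = page;
			zs.avail_out = CHUNK;
			ret = inflate(&zs, Z_NO_FLUSH);
			if (ret != Z_OK && ret != Z_STREAM_END)
				break;
			/* ...and retire it once zlib has filled it. */
			emit(page, CHUNK - zs.avail_out);
		} while (ret != Z_STREAM_END);

		inflateEnd(&zs);
		return ret == Z_STREAM_END ? Z_OK : ret;
	}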
I think one of the btrfs-supported algorithms, LZO, is not, because the
fastest LZ77-family algorithms like LZ4 and LZO just operate on virtually
consecutive buffers and treat the decompressed buffer as the LZ77 sliding
window. So either you need to allocate another temporary consecutive
buffer (I believe that is what btrfs does) or use the vmap() approach;
EROFS is interested in the vmap() one.

Thanks,
Gao Xiang
It is; the tricky part is that btrfs implements its own TLV structure
for LZO compression, and it does extra padding to ensure no TLV
(compressed data + header) structure crosses a block boundary.
So btrfs LZO compression is still able to swap input/output buffers
halfway, mostly due to this btrfs-specific design.
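
For illustration, here is a minimal userspace sketch of walking such a
padded TLV stream; the field layout (an LE32 total length, then LE32
per-segment lengths) is assumed for the sketch and is not necessarily
the exact btrfs on-disk format.

	#include <stdint.h>
	#include <stdio.h>

	#define BLOCK_SIZE 4096u
	#define HDR_LEN    4u	/* assumed LE32 segment header */

	static uint32_t read_le32(const uint8_t *p)
	{
		return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
		       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
	}

	/*
	 * Walk the stream: the writer pads so a segment header never
	 * straddles a block boundary, so the reader skips ahead whenever
	 * the room left in the current block cannot hold a header.
	 */
	static void walk_segments(const uint8_t *stream)
	{
		uint32_t total = read_le32(stream);
		uint32_t off = HDR_LEN;

		while (off < total) {
			uint32_t room = BLOCK_SIZE - (off % BLOCK_SIZE);
			uint32_t seg_len;

			if (room < HDR_LEN)
				off += room;	/* skip the pad bytes */

			seg_len = read_le32(stream + off);
			printf("segment at %u: %u compressed bytes\n",
			       off + HDR_LEN, seg_len);
			/* LZO-decompress stream + off + HDR_LEN here. */
			off += HDR_LEN + seg_len;
		}
	}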
Thanks,
Qu
So, my proposal is that filesystems tell the page cache that their
minimum
folio size is the compression block size. That seems to be around 64k,
so not an unreasonable minimum allocation size. That removes all the
extra code in filesystems to allocate extra memory in the page cache.
It means we don't attempt to track dirtiness at a sub-folio granularity
(there's no point, we have to write back the entire compressed block
at once). We also get a single virtually contiguous block ... if you're
willing to ditch HIGHMEM support. Or there's a proposal to introduce a
vmap_file() which would give us a virtually contiguous chunk of memory
(and could be trivially turned into a noop for the case of trying to
vmap a single large folio).
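
For what it's worth, a sketch of the first half of that proposal using
the minimum-folio-order API that went in with the block-size >
page-size work (helper names are version-dependent; this assumes 4k
base pages):

	#include <linux/pagemap.h>
	#include <linux/sizes.h>
	#include <linux/mm.h>

	/*
	 * Pin this inode's page cache to >= 64k folios so one compression
	 * block is always a single (virtually contiguous) folio.
	 */
	static void fs_set_compression_folio_order(struct address_space *mapping)
	{
		mapping_set_folio_min_order(mapping, get_order(SZ_64K));
	}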