On Wed, Apr 23, 2025 at 06:07:33PM +0200, Christoph Hellwig wrote:
> On Wed, Apr 23, 2025 at 06:37:41AM -0400, Kent Overstreet wrote:
> > > It also doesn't support bio chaining or error handling, and
> > > requires a single bio that is guaranteed to fit the required
> > > number of vectors.
> >
> > Why would bio chaining ever be required? The caller allocates both
> > the buf and the bio; I've never seen an instance where you'd want
> > that - just allocate a bio with the correct number of vecs, which
> > your bio_vmalloc_max_vecs() helps with.
>
> If you go beyond 1MB of I/O for vmalloc you need it, because a single
> bio can't hold enough page sized chunks. That is, unless you want to
> use your own allocation for it and call bio_init(), which has various
> other downsides.
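(To spell out where the 1MB figure comes from: vmalloc memory is only
virtually contiguous, so each PAGE_SIZE chunk needs its own bio_vec,
and one bio holds at most BIO_MAX_VECS (256) of them - 1MB with 4k
pages. I'd assume bio_vmalloc_max_vecs() boils down to roughly this;
sketch is mine, not your actual code:

	static unsigned int bio_vmalloc_max_vecs(void *vaddr,
						 unsigned int len)
	{
		/* one single-page bio_vec per PAGE_SIZE chunk */
		return DIV_ROUND_UP(offset_in_page(vaddr) + len,
				    PAGE_SIZE);
	}

Anything where that count exceeds BIO_MAX_VECS can't go in a single
bio, hence the chaining question.)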
Allocating your own bio doesn't allow you to safely exceed the
BIO_MAX_VECS limit - there are places in the I/O path that need to
bounce, and they all use biosets. That may be an issue even for
non-vmalloc bios, unless everything that bounces has been converted to
bounce to a folio of the same order.

> > The "abstract over vmalloc and normal physically contiguous
> > allocations" bit that bch2_bio_map() does is the important part.
> >
> > It's not uncommon to prefer physically contiguous allocations but
> > have a vmalloc fallback; bcachefs does, and xfs does with a clever
> > "try the big allocation if it's cheap, fall back to vmalloc to
> > avoid waiting on compaction" that I might steal.
> >
> > is_vmalloc_addr() is also cheap, it's just a pointer comparison
> > (and it really should be changed to a static inline).
>
> The problem with transparent vmalloc handling is that it's not
> possible. The magic handling for virtually indexed caches can be
> hidden on the submission side, but the completion side also needs to
> call invalidate_kernel_vmap_range for reads. Requiring the caller to
> know they're dealing with vmalloc is a way to at least keep that on
> the radar.

Yeesh, that's a landmine. Having a separate bio_add_vmalloc() as a
hint is still a really bad "solution", unfortunately. And since this
is something we don't have sanitizers or debug code for, and it only
shows up on some archs - that's nasty.

> The other benefit is that by forcing different calls it is much
> easier to pick the optimal number of bvecs (1) for the non-vmalloc
> path, although that is of course also possible without it.

Your bio_vmalloc_max_vecs() could trivially handle both vmalloc and
non-vmalloc addresses.

> For a purely synchronous helper we could handle both, but so far
> I've not seen anything but the xfs log recovery code that needs it,
> and we'd probably get into needing to pass a bio_set to avoid
> deadlock when used deeper in the stack, etc. I can look into that if
> we have more than a single user, but for now it doesn't seem worth
> it.

bcache and bcachefs btree buffers can also be vmalloc backed. Possibly
also the prio_set path in bcache, for reading/writing bucket gens, but
I'd have to check.

> Having a common helper for vmalloc and the kernel direct mapping is
> actually how I started, but then I ran into all the issues with it,
> and ended up with the extremely simple helpers for the direct
> mapping, which are used a lot, and the more complicated version for
> vmalloc, which just has a few users, instead.

*nod* What else did you run into? invalidate_kernel_vmap_range() seems
like the only problematic one, given that is_vmalloc_addr() is cheap.
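(To spell out the landmine for anyone following along - a rough
sketch, not code from either tree; I'm assuming bio_add_vmalloc()
hides the submission-side cache handling as you describe, and that
bio_vmalloc_max_vecs() takes (addr, len):

	static int example_read_sync(struct block_device *bdev,
				     void *buf, unsigned int len,
				     sector_t sector)
	{
		unsigned int nr_vecs = is_vmalloc_addr(buf)
			? bio_vmalloc_max_vecs(buf, len) : 1;
		struct bio *bio = bio_alloc(bdev, nr_vecs, REQ_OP_READ,
					    GFP_KERNEL);
		int ret;

		bio->bi_iter.bi_sector = sector;
		if (is_vmalloc_addr(buf))
			bio_add_vmalloc(bio, buf, len);
		else
			bio_add_page(bio, virt_to_page(buf), len,
				     offset_in_page(buf));

		ret = submit_bio_wait(bio);

		/*
		 * The part that's easy to forget, and that nothing
		 * catches: on virtually indexed caches the READ
		 * completion must invalidate the vmap alias, or the
		 * CPU can keep serving stale cached data.
		 */
		if (is_vmalloc_addr(buf))
			invalidate_kernel_vmap_range(buf, len);

		bio_put(bio);
		return ret;
	}

Everything before submit_bio_wait() can be hidden in helpers; that
last invalidate is the bit the caller has to know about.)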