On 31.10.25 13:33, Kevin Wolf wrote:
Am 31.10.2025 um 13:13 hat Hanna Czenczek geschrieben:
On 23.10.25 17:11, Kevin Wolf wrote:
Am 01.07.2025 um 13:44 hat Hanna Czenczek geschrieben:
We probably want to support larger write sizes than just 4k; 64k seems
nice.  However, we cannot read partial requests from the FUSE FD, we
always have to read requests in full; so our read buffer must be large
enough to accommodate potential 64k writes if we want to support that.

Always allocating FuseRequest objects with 64k buffers in them seems
wasteful, though.  But we can get around the issue by splitting the
buffer into two and using readv(): One part will hold all normal (up to
4k) write requests and all other requests, and a second part (the
"spill-over buffer") will be used only for larger write requests.  Each
FuseQueue has its own spill-over buffer, and only if we find it used
when reading a request will we move its ownership into the FuseRequest
object and allocate a new spill-over buffer for the queue.

This way, we get to support "large" write sizes without having to
allocate big buffers when they aren't used.

Also, this even reduces the size of the FuseRequest objects because the
read buffer has to have at least FUSE_MIN_READ_BUFFER (8192) bytes; but
the requests we support are not quite so large (except for >4k writes),
so until now, we basically had to have useless padding in there.

With the spill-over buffer added, the FUSE_MIN_READ_BUFFER requirement
is easily met and we can decrease the size of the buffer portion that is
right inside of FuseRequest.

As for benchmarks, the benefit of this patch can be shown easily by
writing a 4G image (with qemu-img convert) to a FUSE export:
- Before this patch: Takes 25.6 s (14.4 s with -t none)
- After this patch: Takes 4.5 s (5.5 s with -t none)

Reviewed-by: Stefan Hajnoczi <[email protected]>
Signed-off-by: Hanna Czenczek <[email protected]>
The commit message seems outdated, there is no such thing as a
FuseRequest object.

I agree with the idea of allocating a separate buffer for the data to be
written. I'm not so sure that the approach taken here with combining an
in-place and a spillover buffer does actually do much for us in exchange
for the additional complexity.

The allocation + memcpy for in_place buf in fuse_co_write() bothers me a
bit. I'd rather have a buffer for the data to write that can be directly
used. And let's be real, we already allocate a 1 MB stack per request. I
don't think 64k more or less make a big difference, but it would allow
us to save the memcpy() for 4k requests and additionally an allocation
for larger requests.

The tradeoff when you use an iov for the buffer in FuseQueue that is
only big enough for the header and fuse_write_in and then directly the
per-request buffer that is owned by the coroutine is that for requests
that are larger than fuse_write_in, you'll have to copy the rest back
from the data buffer first. This seems to be only fuse_setattr_in, which
shouldn't be a hot path at all, and only a few bytes.
So I understand that first, you disagree with “Always allocating FuseRequest
objects with 64k buffers in them seems wasteful, though.” I.e. to just use a
64k buffer per request.  OK, fair.
I think in practice most write requests will exceed the 4k anyway, so
we'd already use the spillover buffer. Maybe the most common exception
is fio, if we want to optimise for that one. :-)

Second, you suggest to improve performance by having an aligned 64k data
buffer separate from the request metadata buffer to save on memcpy().  I did
consider this, but discarded it because of I was afraid of the complexity.
Actually probably too afraid.

I’ll take a look, it actually can’t be too hard.  (Just have two buffers;
make the metadata buffer long enough to capture all request headers, but
only pass the sizeof(fuse_write_in)-long head into readv(), then check the
request opcode.  If metadata spilled over to the data buffer, copy it back
into the “shadowed” metadata buffer tail.)
Yes, that's what I had in mind. Not sure if it's premature
optimisation...

Just to confirm – this means we will have to do an allocation for each write request.  I think this should still be cheaper than the current memcpy(), but still wanted to ask back whether you agree.

(We could have a pool of buffers, but that would most definitely be a premature optimization – and I hope our memalloc algorithm can do that internally just fine.)

Hanna


Reply via email to