On 9/9/25 19:46, Brian Song wrote:
>
>
> On 9/9/25 10:48 AM, Stefan Hajnoczi wrote:
>> On Wed, Sep 03, 2025 at 02:00:55PM -0400, Brian Song wrote:
>>>
>>>
>>> On 9/3/25 6:53 AM, Stefan Hajnoczi wrote:
>>>> On Fri, Aug 29, 2025 at 10:50:22PM -0400, Brian Song wrote:
>>>>> This patch adds a new export option for storage-export-daemon to enable
>>>>> FUSE-over-io_uring via the switch io-uring=on|off (disabled by default).
>>>>> It also implements the protocol handshake with the Linux kernel
>>>>> during the FUSE-over-io_uring initialization phase.
>>>>>
>>>>> See: https://docs.kernel.org/filesystems/fuse-io-uring.html
>>>>>
>>>>> The kernel documentation describes in detail how FUSE-over-io_uring
>>>>> works. This patch implements the Initial SQE stage shown in the diagram:
>>>>> it initializes one queue per IOThread, each currently supporting a
>>>>> single submission queue entry (SQE). When the FUSE driver sends the
>>>>> first FUSE request (FUSE_INIT), storage-export-daemon calls
>>>>> fuse_uring_start() to complete initialization, ultimately submitting
>>>>> the SQE with the FUSE_IO_URING_CMD_REGISTER command to confirm
>>>>> successful initialization with the kernel.
>>>>>
>>>>> We also added support for multiple IOThreads. The current Linux kernel
>>>>> requires registering $(nproc) queues when setting up FUSE-over-io_uring.
>>>>> To let users customize the number of FUSE Queues (i.e., IOThreads),
>>>>> we first create nproc Ring Queues as required by the kernel, then
>>>>> distribute them in a round-robin manner to the FUSE Queues for
>>>>> registration. In addition, to support multiple in-flight requests,
>>>>> we configure each Ring Queue with FUSE_DEFAULT_RING_QUEUE_DEPTH
>>>>> entries/requests.
>>>>
>>>> The previous paragraph says "each currently supporting a single
>>>> submission queue entry (SQE)" whereas this paragraph says "we configure
>>>> each Ring Queue with FUSE_DEFAULT_RING_QUEUE_DEPTH entries/requests".
>>>> Maybe this paragraph was squashed into the commit description in a later
>>>> step and the previous paragraph can be updated to reflect that multiple
>>>> SQEs are submitted?
>>>>
>>>>>
>>>>> Suggested-by: Kevin Wolf <kw...@redhat.com>
>>>>> Suggested-by: Stefan Hajnoczi <stefa...@redhat.com>
>>>>> Signed-off-by: Brian Song <hibrians...@gmail.com>
>>>>> ---
>>>>> block/export/fuse.c | 310 +++++++++++++++++++++++++--
>>>>> docs/tools/qemu-storage-daemon.rst | 11 +-
>>>>> qapi/block-export.json | 5 +-
>>>>> storage-daemon/qemu-storage-daemon.c | 1 +
>>>>> util/fdmon-io_uring.c | 5 +-
>>>>> 5 files changed, 309 insertions(+), 23 deletions(-)
>>>>>
>>>>> diff --git a/block/export/fuse.c b/block/export/fuse.c
>>>>> index c0ad4696ce..19bf9e5f74 100644
>>>>> --- a/block/export/fuse.c
>>>>> +++ b/block/export/fuse.c
>>>>> @@ -48,6 +48,9 @@
>>>>> #include <linux/fs.h>
>>>>> #endif
>>>>> +/* room needed in buffer to accommodate header */
>>>>> +#define FUSE_BUFFER_HEADER_SIZE 0x1000
>>>>
>>>> Is it possible to write this in a way that shows how the constant is
>>>> calculated? That way the constant would automatically adjust on systems
>>>> where the underlying assumptions have changed (e.g. page size, header
>>>> struct size). This approach is also self-documenting so it's possible to
>>>> understand where the magic number comes from.
>>>>
>>>> For example:
>>>>
>>>> #define FUSE_BUFFER_HEADER_SIZE DIV_ROUND_UP(sizeof(struct
>>>> fuse_uring_req_header), qemu_real_host_page_size())
>>>>
>>>> (I'm guessing what the formula you used is, so this example may be
>>>> incorrect...)
>>>>
>>>
>>> In libfuse, the way to calculate the bufsize (for req_payload) is the same
>>> as in this patch. For different requests, the request header sizes are not
>>> the same, but they should never exceed a certain value. So is that why
>>> libfuse has this kind of magic number?
>>
>> From <linux/fuse.h>:
>>
>> #define FUSE_URING_IN_OUT_HEADER_SZ 128
>> #define FUSE_URING_OP_IN_OUT_SZ 128
>> ...
>> struct fuse_uring_req_header {
>> /* struct fuse_in_header / struct fuse_out_header */
>> char in_out[FUSE_URING_IN_OUT_HEADER_SZ];
>>
>> /* per op code header */
>> char op_in[FUSE_URING_OP_IN_OUT_SZ];
>>
>> struct fuse_uring_ent_in_out ring_ent_in_out;
>> };
>>
>> The size of struct fuse_uring_req_header is 128 + 128 + (4 * 8) = 288
>> bytes. A single 4 KB page easily fits this. I guess that's why 0x1000
>> was chosen in libfuse.
>>
>
> Yes, the two iovecs in the ring entry: one refers to the general request
> header (fuse_uring_req_header) and the other refers to the payload. The
> variable bufsize represents the space for these two objects and is used
> to calculate the payload size in case max_write changes.
>
> Alright, let me document the buffer usage. It's been a while since I
> started this, so I don’t fully remember how the buffer works here.

With the current kernel code we could make this a 288-byte allocation for
the header. That just does not work with page pinning, which we are using
at DDN (kernel patches not upstreamed yet).
Maybe I should make the header allocation size depend on whether page
pinning is in use. There is a bit of overhead with 4K headers, although 4K
doesn't sound too bad, even with many queues.
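
The trade-off between a 288-byte and a page-sized header allocation can be
sketched roughly as below. This is only an illustration, not code from
libfuse or QEMU: all sketch_* names are hypothetical, the struct is a local
mirror of the layout quoted above from <linux/fuse.h> (the field names of
the trailing 32-byte struct are assumed, only its size matters here), and
the page_pinning flag stands in for whatever condition would select the
page-sized allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the sizes quoted from <linux/fuse.h> (names hypothetical). */
#define SKETCH_IN_OUT_HEADER_SZ 128
#define SKETCH_OP_IN_OUT_SZ     128

/* Stand-in for struct fuse_uring_ent_in_out: 4 * 8 = 32 bytes. */
struct sketch_ent_in_out {
    uint64_t flags;
    uint64_t commit_id;
    uint32_t payload_sz;
    uint32_t padding;
    uint64_t reserved;
};

/* Stand-in for struct fuse_uring_req_header. */
struct sketch_req_header {
    char in_out[SKETCH_IN_OUT_HEADER_SZ]; /* fuse_in_header / fuse_out_header */
    char op_in[SKETCH_OP_IN_OUT_SZ];      /* per-opcode header */
    struct sketch_ent_in_out ring_ent_in_out;
};

/* 128 + 128 + 32 = 288, as computed earlier in the thread. */
static_assert(sizeof(struct sketch_req_header) == 288,
              "header layout no longer matches the 288-byte assumption");

/*
 * Bytes to allocate for one header: the exact struct size when pages are
 * not pinned, otherwise rounded up to a whole page (assumption: pinning
 * wants page-granular buffers).
 */
static size_t sketch_header_alloc_size(int page_pinning, size_t page_size)
{
    size_t sz = sizeof(struct sketch_req_header);

    if (!page_pinning) {
        return sz;
    }
    return (sz + page_size - 1) / page_size * page_size;
}

/* Total header memory for nr_queues queues with depth entries each. */
static size_t sketch_total_header_bytes(size_t nr_queues, size_t depth,
                                        int page_pinning, size_t page_size)
{
    return nr_queues * depth *
           sketch_header_alloc_size(page_pinning, page_size);
}
```

For example, with 64 queues of depth 32 and 4 KiB pages,
sketch_total_header_bytes() gives 8 MiB for page-sized headers versus
576 KiB for 288-byte headers.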
Thanks,
Bernd