On 2025-08-01 12:09 p.m., Brian Song wrote:
Hi Bernd,
We are currently working on implementing termination support for fuse-
over-io_uring in QEMU, and right now we are focusing on how to clean up
in-flight SQEs properly. Our main question is about how well the kernel
supports robust cancellation for these fuse-over-io_uring SQEs. Does it
actually implement cancellation beyond destroying the io_uring queue?
In the QEMU FUSE export, we need a way to quickly and cleanly detach from
the event loop and cancel any pending SQEs when an export is no longer
in use. Ideally, we want to avoid the more drastic measure of having to
close the entire /dev/fuse fd just to gracefully terminate outstanding
operations.
We are not sure if there's an existing code path that supports async
cancel for these in-flight SQEs in the fuse-over-io_uring setup, or if
additional callbacks might be needed to fully integrate with the
kernel's async cancel mechanism. We also noticed that libfuse handles
shutdown differently, typically by signaling a thread via eventfd
rather than relying on async cancel.
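For comparison, the generic eventfd shutdown pattern can be sketched as below. The worker's blocking wait includes an eventfd, so writing to that eventfd wakes the worker even when no ring completion will ever arrive. This is a minimal sketch assuming one worker thread. `struct worker`, `worker_main` and `run_and_stop_worker` are hypothetical names, and this is not libfuse's actual code.

```c
#include <poll.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical worker; a real one would also poll its io_uring fd. */
struct worker {
    int stop_efd;    /* eventfd used only to request shutdown */
    int idle_polls;  /* poll timeouts seen before the stop signal arrived */
};

static void *worker_main(void *arg)
{
    struct worker *w = arg;
    struct pollfd pfd = { .fd = w->stop_efd, .events = POLLIN };

    for (;;) {
        int n = poll(&pfd, 1, 10 /* ms */);
        if (n > 0 && (pfd.revents & POLLIN)) {
            uint64_t val;
            ssize_t r = read(w->stop_efd, &val, sizeof(val)); /* consume the signal */
            (void)r;
            break;  /* leave the loop instead of waiting for more work */
        }
        w->idle_polls++;
    }
    return NULL;
}

/* Start the worker, signal it to stop, and wait for it to exit. */
static int run_and_stop_worker(struct worker *w)
{
    pthread_t tid;
    uint64_t one = 1;

    w->idle_polls = 0;
    w->stop_efd = eventfd(0, EFD_CLOEXEC);
    if (w->stop_efd < 0)
        return -1;
    if (pthread_create(&tid, NULL, worker_main, w) != 0) {
        close(w->stop_efd);
        return -1;
    }
    /* Adding 1 to a zero eventfd counter does not block or fail in practice. */
    ssize_t r = write(w->stop_efd, &one, sizeof(one));
    pthread_join(tid, NULL);
    close(w->stop_efd);
    return r == (ssize_t)sizeof(one) ? 0 : -1;
}
```

The eventfd approach needs no kernel-side cancel support: the thread simply stops waiting. It does not, however, complete or reclaim any SQEs that are still outstanding in the kernel.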
Would love to hear your thoughts or suggestions on this!
Thanks,
Brian
I looked into the kernel codebase and came up with some initial ideas,
which might not be entirely accurate:
The IORING_OP_ASYNC_CANCEL operation only acts on io_uring's own request
state, and only for a limited set of request types. It does not clean up
resources that belong to fuse-over-io_uring, such as in-use entries.
IORING_OP_ASYNC_CANCEL
  -> submit/enter
  -> io_uring/opdef.c: .issue = io_async_cancel
  -> __io_async_cancel
  -> io_try_cancel  ==> can only cancel a few types of requests
Currently, full cleanup of both the io_uring and the FUSE data structures
for fuse-over-io_uring happens in only two cases. This works because the
SQEs are marked cancelable on every COMMIT_AND_FETCH (described below):
1. When the FUSE daemon exits (exit syscall)
2. During execve, which triggers the kernel path:
  io_uring_files_cancel
    => io_uring_try_cancel_uring_cmd
    => file->f_op->uring_cmd(cmd, IO_URING_F_CANCEL | IO_URING_F_COMPLETE_DEFER)
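The exit/execve cancel walk can be modeled in userspace as follows. Only commands that were previously marked cancelable are visited, and each one is handed back to the driver's `uring_cmd` callback with a cancel flag. This is a simplified sketch, not kernel code. `MODEL_F_CANCEL` stands in for IO_URING_F_CANCEL, the `cancelable` field stands in for IORING_URING_CMD_CANCELABLE, and all `model_` names are hypothetical. The driver callback here always completes the command, while the real fuse_uring_cancel only does so for idle entries.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_F_CANCEL 0x1u  /* stands in for IO_URING_F_CANCEL */

struct model_cmd {
    bool cancelable;  /* stands in for IORING_URING_CMD_CANCELABLE */
    bool done;        /* completed commands leave the cancelable set */
    int result;       /* CQE result once done */
    void (*uring_cmd)(struct model_cmd *cmd, unsigned issue_flags); /* ~ file->f_op->uring_cmd */
};

/* ~ io_uring_try_cancel_uring_cmd(): visit only pending, cancelable commands. */
static int model_try_cancel(struct model_cmd *cmds, size_t n)
{
    int visited = 0;

    for (size_t i = 0; i < n; i++) {
        if (!cmds[i].cancelable || cmds[i].done)
            continue;  /* never marked, or already completed */
        cmds[i].uring_cmd(&cmds[i], MODEL_F_CANCEL);
        visited++;
    }
    return visited;
}

/* Driver callback: when asked to cancel, complete the command with -ENOTCONN. */
static void model_fuse_uring_cmd(struct model_cmd *cmd, unsigned issue_flags)
{
    if (issue_flags & MODEL_F_CANCEL) {
        cmd->done = true;
        cmd->result = -ENOTCONN;
    }
}
```

The model makes the dependency explicit. A command that was never marked cancelable is invisible to this walk, so its cleanup has to come from somewhere else, such as tearing down the ring itself.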
Below is a state diagram (Mermaid graph) of a fuse_uring entry
(fuse_ring_ent) inside the kernel:
graph TD
    A["Userspace daemon"] --> B["FUSE_IO_URING_CMD_REGISTER<br/>Register buffer"]
    B --> C["Create fuse_ring_ent"]
    C --> D["State: FRRS_AVAILABLE<br/>Added to ent_avail_queue"]
    E["FUSE filesystem operation"] --> F["Generate FUSE request"]
    F --> G["fuse_uring_queue_fuse_req()"]
    G --> H{"Check ent_avail_queue"}
    H -->|Entry available| I["Take entry from queue<br/>Assign to FUSE request"]
    H -->|No entry available| J["Request goes to fuse_req_queue and waits"]
    I --> K["fuse_uring_dispatch_ent()"]
    K --> L["State: FRRS_USERSPACE<br/>Move to ent_in_userspace"]
    L --> M["Notify userspace to process"]
    N["Process exit / daemon termination"] --> O["io_uring_try_cancel_uring_cmd()<br/>NOTE: the entry was marked IORING_URING_CMD_CANCELABLE<br/>in the previous fuse_uring_cmd, so try_cancel_uring_cmd<br/>calls fuse_uring_cmd to 'delete' it"]
    O --> P["fuse_uring_cancel()"]
    P --> Q{"Is entry state AVAILABLE?"}
    Q -->|Yes| R["Equivalent to 'delete': directly change to USERSPACE<br/>Move to ent_in_userspace"]
    Q -->|No| S["Do nothing"]
    R --> T["io_uring_cmd_done(-ENOTCONN)"]
    T --> U["Entry is 'disguised' as completed<br/>Will no longer handle new FUSE requests"]
    V["Practical effects of cancellation:"] --> W["1. Prevent new FUSE requests from using this entry<br/>2. Release io_uring command resources<br/>3. Does not affect already assigned FUSE requests"]
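The diagram's two main paths, dispatch and cancel, can be modeled as a small state machine. This is a simplified userspace sketch, not kernel code. The enum values reuse the kernel's state names for readability, but `struct ring_ent`, `struct queue`, `ent_register`, `queue_req` and `ent_cancel` are hypothetical, and a fixed-size array stands in for fuse_req_queue.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum ent_state { FRRS_AVAILABLE, FRRS_USERSPACE };

struct ring_ent {
    enum ent_state state;
    int req_id;      /* request currently assigned, or -1 */
    int cmd_result;  /* 0 while the uring_cmd is pending, -ENOTCONN once cancelled */
};

#define MAX_PENDING 16
#define REQ_QUEUED  (-1)  /* request parked, waiting for an entry */

struct queue {
    struct ring_ent *ents;
    size_t n_ents;
    int pending[MAX_PENDING];  /* ~ fuse_req_queue */
    size_t n_pending;
};

/* ~ FUSE_IO_URING_CMD_REGISTER: a new entry starts out available. */
static void ent_register(struct ring_ent *ent)
{
    ent->state = FRRS_AVAILABLE;
    ent->req_id = -1;
    ent->cmd_result = 0;
}

/* ~ fuse_uring_queue_fuse_req(): use an available entry or park the request.
 * Returns the entry index, REQ_QUEUED, or -EAGAIN if the model queue is full. */
static int queue_req(struct queue *q, int req_id)
{
    for (size_t i = 0; i < q->n_ents; i++) {
        struct ring_ent *ent = &q->ents[i];
        if (ent->state == FRRS_AVAILABLE) {
            /* ~ fuse_uring_dispatch_ent(): hand the request to userspace */
            ent->req_id = req_id;
            ent->state = FRRS_USERSPACE;
            return (int)i;
        }
    }
    if (q->n_pending == MAX_PENDING)
        return -EAGAIN;
    q->pending[q->n_pending++] = req_id;
    return REQ_QUEUED;
}

/* ~ fuse_uring_cancel(): only idle entries are "deleted". */
static bool ent_cancel(struct ring_ent *ent)
{
    if (ent->state != FRRS_AVAILABLE)
        return false;               /* busy with a request: do nothing */
    ent->state = FRRS_USERSPACE;    /* never picked for new requests again */
    ent->cmd_result = -ENOTCONN;    /* ~ io_uring_cmd_done(-ENOTCONN) */
    return true;
}
```

The model reproduces the diagram's third effect. Cancelling does not touch an entry that is already serving a request, and once every entry is cancelled, new requests can only wait in the pending queue.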
When the kernel is waiting for VFS requests and the corresponding entry
is idle, its state is FRRS_AVAILABLE. Once a request is handed off to
the userspace daemon, the entry's state transitions to FRRS_USERSPACE.
The fuse_uring_cmd function handles the COMMIT_AND_FETCH operation. If
the call carries the IO_URING_F_CANCEL issue flag, fuse_uring_cancel is
invoked instead: if the entry is FRRS_AVAILABLE, it is moved to
FRRS_USERSPACE, which makes it unavailable for future requests from the
VFS.
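The branch on the issue flag can be modeled as below. This is a simplified sketch under two assumptions: `MODEL_F_CANCEL` stands in for IO_URING_F_CANCEL, and COMMIT_AND_FETCH is reduced to "the reply is committed and the entry becomes available again", which ignores the kernel case where a pending request is handed over immediately. All `model_` names are hypothetical.

```c
#include <errno.h>

#define MODEL_F_CANCEL 0x1u  /* stands in for IO_URING_F_CANCEL */

enum ent_state { FRRS_AVAILABLE, FRRS_USERSPACE };

struct ring_ent {
    enum ent_state state;
    int cmd_result;  /* -ENOTCONN once the command is cancelled */
};

/* ~ fuse_uring_cancel(): only an idle entry is taken out of service. */
static void model_cancel(struct ring_ent *ent)
{
    if (ent->state == FRRS_AVAILABLE) {
        ent->state = FRRS_USERSPACE;
        ent->cmd_result = -ENOTCONN;
    }
}

/* ~ COMMIT_AND_FETCH: commit the reply, then wait for the next request. */
static void model_commit_and_fetch(struct ring_ent *ent)
{
    if (ent->state == FRRS_USERSPACE)
        ent->state = FRRS_AVAILABLE;
}

/* ~ fuse_uring_cmd(): the cancel flag selects the cancel path. */
static void model_fuse_uring_cmd(struct ring_ent *ent, unsigned issue_flags)
{
    if (issue_flags & MODEL_F_CANCEL) {
        model_cancel(ent);
        return;
    }
    model_commit_and_fetch(ent);
}
```

In this model, the same callback serves both regular traffic and cleanup, and only the flag distinguishes them.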
If the IORING_URING_CMD_CANCELABLE flag is not yet set, fuse_uring_cmd
first calls fuse_uring_prepare_cancel, before committing and fetching, to
mark the entry's command as IORING_URING_CMD_CANCELABLE. As a result, if
the daemon exits or an execve happens while the entry is waiting in
fetch, the kernel can call io_uring_try_cancel_uring_cmd to safely clean
up these SQEs/CQEs and the related FUSE resources.
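The "mark once before commit_and_fetch" step can be modeled as below. This is a simplified sketch. `MODEL_CMD_CANCELABLE` stands in for IORING_URING_CMD_CANCELABLE (its value here is arbitrary), the `times_registered` counter stands in for adding the command to the ring's cancelable list, and all `model_` names are hypothetical.

```c
#include <stdbool.h>

#define MODEL_CMD_CANCELABLE 0x1u  /* stands in for IORING_URING_CMD_CANCELABLE */

struct model_cmd {
    unsigned flags;
    int times_registered;  /* how often it was added to the cancelable list */
};

/* ~ fuse_uring_prepare_cancel(): mark once, skip if already marked. */
static void model_prepare_cancel(struct model_cmd *cmd)
{
    if (cmd->flags & MODEL_CMD_CANCELABLE)
        return;
    cmd->flags |= MODEL_CMD_CANCELABLE;
    cmd->times_registered++;
}

/* Each COMMIT_AND_FETCH marks the command before doing its real work. */
static void model_commit_and_fetch(struct model_cmd *cmd)
{
    model_prepare_cancel(cmd);
    /* ... commit the reply and wait for the next request ... */
}

/* On exit/execve, can the cancel walk reach this command? */
static bool model_reachable_on_exit(const struct model_cmd *cmd)
{
    return (cmd->flags & MODEL_CMD_CANCELABLE) != 0;
}
```

The flag check keeps the marking idempotent, so repeated COMMIT_AND_FETCH calls do not register the same command twice.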
Back to our original issue: when deleting a FUSE export in QEMU, we hit
a crash caused by an invalid CQE handler. This happened because SQEs we
had submitted earlier had not completed by the time we shut down and
deleted the export.
To avoid this, we need to ensure that no further CQEs are returned and
no CQE handler is triggered once the export is gone. We need to either:
* Prevent any further user operations before calling blk_exp_close_all
or
* Require userspace to trigger a few specific operations that cause
the kernel to return all outstanding CQEs; the daemon can then send an
io_uring_cmd with the IO_URING_F_CANCEL flag to mark all entries as
unavailable (FRRS_USERSPACE), i.e. the "delete" operation, ensuring the
kernel won't assign them to future VFS requests.
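Whichever option is chosen, the crash itself comes from a CQE handler running after the export was freed. A common userspace guard is to count in-flight SQEs and defer the free until the last CQE has been handled. This is a minimal sketch, assuming a single-threaded event loop. `struct fuse_export_model` and the `export_*` functions are hypothetical names, not QEMU's actual FuseExport code. The guard only keeps the handler safe; it does not make the kernel return the outstanding CQEs, which is what the second option addresses.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical export object, not QEMU's FuseExport. */
struct fuse_export_model {
    unsigned in_flight;  /* SQEs submitted whose CQEs have not arrived yet */
    bool closing;        /* set once deletion has started */
};

static int g_freed;  /* demonstration only: counts freed exports */

static void export_free(struct fuse_export_model *exp)
{
    free(exp);
    g_freed++;
}

/* Account for every SQE submitted on behalf of the export. */
static bool export_submit(struct fuse_export_model *exp)
{
    if (exp->closing)
        return false;  /* no new work once deletion has started */
    exp->in_flight++;
    return true;
}

/* CQE handler: the export must stay valid until the last CQE is seen. */
static void export_on_cqe(struct fuse_export_model *exp)
{
    exp->in_flight--;
    if (exp->closing && exp->in_flight == 0)
        export_free(exp);  /* the last completion frees the export */
}

/* Delete path: free now only if nothing is outstanding. */
static void export_close(struct fuse_export_model *exp)
{
    exp->closing = true;
    if (exp->in_flight == 0)
        export_free(exp);
}
```

Usage: call `export_submit` before each submission, `export_on_cqe` from the completion handler, and `export_close` from the delete path. The export is then freed exactly once, by whichever of these runs last.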