On 8/30/23 10:59, Laszlo Ersek wrote:
> On 8/30/23 10:41, Laszlo Ersek wrote:
>> I'm adding Stefan to the CC list, and an additional piece of explanation
>> below:
>>
>> On 8/27/23 20:29, Laszlo Ersek wrote:
>>> (1) The virtio-1.0 specification
>>> <http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html> writes:
>>>
>>>> 3     General Initialization And Device Operation
>>>> 3.1   Device Initialization
>>>> 3.1.1 Driver Requirements: Device Initialization
>>>>
>>>> [...]
>>>>
>>>> 7. Perform device-specific setup, including discovery of virtqueues for
>>>>    the device, optional per-bus setup, reading and possibly writing the
>>>>    device’s virtio configuration space, and population of virtqueues.
>>>>
>>>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
>>>
>>> and
>>>
>>>> 4         Virtio Transport Options
>>>> 4.1       Virtio Over PCI Bus
>>>> 4.1.4     Virtio Structure PCI Capabilities
>>>> 4.1.4.3   Common configuration structure layout
>>>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>>>>
>>>> [...]
>>>>
>>>> The driver MUST configure the other virtqueue fields before enabling the
>>>> virtqueue with queue_enable.
>>>>
>>>> [...]
>>>
>>> These together mean that the following sub-sequence of steps is valid for
>>> a virtio-1.0 guest driver:
>>>
>>> (1.1) set "queue_enable" for the needed queues as the final part of device
>>> initialization step (7),
>>>
>>> (1.2) set DRIVER_OK in step (8),
>>>
>>> (1.3) immediately start sending virtio requests to the device.
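>>>
>>> As a concrete illustration of that sub-sequence, a simplified C sketch
>>> follows; the register helpers and the FUSE_INIT placeholder are
>>> illustrative stand-ins, not OVMF or QEMU code:
>>>
>>>     #include <stdint.h>
>>>     #include <stdio.h>
>>>
>>>     #define STATUS_DRIVER_OK 0x04u  /* DRIVER_OK status bit (virtio 1.0) */
>>>
>>>     /* stand-in for the common-cfg "queue_enable" register write */
>>>     static void write_queue_enable(uint16_t queue, uint16_t val)
>>>     {
>>>         printf("queue_enable[%u] = %u\n", queue, val);
>>>     }
>>>
>>>     /* stand-in for the device status register write */
>>>     static void write_device_status(uint8_t status)
>>>     {
>>>         printf("device status |= 0x%02x\n", status);
>>>     }
>>>
>>>     /* stand-in for queueing a request and kicking the device */
>>>     static void kick_queue(uint16_t queue, const char *request)
>>>     {
>>>         printf("kick: %s on queue %u\n", request, queue);
>>>     }
>>>
>>>     int main(void)
>>>     {
>>>         write_queue_enable(0, 1);               /* step (1.1) */
>>>         write_device_status(STATUS_DRIVER_OK);  /* step (1.2) */
>>>         kick_queue(0, "FUSE_INIT");             /* step (1.3) */
>>>         return 0;
>>>     }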
>>>
>>> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
>>> special virtio feature is negotiated, then virtio rings start in disabled
>>> state, according to
>>> <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
>>> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
>>> enabling vrings.
>>>
>>> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
>>> operation, which travels from the guest through QEMU to the vhost-user
>>> backend, using a unix domain socket.
>>>
>>> Whereas sending a virtio request (1.3) is a *data plane* operation, which
>>> bypasses QEMU -- it travels from guest to the vhost-user backend via
>>> eventfd.
>>>
>>> This means that steps (1.1) and (1.3) travel through different channels,
>>> and their relative order can be reversed, as perceived by the vhost-user
>>> backend.
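>>>
>>> The channel split can be illustrated with a small stand-alone C sketch
>>> (not QEMU code) that uses the same primitives -- a unix domain
>>> socketpair for the control message, an eventfd for the kick; nothing
>>> serializes the two once both are in flight:
>>>
>>>     #include <stdint.h>
>>>     #include <stdio.h>
>>>     #include <sys/eventfd.h>
>>>     #include <sys/socket.h>
>>>     #include <unistd.h>
>>>
>>>     int main(void)
>>>     {
>>>         int ctrl[2];
>>>         uint64_t one = 1;
>>>
>>>         /* control plane channel: unix domain socket (vhost-user) */
>>>         socketpair(AF_UNIX, SOCK_STREAM, 0, ctrl);
>>>         /* data plane doorbell: eventfd (virtio kick) */
>>>         int kick_fd = eventfd(0, 0);
>>>
>>>         /* (1.1) the enable message goes onto the socket... */
>>>         const char msg[] = "SET_VRING_ENABLE";
>>>         write(ctrl[0], msg, sizeof msg);
>>>
>>>         /* (1.3) ...while the kick is a counter bump on the eventfd; a
>>>          * backend servicing the two fds in separate threads may well
>>>          * observe the kick first */
>>>         write(kick_fd, &one, sizeof one);
>>>
>>>         close(ctrl[0]);
>>>         close(ctrl[1]);
>>>         close(kick_fd);
>>>         return 0;
>>>     }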
>>>
>>> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
>>> against the Rust-language virtiofsd version 1.7.2. (Which uses version
>>> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
>>> crate.)
>>>
>>> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
>>> device initialization steps (i.e., control plane operations), and
>>> immediately sends a FUSE_INIT request too (i.e., performs a data plane
>>> operation). In the Rust-language virtiofsd, this creates a race between
>>> two components that run *concurrently*, i.e., in different threads or
>>> processes:
>>>
>>> - Control plane, handling vhost-user protocol messages:
>>>
>>>   The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
>>>   [crates/vhost-user-backend/src/handler.rs] handles
>>>   VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
>>>   flag according to the message processed.
>>>
>>> - Data plane, handling virtio / FUSE requests:
>>>
>>>   The "VringEpollHandler::handle_event" method
>>>   [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
>>>   virtio / FUSE request, consuming the virtio kick at the same time. If
>>>   the vring's "enabled" flag is set, the virtio / FUSE request is
>>>   processed genuinely. If the vring's "enabled" flag is clear, then the
>>>   virtio / FUSE request is discarded.
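>>>
>>> In rough C terms -- a paraphrase of the two components' logic, not the
>>> actual Rust code -- the race-relevant part of the data plane looks like
>>> this:
>>>
>>>     #include <stdatomic.h>
>>>     #include <stdbool.h>
>>>     #include <stdio.h>
>>>
>>>     struct vring {
>>>         atomic_bool enabled;  /* written by the control plane thread on
>>>                                * VHOST_USER_SET_VRING_ENABLE */
>>>     };
>>>
>>>     /* runs on the data plane thread when the kick eventfd fires; the
>>>      * kick is consumed either way */
>>>     static void handle_event(struct vring *vq, const char *request)
>>>     {
>>>         if (!atomic_load(&vq->enabled)) {
>>>             printf("vring disabled -- dropping %s\n", request);
>>>             return;
>>>         }
>>>         printf("processing %s\n", request);
>>>     }
>>>
>>>     int main(void)
>>>     {
>>>         struct vring vq = { .enabled = false };
>>>
>>>         handle_event(&vq, "FUSE_INIT");   /* data plane won the race */
>>>         atomic_store(&vq.enabled, true);  /* control plane, too late */
>>>         handle_event(&vq, "FUSE_INIT");   /* would be processed now, but
>>>                                            * the guest never resends it */
>>>         return 0;
>>>     }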
>>>
>>> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
>>> However, if the data plane processor in virtiofsd wins the race, then it
>>> sees the FUSE_INIT *before* the control plane processor took notice of
>>> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
>>> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
>>> back to waiting for further virtio / FUSE requests with epoll_wait.
>>> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
>>
>> I can explain why this issue has not been triggered by / witnessed with
>> the Linux guest driver for virtiofs ("fs/fuse/virtio_fs.c").
>>
>> That driver registers *two* driver (callback) structures: a virtio
>> driver and a filesystem driver.
>>
>> (1) The virtio driver half initializes the virtio device, and takes a
>> note of the particular virtio filesystem, remembering its "tag". See
>> virtio_fs_probe() -> virtio_device_ready(), and then virtio_fs_probe()
>> -> virtio_fs_add_instance().
>>
>> Importantly, at this time, no FUSE_INIT request is sent.
>>
>> (2) The filesystem driver half has a totally independent entry point.
>> The relevant parts (after the driver registration) are:
>>
>> (a) virtio_fs_get_tree() -> virtio_fs_find_instance(), and
>>
>> (b) if the "tag" was found, virtio_fs_get_tree() ->
>> virtio_fs_fill_super() -> fuse_send_init().
>>
>> Importantly, this occurs when guest userspace (i.e., an interactive
>> user, or a userspace automatism such as systemd) tries to mount a
>> *concrete* virtio filesystem, identified by its tag (such as in "mount
>> -t virtiofs TAG /mount/point").
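>>
>> Condensed into a stand-alone C sketch (not the real fs/fuse/virtio_fs.c
>> code), the two halves look like this:
>>
>>     #include <stdio.h>
>>
>>     /* virtio driver half: runs at device discovery; enables the vrings
>>      * (prompting QEMU to send VHOST_USER_SET_VRING_ENABLE) and only
>>      * records the tag -- no FUSE_INIT yet */
>>     static void probe_sketch(const char *tag)
>>     {
>>         printf("probe: vrings enabled, tag \"%s\" registered\n", tag);
>>     }
>>
>>     /* filesystem driver half: runs only when userspace mounts the tag,
>>      * which may happen seconds or minutes later, or never */
>>     static void mount_sketch(const char *tag)
>>     {
>>         printf("mount -t virtiofs %s: sending FUSE_INIT\n", tag);
>>     }
>>
>>     int main(void)
>>     {
>>         probe_sketch("myfs");
>>         /* ...arbitrarily long gap, decided by guest userspace... */
>>         mount_sketch("myfs");
>>         return 0;
>>     }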
>>
>>
>> This means that there is an *arbitrarily long* delay between (1)
>> VHOST_USER_SET_VRING_ENABLE (which QEMU sends to virtiofsd while the
>> guest is inside virtio_fs_probe()) and (2) FUSE_INIT (which the guest
>> kernel driver sends to virtiofsd while inside virtio_fs_get_tree()).
>>
>> That huge delay is plenty for masking the race.
>>
>> But the race is there nonetheless.
> 
> Furthermore, the race was not seen in the C-language virtiofsd
> implementation (removed in QEMU commit
> e0dc2631ec4ac718ebe22ddea0ab25524eb37b0e) for the following reason:
> 
> The C language virtiofsd *did not care* about
> VHOST_USER_SET_VRING_ENABLE at all:
> 
> - Upon VHOST_USER_GET_VRING_BASE, vu_get_vring_base_exec() in
> libvhost-user would call fv_queue_set_started() in virtiofsd, and the
> latter would start the data plane thread fv_queue_thread().

Sorry, not on VHOST_USER_GET_VRING_BASE, but on
VHOST_USER_SET_VRING_KICK / VHOST_USER_VRING_KICK. But that doesn't
change the rest of the argument, namely that VHOST_USER_SET_VRING_ENABLE
had no effect whatsoever on the C-language virtiofsd.

Laszlo

> 
> - Upon VHOST_USER_SET_VRING_ENABLE, vu_set_vring_enable_exec() in
> libvhost-user would set the "enable" field, but not call back into
> virtiofsd. And virtiofsd ("tools/virtiofsd/fuse_virtio.c") nowhere
> checks the "enable" field.
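>
> Paraphrased as a tiny C sketch (not the actual libvhost-user / virtiofsd
> code), the old behavior amounted to:
>
>     #include <stdbool.h>
>     #include <stdio.h>
>
>     static bool vring_enable;  /* recorded, in the spirit of
>                                 * vu_set_vring_enable_exec() */
>
>     /* SET_VRING_ENABLE handler: stores the flag and does nothing else */
>     static void set_vring_enable(bool enable)
>     {
>         vring_enable = enable;  /* never consulted afterwards */
>     }
>
>     /* queue thread, started on SET_VRING_KICK: processes requests
>      * without ever looking at the flag */
>     static void queue_thread_iteration(const char *request)
>     {
>         printf("processing %s (enable flag ignored: %d)\n",
>                request, vring_enable);
>     }
>
>     int main(void)
>     {
>         queue_thread_iteration("FUSE_INIT");  /* works even when the ring
>                                                * was never enabled */
>         set_vring_enable(true);
>         return 0;
>     }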
> 
> In summary, the C-language virtiofsd didn't implement queue enablement
> in a conformant way. The Rust-language version does, but that exposes a
> race in how QEMU sends VHOST_USER_SET_VRING_ENABLE. The race is
> triggered by the OVMF guest driver, and not triggered by the Linux guest
> driver (since the latter introduces an unbounded delay between vring
> enablement and FUSE_INIT submission).
> 
> Laszlo
> 
>>
>>
>> Also note that this race does not exist for vhost-net. For vhost-net,
>> AIUI, such queue operations are handled with ioctl()s, and ioctl()s are
>> synchronous by nature. Cf.
>> <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#vhost-user-protocol-f-reply-ack>:
>>
>> "The original vhost-user specification only demands replies for certain
>> commands. This differs from the vhost protocol implementation where
>> commands are sent over an ioctl() call and block until the back-end has
>> completed."
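>>
>> To make "synchronous by nature" concrete: a kernel vhost backend is
>> driven with calls like the sketch below (assuming access to
>> /dev/vhost-net; error handling omitted), and each ioctl() only returns
>> once the kernel has acted on it:
>>
>>     #include <fcntl.h>
>>     #include <linux/vhost.h>
>>     #include <sys/ioctl.h>
>>     #include <unistd.h>
>>
>>     int main(void)
>>     {
>>         int fd = open("/dev/vhost-net", O_RDWR);
>>         if (fd < 0) {
>>             return 1;
>>         }
>>
>>         struct vhost_vring_state state = { .index = 0, .num = 0 };
>>
>>         ioctl(fd, VHOST_SET_OWNER, NULL);
>>         /* this does not return until the kernel backend has processed
>>          * the request -- there is no window for the data plane to race
>>          * ahead of it, unlike with an asynchronous vhost-user message */
>>         ioctl(fd, VHOST_SET_VRING_BASE, &state);
>>
>>         close(fd);
>>         return 0;
>>     }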
>>
>> Laszlo
>>
>>>
>>> The deadlock is not deterministic. OVMF hangs infrequently during first
>>> boot. However, OVMF almost always hangs during reboots from the UEFI
>>> shell.
>>>
>>> The race can be "reliably masked" by inserting a very small delay -- a
>>> single debug message -- at the top of "VringEpollHandler::handle_event",
>>> i.e., just before the data plane processor checks the "enabled" field of
>>> the vring. That delay suffices for the control plane processor to act upon
>>> VHOST_USER_SET_VRING_ENABLE.
>>>
>>> We can deterministically prevent the race in QEMU, by blocking OVMF inside
>>> step (1.1) -- i.e., in the write to the "queue_enable" register -- until
>>> VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
>>> cannot advance to the FUSE_INIT submission before virtiofsd's control
>>> plane processor takes notice of the queue being enabled.
>>>
>>> Wait for VHOST_USER_SET_VRING_ENABLE completion by:
>>>
>>> - setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
>>>   for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
>>>   has been negotiated, or
>>>
>>> - performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires
>>>   a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
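>>>
>>> A rough sketch of the waiting logic (all names below are placeholders,
>>> not QEMU's internal API, and the request codes are illustrative only):
>>>
>>>     #include <stdbool.h>
>>>     #include <stdio.h>
>>>
>>>     enum {  /* illustrative request identifiers */
>>>         REQ_GET_FEATURES = 1,
>>>         REQ_SET_VRING_ENABLE = 2,
>>>     };
>>>
>>>     static int send_msg(int req, bool need_reply)
>>>     {
>>>         printf("send request %d%s\n", req,
>>>                need_reply ? " (NEED_REPLY)" : "");
>>>         return 0;
>>>     }
>>>
>>>     static int wait_for_reply(int req)
>>>     {
>>>         printf("blocked until the backend answered request %d\n", req);
>>>         return 0;
>>>     }
>>>
>>>     static int set_vring_enable_and_wait(bool reply_ack_negotiated)
>>>     {
>>>         if (reply_ack_negotiated) {
>>>             /* REPLY_ACK negotiated: set NEED_REPLY and wait for the ack */
>>>             if (send_msg(REQ_SET_VRING_ENABLE, true) < 0) {
>>>                 return -1;
>>>             }
>>>             return wait_for_reply(REQ_SET_VRING_ENABLE);
>>>         }
>>>         /* otherwise flush with a message that always gets a response,
>>>          * so the backend must have processed the enable before it */
>>>         if (send_msg(REQ_SET_VRING_ENABLE, false) < 0) {
>>>             return -1;
>>>         }
>>>         if (send_msg(REQ_GET_FEATURES, false) < 0) {
>>>             return -1;
>>>         }
>>>         return wait_for_reply(REQ_GET_FEATURES);
>>>     }
>>>
>>>     int main(void)
>>>     {
>>>         return set_vring_enable_and_wait(true);
>>>     }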
>>>
>>> Cc: "Michael S. Tsirkin" <m...@redhat.com> (supporter:vhost)
>>> Cc: Eugenio Perez Martin <epere...@redhat.com>
>>> Cc: German Maglione <gmagli...@redhat.com>
>>> Cc: Liu Jiang <ge...@linux.alibaba.com>
>>> Cc: Sergio Lopez Pascual <s...@redhat.com>
>>> Cc: Stefano Garzarella <sgarz...@redhat.com>
>>> Signed-off-by: Laszlo Ersek <ler...@redhat.com>
>>> ---
>>>  hw/virtio/vhost-user.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
>>> index beb4b832245e..01e0ca90c538 100644
>>> --- a/hw/virtio/vhost-user.c
>>> +++ b/hw/virtio/vhost-user.c
>>> @@ -1235,7 +1235,7 @@ static int vhost_user_set_vring_enable(struct vhost_dev *dev, int enable)
>>>              .num   = enable,
>>>          };
>>>  
>>> -        ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, false);
>>> +        ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, true);
>>>          if (ret < 0) {
>>>              /*
>>>               * Restoring the previous state is likely infeasible, as well as
>>
> 

