Re: [PATCH 1/2] vhost-user: fix lost reconnect

Li Feng Thu, 24 Aug 2023 00:31:28 -0700

> On 22 Aug 2023, at 6:17 PM, Raphael Norwitz <raphael.norw...@nutanix.com> 
> wrote:
> 
> 
> 
>> On Aug 22, 2023, at 12:49 AM, Li Feng <fen...@smartx.com> wrote:
>> 
>> 
>> 
>>> On 22 Aug 2023, at 8:38 AM, Raphael Norwitz <raphael.norw...@nutanix.com> 
>>> wrote:
>>> 
>>>> 
>>>> On Aug 17, 2023, at 2:40 AM, Li Feng <fen...@smartx.com> wrote:
>>>> 
>>>> 
>>>>> 2023年8月14日 下午8:11，Raphael Norwitz <raphael.norw...@nutanix.com> 写道：
>>>>> 
>>>>> Why can’t we rather fix this by adding a “event_cb” param to 
>>>>> vhost_user_async_close and then call qemu_chr_fe_set_handlers in 
>>>>> vhost_user_async_close_bh()?
>>>>> 
>>>>> Even if calling vhost_dev_cleanup() twice is safe today I worry future 
>>>>> changes may easily stumble over the reconnect case and introduce crashes 
>>>>> or double frees.
>>>>> 
>>>> I think add a new event_cb is not good enough. ‘qemu_chr_fe_set_handlers’ 
>>>> has been called in vhost_user_async_close, and will be called in 
>>>> event->cb, so why need add a
>>>> new event_cb?
>>>> 
>>> 
>>> I’m suggesting calling the data->event_cb instead of the data->cb if we hit 
>>> the error case where vhost->vdev is NULL. Something like:
>>> 
>>> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
>>> index 8dcf049d42..edf1dccd44 100644
>>> --- a/hw/virtio/vhost-user.c
>>> +++ b/hw/virtio/vhost-user.c
>>> @@ -2648,6 +2648,10 @@ typedef struct {
>>> static void vhost_user_async_close_bh(void *opaque)
>>> {
>>>    VhostAsyncCallback *data = opaque;
>>> +
>>> +    VirtIODevice *vdev = VIRTIO_DEVICE(data->dev);
>>> +    VHostUserBlk *s = VHOST_USER_BLK(vdev);
>>> +
>>>    struct vhost_dev *vhost = data->vhost;
>>> 
>>>    /*
>>> @@ -2657,6 +2661,9 @@ static void vhost_user_async_close_bh(void *opaque)
>>>     */
>>>    if (vhost->vdev) {
>>>        data->cb(data->dev);
>>> +    } else if (data->event_cb) {
>>> +        qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, data->event_cb,
>>> +                                 NULL, data->dev, NULL, true);
>>>    }
>>> 
>>>    g_free(data);
>>> 
>>> data->event_cb would be vhost_user_blk_event().
>>> 
>>> I think that makes the error path a lot easier to reason about and more 
>>> future proof.
>>> 
>>>> For avoiding to call the vhost_dev_cleanup() twice, add a ‘inited’ in 
>>>> struct vhost-dev to mark if it’s inited like this:
>>>> 
>>> 
>>> This is better than the original, but let me know what you think of my 
>>> alternative.
>> 
>> The vhost_user_async_close_bh() is a common function in vhost-user.c, and 
>> vhost_user_async_close() is used by vhost-user-scsi/blk/gpio, 
>> However, in your patch it’s limited to VhostUserBlk, so I think my fix is 
>> more reasonable.
> 
> I did not write out the full patch. 
> 
> vhost-user-scsi/blk/gpio would each provide their own ->event_cb, just like 
> they each provide a different ->cb today. Looking at it again, data has the 
> chardev, so no need to use VIRTO_DEVICE() or VHOST_USER_BLK().
> 
> The fix generalizes to all device types.
OK, I will change it in V2.
> 
>> 
>> Thanks,
>> LI
>>> 
>>>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
>>>> index e2f6ffb446..edc80c0231 100644
>>>> --- a/hw/virtio/vhost.c
>>>> +++ b/hw/virtio/vhost.c
>>>> @@ -1502,6 +1502,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void 
>>>> *opaque,
>>>>       goto fail_busyloop;
>>>>   }
>>>> 
>>>> +    hdev->inited = true;
>>>>   return 0;
>>>> 
>>>> fail_busyloop:
>>>> @@ -1520,6 +1521,10 @@ void vhost_dev_cleanup(struct vhost_dev *hdev)
>>>> {
>>>>   int i;
>>>> 
>>>> +    if (!hdev->inited) {
>>>> +        return;
>>>> +    }
>>>> +    hdev->inited = false;
>>>>   trace_vhost_dev_cleanup(hdev);
>>>> 
>>>>   for (i = 0; i < hdev->nvqs; ++i) {
>>>> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
>>>> index ca3131b1af..74b1aec960 100644
>>>> --- a/include/hw/virtio/vhost.h
>>>> +++ b/include/hw/virtio/vhost.h
>>>> @@ -123,6 +123,7 @@ struct vhost_dev {
>>>>   /* @started: is the vhost device started? */
>>>>   bool started;
>>>>   bool log_enabled;
>>>> +    bool inited;
>>>>   uint64_t log_size;
>>>>   Error *migration_blocker;
>>>>   const VhostOps *vhost_ops;
>>>> 
>>>> Thanks.
>>>> 
>>>>> 
>>>>>> On Aug 4, 2023, at 1:29 AM, Li Feng <fen...@smartx.com> wrote:
>>>>>> 
>>>>>> When the vhost-user is reconnecting to the backend, and if the 
>>>>>> vhost-user fails
>>>>>> at the get_features in vhost_dev_init(), then the reconnect will fail
>>>>>> and it will not be retriggered forever.
>>>>>> 
>>>>>> The reason is:
>>>>>> When the vhost-user fail at get_features, the vhost_dev_cleanup will be 
>>>>>> called
>>>>>> immediately.
>>>>>> 
>>>>>> vhost_dev_cleanup calls 'memset(hdev, 0, sizeof(struct vhost_dev))'.
>>>>>> 
>>>>>> The reconnect path is:
>>>>>> vhost_user_blk_event
>>>>>> vhost_user_async_close(.. vhost_user_blk_disconnect ..)
>>>>>> qemu_chr_fe_set_handlers <----- clear the notifier callback
>>>>>>   schedule vhost_user_async_close_bh
>>>>>> 
>>>>>> The vhost->vdev is null, so the vhost_user_blk_disconnect will not be
>>>>>> called, then the event fd callback will not be reinstalled.
>>>>>> 
>>>>>> With this patch, the vhost_user_blk_disconnect will call the
>>>>>> vhost_dev_cleanup() again, it's safe.
>>>>>> 
>>>>>> All vhost-user devices have this issue, including vhost-user-blk/scsi.
>>>>>> 
>>>>>> Fixes: 71e076a07d ("hw/virtio: generalise CHR_EVENT_CLOSED handling")
>>>>>> 
>>>>>> Signed-off-by: Li Feng <fen...@smartx.com>
>>>>>> ---
>>>>>> hw/virtio/vhost-user.c | 10 +---------
>>>>>> 1 file changed, 1 insertion(+), 9 deletions(-)
>>>>>> 
>>>>>> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
>>>>>> index 8dcf049d42..697b403fe2 100644
>>>>>> --- a/hw/virtio/vhost-user.c
>>>>>> +++ b/hw/virtio/vhost-user.c
>>>>>> @@ -2648,16 +2648,8 @@ typedef struct {
>>>>>> static void vhost_user_async_close_bh(void *opaque)
>>>>>> {
>>>>>> VhostAsyncCallback *data = opaque;
>>>>>> -    struct vhost_dev *vhost = data->vhost;
>>>>>> 
>>>>>> -    /*
>>>>>> -     * If the vhost_dev has been cleared in the meantime there is
>>>>>> -     * nothing left to do as some other path has completed the
>>>>>> -     * cleanup.
>>>>>> -     */
>>>>>> -    if (vhost->vdev) {
>>>>>> -        data->cb(data->dev);
>>>>>> -    }
>>>>>> +    data->cb(data->dev);
>>>>>> 
>>>>>> g_free(data);
>>>>>> }
>>>>>> -- 
>>>>>> 2.41.0
Re: [PATCH 1/2] vhost-user: fix lost reconnect

Reply via email to