On 2/28/23 17:05, Maxime Coquelin wrote:
> 
> 
> On 2/28/23 16:37, Ilya Maximets wrote:
>> On 2/28/23 16:16, Maxime Coquelin wrote:
>>> Hi,
>>>
>>> On 2/28/23 10:33, Wan Junjie wrote:
>>>> Hi,
>>>>
>>>> On Tue, Feb 28, 2023 at 4:03 PM David Marchand
>>>> <[email protected]> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> On Fri, Feb 24, 2023 at 2:55 AM Han Ding <[email protected]> wrote:
>>>>>>> On 2/23/23 10:35, Han Ding wrote:
>>>>>>>>
>>>>>>>> When use ovs-vsctl to delete vhostuserclient port while qemu destroy 
>>>>>>>> virtio,
>>>>>>>> there is a deadlock between OVS main thread and the vhost-events 
>>>>>>>> thread.
>>>>>>>>
>>>>>>>> openvswitch is 2.14 and dpdk is 20.11
>>>>>>>
>>>>>>> FWIW, 2.14 supposed to be used with 19.11.
>>>>>
>>>>> Indeed.
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks for reply.
>>>>>>
>>>>>> The code flow path of openvswitch 3.0.1 and dpdk 21.11, dpdk 22.11 is 
>>>>>> same
>>>>>> for destroying vhostuserclient device. So, I think  there is a same bug 
>>>>>> in new version.
>>>>>
>>>>> Please double check the dpdk versions you are using, then have a look
>>>>> at my comment below.
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>>>
>>>>>>>> Main thread:
>>>>>>>> ofport_remove
>>>>>>>>     -> netdev_unref
>>>>>>>>         -> netdev_dpdk_vhost_destruct
>>>>>>>>              -> rte_vhost_driver_unregister (If fdentry is busy, 
>>>>>>>> circle again until the fdentry is not busy)
>>>>>>>>                  -> fdset_try_del  (fdentry is busy now, return -1. 
>>>>>>>> Goto again)
>>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>> vhost-nets thread:
>>>>>>>> fdset_event_dispatch (set fdentry to busy. When destroy_device fuction 
>>>>>>>> return set the fdentry to no busy)
>>>>>>>>      -> vhost_user_read_cb
>>>>>>>>        -> vhost_user_msg_handler
>>>>>>>>            -> vhost_user_get_vring_base
>>>>>>>>                ->destroy_device
>>>>>>>>                   -> ovsrcu_synchronize (Wait for other threads to 
>>>>>>>> quiesce)
>>>>>>
>>>>>>
>>>>>>
>>>>>>>> The  vhost-nets thread wait for main thread to quiesce, but the main 
>>>>>>>> thread now is waiting for fdentry to no busy
>>>>>>>> and circle all the time.
>>>>>>
>>>>
>>>>
>>>> I believe this issue exists in older versions, we have seen it before.
>>>>
>>>> When we create and destroy vm concurrently and continuously. This
>>>> could be reproduced easily.
>>>>
>>>> We found a related patch from dpdk which is not accepted.
>>>>
>>>> https://patchwork.dpdk.org/project/dpdk/patch/[email protected]/
>>>
>>> Thanks for the pointer.
>>>
>>> I spent some time thinking about it, and I do not see other alternative
>>> than doing what the patch you shared do, i.e. returning -EAGAIN and
>>> expecting the application to handle it properly.
>>>
>>> In order not to break existing applications that expect
>>> rte_vhost_driver_unregister to block, maybe we can introduce a new flag
>>> passed at registration time specifying the application support non-
>>> blocking unregistration.
>>>
>>> Without the flag passed, the lib would behave the same as currently,
>>> i.e. block until fdet_try_del() succeeds.
>>>
>>> Would that solution work with OVS?
>>
>> We can't really fail the netdev_dpdk_vhost_destruct() in OVS as there
>> is no mechanism to re-try it.
> 
> Would it be possible to enter into a quiescent state before retrying as
> the author of the initial patch suggests?
> "
> It is better to return -EAGAIN to application, who will decide how to handle
> (eg OVS can call ovsrcu_quiesce() and then retry).
> "

I didn't dig too deep, but that doesn't sound safe.
We're using ofproto object on this path higher in the call stack
and ofproto is RCU-protected.  Entering quiescent state here may
result in ofproto object being freed.

> 
>>> Any other alternative in mind?
>>
>> We may postpone the rte_vhost_driver_unregister() call in OVS, but
>> we'll need to make sure somehow that the same port is not re-created
>> while it's not yet destroyed.
>>
>> Just a thought, I don't really know how to implement that as failing
>> the netdev_open() might also be a problem.
>>
>> Best regards, Ilya Maximets.
>>
>>>
>>> Regards,
>>> Maxime
>>>
>>>>
>>>>>>
>>>>>>>>
>>>>>>>> Whether the ovsrcu_synchronize  is necessary in destrory_device
>>>>>>>
>>>>>>> Yes, we have to wait for all threads to stop using this device.
>>>>>>>
>>>>>>>> or the rte_vhost_driver_unregister is correct in fdset_try_del ?
>>>>>>>
>>>>>>> CC: David and Maxime.
>>>>>
>>>>> This could be a different issue, but there was a change for this area
>>>>> of the vhost library in v21.11:
>>>>> 451dc0fad83d ("vhost: fix crash on port deletion")
>>>>>
>>>>> I don't see it backported to 19.11 stable release, but it got to
>>>>> v20.11.4 (used in 2.15 and 2.16 ovs branches).
>>>>>
>>>>> I would suggest backporting it to your 19.11 dpdk and retest with ovs 
>>>>> 2.14.
>>>>>
>>>>>
>>>>> -- 
>>>>> David Marchand
>>>>>
>>>>> _______________________________________________
>>>>> dev mailing list
>>>>> [email protected]
>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>>>
>>>
>>
> 

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to