On 2/28/23 16:37, Ilya Maximets wrote:
On 2/28/23 16:16, Maxime Coquelin wrote:
Hi,

On 2/28/23 10:33, Wan Junjie wrote:
Hi,

On Tue, Feb 28, 2023 at 4:03 PM David Marchand
<[email protected]> wrote:

Hello,

On Fri, Feb 24, 2023 at 2:55 AM Han Ding <[email protected]> wrote:
On 2/23/23 10:35, Han Ding wrote:

When use ovs-vsctl to delete vhostuserclient port while qemu destroy virtio,
there is a deadlock between OVS main thread and the vhost-events thread.

openvswitch is 2.14 and dpdk is 20.11

FWIW, 2.14 supposed to be used with 19.11.

Indeed.



Thanks for reply.

The code flow path of openvswitch 3.0.1 and dpdk 21.11, dpdk 22.11 is same
for destroying vhostuserclient device. So, I think  there is a same bug in new 
version.

Please double check the dpdk versions you are using, then have a look
at my comment below.





Main thread:
ofport_remove
    -> netdev_unref
        -> netdev_dpdk_vhost_destruct
             -> rte_vhost_driver_unregister (If fdentry is busy, circle again 
until the fdentry is not busy)
                 -> fdset_try_del  (fdentry is busy now, return -1. Goto again)




vhost-nets thread:
fdset_event_dispatch (set fdentry to busy. When destroy_device fuction return 
set the fdentry to no busy)
     -> vhost_user_read_cb
       -> vhost_user_msg_handler
           -> vhost_user_get_vring_base
               ->destroy_device
                  -> ovsrcu_synchronize (Wait for other threads to quiesce)



The  vhost-nets thread wait for main thread to quiesce, but the main thread now 
is waiting for fdentry to no busy
and circle all the time.



I believe this issue exists in older versions, we have seen it before.

When we create and destroy vm concurrently and continuously. This
could be reproduced easily.

We found a related patch from dpdk which is not accepted.

https://patchwork.dpdk.org/project/dpdk/patch/[email protected]/

Thanks for the pointer.

I spent some time thinking about it, and I do not see other alternative
than doing what the patch you shared do, i.e. returning -EAGAIN and
expecting the application to handle it properly.

In order not to break existing applications that expect
rte_vhost_driver_unregister to block, maybe we can introduce a new flag
passed at registration time specifying the application support non-
blocking unregistration.

Without the flag passed, the lib would behave the same as currently,
i.e. block until fdet_try_del() succeeds.

Would that solution work with OVS?

We can't really fail the netdev_dpdk_vhost_destruct() in OVS as there
is no mechanism to re-try it.

Would it be possible to enter into a quiescent state before retrying as
the author of the initial patch suggests?
"
It is better to return -EAGAIN to application, who will decide how to handle
(eg OVS can call ovsrcu_quiesce() and then retry).
"

Any other alternative in mind?

We may postpone the rte_vhost_driver_unregister() call in OVS, but
we'll need to make sure somehow that the same port is not re-created
while it's not yet destroyed.

Just a thought, I don't really know how to implement that as failing
the netdev_open() might also be a problem.

Best regards, Ilya Maximets.


Regards,
Maxime




Whether the ovsrcu_synchronize  is necessary in destrory_device

Yes, we have to wait for all threads to stop using this device.

or the rte_vhost_driver_unregister is correct in fdset_try_del ?

CC: David and Maxime.

This could be a different issue, but there was a change for this area
of the vhost library in v21.11:
451dc0fad83d ("vhost: fix crash on port deletion")

I don't see it backported to 19.11 stable release, but it got to
v20.11.4 (used in 2.15 and 2.16 ovs branches).

I would suggest backporting it to your 19.11 dpdk and retest with ovs 2.14.


--
David Marchand

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev




_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to