Calling napi_disable() on an already disabled NAPI can cause a
deadlock. Commit 4bc12818b363 ("virtio-net: disable delayed refill
when pausing rx") avoids this by disabling and cancelling the delayed
refill work when pausing RX in virtnet_rx_pause[_all](). However,
virtnet_rx_resume_all() enables the delayed refill work too early,
before all the receive queue NAPIs have been enabled.
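
To illustrate the window, here is a minimal userspace model of the two
orderings. It is only a sketch: none of these names exist in
virtio_net.c, the queue count is arbitrary, and a per-queue flag stands
in for the refill enable state as a simplification. It counts the points
in a resume loop at which a reschedule would let a refill worker call
napi_disable() on a NAPI that is still disabled.

```c
#include <stdbool.h>

#define MODEL_NQUEUES 4 /* hypothetical queue count for the model */

enum model_order { MODEL_REFILL_FIRST, MODEL_NAPI_FIRST };

/* Illustrative state only, not the driver's receive_queue. */
struct hazard_rq {
	bool napi_enabled;
	bool refill_enabled;
};

/* True if refill work could run for some queue whose NAPI is still
 * disabled, i.e. the worker would napi_disable() a disabled NAPI. */
static bool model_exposed(const struct hazard_rq *rqs)
{
	for (int i = 0; i < MODEL_NQUEUES; i++)
		if (rqs[i].refill_enabled && !rqs[i].napi_enabled)
			return true;
	return false;
}

/* Walk a resume loop and count the reschedule points at which the
 * refill worker could hit a disabled NAPI. */
int model_count_hazard_points(enum model_order order)
{
	struct hazard_rq rqs[MODEL_NQUEUES] = {0};
	int hazards = 0;

	/* Refill-first: refill is allowed for every queue up front. */
	if (order == MODEL_REFILL_FIRST)
		for (int i = 0; i < MODEL_NQUEUES; i++)
			rqs[i].refill_enabled = true;

	for (int i = 0; i < MODEL_NQUEUES; i++) {
		/* Possible reschedule at the top of the iteration. */
		hazards += model_exposed(rqs);
		rqs[i].napi_enabled = true;
		/* Possible reschedule between the two enables. */
		hazards += model_exposed(rqs);
		/* NAPI-first: allow refill only once this NAPI is up. */
		if (order == MODEL_NAPI_FIRST)
			rqs[i].refill_enabled = true;
	}
	return hazards;
}
```

In this model the refill-first ordering is exposed at almost every
reschedule point, while enabling refill per queue after its NAPI leaves
no exposed point.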

The deadlock can be reproduced by running
selftests/drivers/net/hw/xsk_reconfig.py with a multiqueue virtio-net
device and inserting a cond_resched() inside the for loop in
virtnet_rx_resume_all() to increase the success rate. Because the worker
processing the delayed refill work runs on the same CPU as
virtnet_rx_resume_all(), a reschedule is needed to cause the deadlock.
In a real scenario, contention on netdev_lock can cause the
reschedule.

In this series, we make the refill work a per receive queue work instead
so that we can manage them separately and avoid further mistakes.

- Patch 1 makes the refill work a per receive queue work. It fixes the
deadlock in the reproducer because now we only need to ensure that the
refill work is scheduled after the NAPI of its own receive queue is
enabled, not the NAPIs of all queues. After this patch,
enable_delayed_refill is still called before napi_enable in
virtnet_rx_resume[_all], but I don't see how the work can be scheduled
in that window.
- Patch 2 moves the enable_delayed_refill after napi_enable and fixes the
deadlock variant in virtnet_open.
- Patch 3 fixes an issue that arises when enable_delayed_refill is moved
after napi_enable: a refill work might need to be scheduled in
virtnet_receive but cannot be, because refill work is disabled. This can
leave the receive side stuck. So we set a pending bit, and when refill
work is later enabled, the work is scheduled.
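
The pending bit idea from patch 3 can be sketched with a small
userspace model. This is an illustration under assumptions, not the
code from the patches: the struct, field and function names are
hypothetical, and a counter stands in for actually queueing the work.

```c
#include <stdbool.h>

/* Hypothetical model state; names are not taken from the patches. */
struct pending_rq {
	bool refill_enabled;
	bool refill_pending;
	int scheduled; /* how many times the refill work was queued */
};

/* Receive path wants a refill: queue the work if allowed, otherwise
 * remember the request so it is not lost. */
void model_request_refill(struct pending_rq *rq)
{
	if (rq->refill_enabled)
		rq->scheduled++;
	else
		rq->refill_pending = true;
}

/* Called after the NAPI of this queue has been enabled: allow refill
 * and queue any request that arrived while it was disabled. */
void model_enable_refill(struct pending_rq *rq)
{
	rq->refill_enabled = true;
	if (rq->refill_pending) {
		rq->refill_pending = false;
		rq->scheduled++;
	}
}
```

A request made while refill is disabled is deferred rather than
dropped, so the receive side is refilled once the work is enabled. The
real driver would also need the locking or atomics that this model
leaves out.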

All 3 patches need to be applied to fix the issue, so does that mean I
need to add Fixes and Cc stable tags to all 3?

Link to the previous approach and discussion:
https://lore.kernel.org/netdev/[email protected]/

Reported-by: Paolo Abeni <[email protected]>
Closes: https://netdev-ctrl.bots.linux.dev/logs/vmksft/drv-hw-dbg/results/400961/3-xdp-py/stderr

Thanks,
Quang Minh.

Bui Quang Minh (3):
  virtio-net: make refill work a per receive queue work
  virtio-net: ensure rx NAPI is enabled before enabling refill work
  virtio-net: schedule the pending refill work after being enabled

 drivers/net/virtio_net.c | 173 ++++++++++++++++++++-------------------
 1 file changed, 91 insertions(+), 82 deletions(-)

-- 
2.43.0

