On Thursday, November 30, 2023 8:42 PM, Heng Qi wrote:
<...>
> >>>> static void virtnet_remove(struct virtio_device *vdev)
> >>>> {
> >>>> 	struct virtnet_info *vi = vdev->priv;
> >>>> +	int i;
> >>>>
> >>>> 	virtnet_cpu_notif_remove(vi);
> >>>>
> >>>> 	/* Make sure no work handler is accessing the device. */
> >>>> 	flush_work(&vi->config_work);
> >>>> +	for (i = 0; i < vi->max_queue_pairs; i++)
> >>>> +		cancel_work(&vi->rq[i].dim.work);
<...>
> There's cancel_work_sync() in v4 and I did reproduce the deadlock.
>
> rtnl_lock held -> .ndo_stop() -> cancel_work_sync() ->
> virtnet_rx_dim_work(),
> the work acquires the rtnl_lock again, then a deadlock occurs.
>
> I tested the scenario of ctrl cmd/.remove/.ndo_stop()/dim_work when there is
> a big concurrency, and cancel_work() works well.
I think the question here is: why do you need to call `cancel_work()` in `remove()` at all?
You already call it in `close()`, and `remove()` reaches `close()` through this call chain:
remove() -> unregister_netdev() -> rtnl_lock() -> ndo_stop() -> close()
Similarly, you don't need it in the unwind path of `probe()` either.
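
For reference, the v4 deadlock you describe can be modelled in user space. This is only a rough sketch with pthreads standing in for the kernel primitives, not driver code: `rtnl`, `dim_work()` and `would_deadlock()` are illustrative names, and the worker uses a trylock so that the program reports the conflict instead of hanging.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Stand-in for rtnl_lock (assumption: a plain mutex is a close enough model). */
static pthread_mutex_t rtnl = PTHREAD_MUTEX_INITIALIZER;

/*
 * Stand-in for virtnet_rx_dim_work(): the handler needs the lock.
 * trylock is used so the model returns EBUSY where the real handler
 * would block forever.
 */
static void *dim_work(void *arg)
{
	int *rc = arg;

	*rc = pthread_mutex_trylock(&rtnl);
	if (*rc == 0)
		pthread_mutex_unlock(&rtnl);
	return NULL;
}

/*
 * Models .ndo_stop() -> cancel_work_sync(): the caller holds the lock
 * and waits for the handler to finish. Returns 1 if the handler could
 * not take the lock, i.e. the real code would deadlock at this point.
 */
int would_deadlock(void)
{
	pthread_t worker;
	int rc = -1;
	int err;

	pthread_mutex_lock(&rtnl);		/* rtnl_lock held by the stop path */
	err = pthread_create(&worker, NULL, dim_work, &rc);
	assert(err == 0);
	pthread_join(worker, NULL);		/* the "_sync" part: wait for the handler */
	pthread_mutex_unlock(&rtnl);

	return rc == EBUSY;
}
```

The wait under the lock is what closes the cycle. `cancel_work()` does not wait for a running handler, which is consistent with your test results.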
>
<...>