> -----Original Message-----
> From: Joe Damato <[email protected]>
> Sent: Monday, June 2, 2025 1:32 PM
> To: Jakub Kicinski <[email protected]>
> Cc: Stanislav Fomichev <[email protected]>; [email protected];
> [email protected]; Keller, Jacob E <[email protected]>;
> [email protected]; Nguyen, Anthony L
> <[email protected]>; Kitszel, Przemyslaw
> <[email protected]>; Andrew Lunn <[email protected]>; David
> S. Miller <[email protected]>; Eric Dumazet <[email protected]>;
> Paolo Abeni <[email protected]>; moderated list:INTEL ETHERNET DRIVERS
> <[email protected]>; open list <[email protected]>
> Subject: Re: [PATCH iwl-net] e1000: Move cancel_work_sync to avoid deadlock
>
> On Fri, May 30, 2025 at 06:31:40PM -0700, Jakub Kicinski wrote:
> > On Fri, 30 May 2025 12:45:13 -0700 Joe Damato wrote:
> > > > nit: as Jakub mentioned in another thread, it seems more about the
> > > > flush_work waiting for the reset_task to complete rather than
> > > > wq mutexes (which are fake)?
> > >
> > > Hm, I probably misunderstood something. Also, not sure what you
> > > meant by the wq mutexes being fake?
> > >
> > > My understanding (which is prob wrong) from the syzbot and user
> > > report was that the order of wq mutex and rtnl are inverted in the
> > > two paths, which can cause a deadlock if both paths run.
> >
> > Take a look at touch_work_lockdep_map(), theres nosaj thing as wq mutex.
> > It's just a lockdep "annotation" that helps lockdep connect the dots
> > between waiting thread and the work item, not a real mutex. So the
> > commit msg may be better phrased like this (modulo the lines in front):
> >
> > CPU 0:
> > , - RTNL is held
> > / - e1000_close
> > | - e1000_down
> > +- - cancel_work_sync (cancel / wait for e1000_reset_task())
> > |
> > | CPU 1:
> > | - process_one_work
> > \ - e1000_reset_task
> > `- take RTNL
>
> OK, I'll resubmit shortly with the following commit message:
>
> e1000: Move cancel_work_sync to avoid deadlock
>
> Previously, e1000_down called cancel_work_sync for the e1000 reset task
> (via e1000_down_and_stop), which takes RTNL.
>
> As reported by users and syzbot, a deadlock is possible in the following
> scenario:
>
> CPU 0:
> - RTNL is held
> - e1000_close
> - e1000_down
> - cancel_work_sync (cancel / wait for e1000_reset_task())
>
> CPU 1:
> - process_one_work
> - e1000_reset_task
> - take RTNL
>
> To remedy this, avoid calling cancel_work_sync from e1000_down
> (e1000_reset_task does nothing if the device is down anyway). Instead,
> call cancel_work_sync for e1000_reset_task when the device is being
> removed.
Acked-by: Jacob Keller <[email protected]>