Re: [Intel-wired-lan] [PATCH iwl-net] e1000: Move cancel_work_sync to avoid deadlock

Keller, Jacob E Mon, 02 Jun 2025 14:33:01 -0700

> -----Original Message-----
> From: Joe Damato <[email protected]>
> Sent: Monday, June 2, 2025 1:32 PM
> To: Jakub Kicinski <[email protected]>
> Cc: Stanislav Fomichev <[email protected]>; [email protected];
> [email protected]; Keller, Jacob E <[email protected]>;
> [email protected]; Nguyen, Anthony L
> <[email protected]>; Kitszel, Przemyslaw
> <[email protected]>; Andrew Lunn <[email protected]>; David
> S. Miller <[email protected]>; Eric Dumazet <[email protected]>;
> Paolo Abeni <[email protected]>; moderated list:INTEL ETHERNET DRIVERS
> <[email protected]>; open list <[email protected]>
> Subject: Re: [PATCH iwl-net] e1000: Move cancel_work_sync to avoid deadlock
> 
> On Fri, May 30, 2025 at 06:31:40PM -0700, Jakub Kicinski wrote:
> > On Fri, 30 May 2025 12:45:13 -0700 Joe Damato wrote:
> > > > nit: as Jakub mentioned in another thread, it seems more about the
> > > > flush_work waiting for the reset_task to complete rather than
> > > > wq mutexes (which are fake)?
> > >
> > > Hm, I probably misunderstood something. Also, not sure what you
> > > meant by the wq mutexes being fake?
> > >
> > > My understanding (which is prob wrong) from the syzbot and user
> > > report was that the order of wq mutex and rtnl are inverted in the
> > > two paths, which can cause a deadlock if both paths run.
> >
> > Take a look at touch_work_lockdep_map(), theres nosaj thing as wq mutex.
> > It's just a lockdep "annotation" that helps lockdep connect the dots
> > between waiting thread and the work item, not a real mutex. So the
> > commit msg may be better phrased like this (modulo the lines in front):
> >
> >    CPU 0:
> >   , - RTNL is held
> >  /  - e1000_close
> >  |  - e1000_down
> >  +- - cancel_work_sync (cancel / wait for e1000_reset_task())
> >  |
> >  | CPU 1:
> >  |  - process_one_work
> >   \ - e1000_reset_task
> >    `- take RTNL
> 
> OK, I'll resubmit shortly with the following commit message:
> 
>     e1000: Move cancel_work_sync to avoid deadlock
> 
>     Previously, e1000_down called cancel_work_sync for the e1000 reset task
>     (via e1000_down_and_stop), which takes RTNL.
> 
>     As reported by users and syzbot, a deadlock is possible in the following
>     scenario:
> 
>     CPU 0:
>       - RTNL is held
>       - e1000_close
>       - e1000_down
>       - cancel_work_sync (cancel / wait for e1000_reset_task())
> 
>     CPU 1:
>       - process_one_work
>       - e1000_reset_task
>       - take RTNL
> 
>     To remedy this, avoid calling cancel_work_sync from e1000_down
>     (e1000_reset_task does nothing if the device is down anyway). Instead,
>     call cancel_work_sync for e1000_reset_task when the device is being
>     removed.

Acked-by: Jacob Keller <[email protected]>
Re: [Intel-wired-lan] [PATCH iwl-net] e1000: Move cancel_work_sync to avoid deadlock

Reply via email to