On Mon, Jun 27, 2022 at 06:43:34PM +0200, Caspar Schutijser wrote:
> On Mon, Jun 27, 2022 at 06:29:55PM +0200, Martin Pieuchot wrote:
> > On 27/06/22(Mon) 18:04, Caspar Schutijser wrote:
> > > On Sun, Jun 26, 2022 at 10:03:59PM +0200, Martin Pieuchot wrote:
> > > > On 26/06/22(Sun) 20:36, Caspar Schutijser wrote:
> > > > > A laptop of mine (dmesg below) frequently hangs. After some bisecting
> > > > > and extensive testing I think I found the commit that causes this:
> > > > > > mpi@'s "Always acquire the `vmobjlock' before incrementing an
> > > > > > object's reference." commit from 2022-04-28.
> > > > > 
> > > > > My definition of "the system hangs": 
> > > > >  * Display is frozen
> > > > >  * Switching to ttyC0 using Ctrl+Alt+F1 doesn't do anything
> > > > >  * System does not respond to keyboard or mouse input
> > > > >  * Pressing the power button for 1-2 seconds doesn't achieve anything
> > > > > (usually this initiates a system shutdown)
> > > > >  * And also the fan starts spinning
> > > > > 
> > > > > > The system sometimes hangs very soon after booting; I've seen it
> > > > > > happen once while I was typing my username in xenodm to log in.
> > > > > > But sometimes it takes a couple of hours.
> > > > > 
> > > > > For some reason I put
> > > > > "@reboot while sleep 1 ; do sync ; done"
> > > > > in my crontab and it *seems* (I'm not sure) that the hangs occur more
> > > > > frequently this way. Not sure if that is useful information.
> > > > > 
> > > > > I don't see similar problems on my other machines.
> > > > > 
> > > > > It looks like when the system hangs, it's stuck spinning in the new
> > > > > code that was added in that commit; to confirm that I added some code
> > > > > > (see the diff below) to enter ddb if it's spinning there for 10
> > > > > > seconds (and then it indeed enters ddb). If my thinking and diff make
> > > > > > sense I think that indeed confirms that is the problem.
> > > > > 
> > > > > Any tips for debugging this?
> > > > 
> > > > > I believe I introduced a deadlock.  If you can reproduce it, could you
> > > > > get us the output of `ps' in ddb(4) and the trace of all the active
> > > > > processes?
> > > > 
> > > > I guess one is waiting for the KERNEL_LOCK() while holding the uobj's
> > > > vmobjlock.
> > > 
> > > "ps" output (pictures only):
> > > https://temp.schutijser.com/~caspar/2022-06-27-ddb/ps-1.jpg
> > > https://temp.schutijser.com/~caspar/2022-06-27-ddb/ps-2.jpg
> > > https://temp.schutijser.com/~caspar/2022-06-27-ddb/ps-3.jpg
> > > https://temp.schutijser.com/~caspar/2022-06-27-ddb/ps-4.jpg
> > > 
> > > 
> > > traces of active processes (I hope; if this is not correct I'm happy
> > > to run different commands; pictures and transcription follow):
> > > https://temp.schutijser.com/~caspar/2022-06-27-ddb/trace-1.jpg
> > > 
> > > ddb{1}> ps /o
> > >     TID    PID    UID    PRFLAGS    PFLAGS  CPU  COMMAND
> > > *246699  86564   1000        0x2         0    1K sync
> > >  395058  12288     48   0x100012         0    0  unwind
> > > ddb{1}> trace /t 0t246699
> > > kernel: protection fault trap, code=0
> > > Faulted in DDB; continuing...
> > > ddb{1}> trace /t 0t395058
> > > uvm_fault(0xfffffd8448ab5338, 0x1, 0, 1) -> e
> > > kernel: page fault trap, code=0
> > > Faulted in DDB; continuing...
> > > ddb{1}>
> > 
> > Is it a hang or a panic/fault?
> 
> This was a hang again; the laptop entered ddb after 10 seconds because
> of my patch. I've never seen it panic/fault because of this, I think.
> 
> > Here's a possible fix.  The idea is to make the list private to the sync
> > function such that we could sleep on the lock without lock ordering
> > reversal.
> > 
> > That means multiple syncs could be started in parallel; this should be
> > fine as the objects are refcounted and only the first flush results in
> > I/O.
> 
> Thanks, I'll give this a spin and report back tomorrow or the day after
> (to allow for some proper testing).

In the end I tested your diff for three days. The laptop has been stable
with your diff applied (no more hangs), even though without your patch
I could practically trigger the problem at will. So it looks like your
diff fixes it. Thanks!

Caspar
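
For readers following the thread, the idea behind the fix quoted above
(keep the list private to the sync function, so that sleeping on an
object's lock never happens while the shared list lock is held) can be
sketched in userland C roughly as follows. This is not the diff that was
tested; every name here (struct obj, list_lock, obj_register, sync_all)
is hypothetical, pthread mutexes stand in for the kernel locks, and the
reference count is guarded by the list lock purely to keep the example
short, which is not how the kernel code discussed in the thread handles
references.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical refcounted object with its own sleeping lock. */
struct obj {
	pthread_mutex_t	 lock;		/* stands in for the per-object lock */
	int		 refs;		/* guarded by list_lock (sketch only) */
	int		 dirty;		/* guarded by lock */
	int		 flushes;	/* times "I/O" was issued, guarded by lock */
	struct obj	*next;		/* shared list linkage, guarded by list_lock */
};

static pthread_mutex_t	 list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct obj	*all_objs;	/* shared list head */

/* Initialise an object as dirty and put it on the shared list. */
static void
obj_register(struct obj *o)
{
	pthread_mutex_init(&o->lock, NULL);
	o->refs = 1;
	o->dirty = 1;
	o->flushes = 0;
	pthread_mutex_lock(&list_lock);
	o->next = all_objs;
	all_objs = o;
	pthread_mutex_unlock(&list_lock);
}

/*
 * Flush every dirty object.  Returns the number of objects this call
 * flushed, or -1 if the private snapshot could not be allocated.
 */
static int
sync_all(void)
{
	struct obj **snap, *o;
	size_t i, n = 0;
	int flushed = 0;

	/* Step 1: build a private snapshot; no object lock is taken here. */
	pthread_mutex_lock(&list_lock);
	for (o = all_objs; o != NULL; o = o->next)
		n++;
	snap = calloc(n ? n : 1, sizeof(*snap));
	if (snap == NULL) {
		pthread_mutex_unlock(&list_lock);
		return -1;
	}
	for (i = 0, o = all_objs; o != NULL; o = o->next) {
		o->refs++;		/* keep the object alive while we work */
		snap[i++] = o;
	}
	pthread_mutex_unlock(&list_lock);

	/* Step 2: walk the snapshot; list_lock is not held while we block. */
	for (i = 0; i < n; i++) {
		o = snap[i];
		pthread_mutex_lock(&o->lock);
		if (o->dirty) {		/* only the first flush does any work */
			o->dirty = 0;
			o->flushes++;
			flushed++;
		}
		pthread_mutex_unlock(&o->lock);

		pthread_mutex_lock(&list_lock);
		assert(o->refs > 1);
		o->refs--;		/* drop the reference taken in step 1 */
		pthread_mutex_unlock(&list_lock);
	}
	free(snap);
	return flushed;
}
```

Because each caller walks its own snapshot, two calls to sync_all() can
overlap without sharing any list state; the second one simply finds the
objects already clean and issues no further "I/O".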
