On 29/01/18(Mon) 21:25, Artturi Alm wrote:
> On Mon, Jan 29, 2018 at 08:03:38PM +0100, Martin Pieuchot wrote:
> > On 29/01/18(Mon) 20:38, Artturi Alm wrote:
> > > On Mon, Jan 29, 2018 at 10:42:20AM +0100, Martin Pieuchot wrote:
> > > > Hello Artturi,
> > > > 
> > > > On 28/01/18(Sun) 09:08, Artturi Alm wrote:
> > > > > >Synopsis:    stuck in netlock
> > > > > >Category:    amd64
> > > > > >Environment:
> > > > >       System      : OpenBSD 6.2
> > > > >       Details     : OpenBSD 6.2-current (GENERIC.MP) #333: Sun Jan  7 
> > > > > 09:13:00 MST 2018
> > > > >                        
> > > > > dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> > > > > 
> > > > >       Architecture: OpenBSD.amd64
> > > > >       Machine     : amd64
> > > > > >Description:
> > > > >       processes getting stuck w/STATE=netlock, kill has no effect.
> > > > > >How-To-Repeat:
> > > > >       using the desktop normally, until trying to restart chrome ends
> > > > >       up failing.
> > > > 
> > > > What do you mean with "using the desktop normally"?  Which applications
> > > > are you using?  Which browser plugins?  Can you find out the minimum
> > > > setup to reproduce this deadlock?
> > > > 
> > > > >       I've had this happen to me at least twice in the last few weeks.
> > > > 
> > > > Do you know how to reproduce it easily?
> > > > 
> > > 
> > > this time i had less than 10 tabs open, so i guess it can be narrowed
> > > down even further.
> > > 
> > > > >       The first time, i noticed how trying to launch chrome locked up
> > > > >       all the other processes in netlock, and "pkill chrome" did allow
> > > > >       the system to recover. i was unable to figure out what was wrong,
> > > > >       and rebooting did make everything work again, while e.g.
> > > > >       removing ~/.cache & ~/.config did not.
> > > > 
> > > > So the deadlock is related to your chrome usage?
> > > > 
> > > 
> > > now it does feel like it, yes. i'll upgrade tonight.
> > > 
> > > > >       long before running the "ps cl" below, i had already killed all
> > > > >       the xterm windows those processes were in. cwm(1) was unable to
> > > > >       kill some of those, and xkill did not help either.
> > > > 
> > > > Well, killing processes waiting for the 'netlock' won't help.  What has
> > > > to be found is which process is holding it.  For that we need the full
> > > > ps output, including kernel and userland threads.
> > > > > 
> > > > >       after exiting X w/ctrl+alt+backspace (iirc?) i didn't get back to
> > > > >       the $-prompt, and ^T did show xauth stuck in netlock..
> > > > >       i guess it's obvious where it was heading; so i got pics of
> > > > >       "# reboot -nq" failing because it was stuck in the fckng netlock -_-
> > > > > 
> > > > >       i do have ddb.{panic,console,log}=1, but
> > > > >       "# sysctl ddb.trigger=1" ==
> > > > >       "sysctl: ddb.trigger: Operation not supported by device"
> > > > 
> > > > Not having DDB access will limit the debugging experience.  Are you sure
> > > > you tried to enter it on your console?
> > > > 
> > > 
> > > so this requires ttyC0, right?
> > > this time it was ifconfig in [netlock] that prevented using ttyC0.
> > > i got there from X by running "virsh shutdown <domain>" on the kvm host,
> > > i guess it emulates what pressing the actual power button would (ACPI?).
> > > 
> > > > >       ?? so i had no option but "virsh reset <domain>"...
> > > > 
> > > > Did you try top(1)?  What were the kernel processes doing?
> > > 
> > > see below; hopefully "top -bCHS -d 1 999" will do.
> > > anything else i could do? anyway, thanks in advance :)
> > 
> > This is where the problem comes from: 
> > 
> > > 33315   443734  -6    0  141M  102M idle      viowait   0:00  0.00% chrome: 
> > 
> > I don't understand how chrome can end up sleeping in vio_ioctl() and why
> > it is sleeping forever.  But this thread is holding the NET_LOCK() and
> > prevents the rest of the kernel from making progress.
> > 
> > Could you try a virtual interface different from vio(4) and see if you
> > can reproduce the problem?
> 
> Will try with 'e1000', but then this does seem to me like it might have
> something to do with routing too(?), as vio0 is only for reaching the
> host, and there is a separate physical interface to which the default
> route belongs.

Here's a diff to fix vio(4), could you give it a go?

Index: dev/pv/if_vio.c
===================================================================
RCS file: /cvs/src/sys/dev/pv/if_vio.c,v
retrieving revision 1.4
diff -u -p -r1.4 if_vio.c
--- dev/pv/if_vio.c     10 Aug 2017 18:03:51 -0000      1.4
+++ dev/pv/if_vio.c     23 Feb 2018 09:14:29 -0000
@@ -1276,7 +1276,8 @@ vio_wait_ctrl(struct vio_softc *sc)
        int r = 0;
 
        while (sc->sc_ctrl_inuse != FREE) {
-               r = tsleep(&sc->sc_ctrl_inuse, PRIBIO|PCATCH, "viowait", 0);
+               r = rwsleep(&sc->sc_ctrl_inuse, &netlock, PRIBIO|PCATCH,
+                   "viowait", 0);
                if (r == EINTR)
                        return r;
        }
@@ -1295,7 +1296,8 @@ vio_wait_ctrl_done(struct vio_softc *sc)
                        r = 1;
                        break;
                }
-               r = tsleep(&sc->sc_ctrl_inuse, PRIBIO|PCATCH, "viodone", 0);
+               r = rwsleep(&sc->sc_ctrl_inuse, &netlock, PRIBIO|PCATCH,
+                   "viodone", 0);
                if (r == EINTR)
                        break;
        }
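
To spell out why the change matters, here is a stripped-down sketch of
vio_wait_ctrl() before and after, with comments added.  This is only an
illustration, not a new patch: the surrounding declarations are omitted,
and the point is simply that the ioctl path enters the driver with the
global `netlock' rwlock held (the lock behind NET_LOCK()/NET_UNLOCK()).

	/*
	 * Before: tsleep(9) keeps the caller's netlock held for the
	 * whole sleep.  Every other thread that needs the NET_LOCK()
	 * queues up behind this one, so if the control command never
	 * completes the whole stack ends up "stuck in netlock".
	 */
	while (sc->sc_ctrl_inuse != FREE) {
		r = tsleep(&sc->sc_ctrl_inuse, PRIBIO|PCATCH, "viowait", 0);
		if (r == EINTR)
			return r;
	}

	/*
	 * After: rwsleep(9) releases `netlock' while asleep and
	 * re-acquires it before returning, so the rest of the kernel
	 * keeps making progress while vio(4) waits on its control
	 * queue.
	 */
	while (sc->sc_ctrl_inuse != FREE) {
		r = rwsleep(&sc->sc_ctrl_inuse, &netlock, PRIBIO|PCATCH,
		    "viowait", 0);
		if (r == EINTR)
			return r;
	}

The identifier slept on and the wakeup side are unchanged; the only
difference is which lock stays held across the sleep.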
