Re: amd64: stuck in netlock
On 29/01/18(Mon) 21:25, Artturi Alm wrote:
> On Mon, Jan 29, 2018 at 08:03:38PM +0100, Martin Pieuchot wrote:
> > On 29/01/18(Mon) 20:38, Artturi Alm wrote:
> > > On Mon, Jan 29, 2018 at 10:42:20AM +0100, Martin Pieuchot wrote:
> > > > Hello Artturi,
> > > >
> > > > On 28/01/18(Sun) 09:08, Artturi Alm wrote:
> > > > > >Synopsis: stuck in netlock
> > > > > >Category: amd64
> > > > > >Environment:
> > > > > System : OpenBSD 6.2
> > > > > Details : OpenBSD 6.2-current (GENERIC.MP) #333: Sun Jan 7
> > > > > 09:13:00 MST 2018
> > > > >
> > > > > dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> > > > >
> > > > > Architecture: OpenBSD.amd64
> > > > > Machine : amd64
> > > > > >Description:
> > > > > processes getting stuck w/STATE=netlock, kill has no effect.
> > > > > >How-To-Repeat:
> > > > > using the desktop normally, until trying to restart chrome ends
> > > > > up failing.
> > > >
> > > > What do you mean with "using the desktop normally"? Which applications
> > > > are you using? Which browser plugins? Can you find out the minimum
> > > > setup to reproduce this deadlock?
> > > >
> > > > > I've had this happen to me atleast twice in the last few of
> > > > > weeks.
> > > >
> > > > Do you know how to reproduce it easily?
> > > >
> > >
> > > this time i had less than 10tabs open, so i guess it can be narrowed
> > > down even further.
> > >
> > > > > At first time i noticed how trying to launch chrome did lock up
> > > > > all the other processes in netlock, and "pkill chrome" did allow
> > > > > the system to recover, i was unable to figure out what was wrong
> > > > > and rebooting did make everything work again, while ie.
> > > > > removing ~/.cache & ~/.config did not.
> > > >
> > > > So the deadlock is related to your chrome usage?
> > > >
> > >
> > > now it does feel like so. i'll upgrade tonight.
> > >
> > > > > long before running the "ps cl" below, i had already killed all
> > > > > the xterm-windows those processes were in. cwm(1) was unable to
> > > > > kill some of those, but xkill did not.
> > > >
> > > > Well killing process waiting for the 'netlock' won't help. What has to
> > > > be find is which process is holding it. For that we need the full ps
> > > > output, including kernel and userland threads.
> > > > >
> > > > > after exiting X w/ctrl+alt+backspace(iirc?) i didn't get back to
> > > > > $-prompt, and ^T did show xauth stuck in netlock..
> > > > > i guess it's obvious where it was heading; so i got pics of
> > > > > "# reboot -nq" failing because stuck in the fckng netlock -_-
> > > > >
> > > > > i do have ddb.{panic,console,log}=1, but
> > > > > "# sysctl ddb.trigger=1" ==
> > > > > "sysctl: ddb.trigger: Operation not supported by device"
> > > >
> > > > Not having DDB access will limit the debugging experience. Are you sure
> > > > you tried to enter it on your console?
> > > >
> > >
> > > so this requires ttyC0, right?
> > > this time it was ifconfig in [netlock], that prevented using ttyC0.
> > > i got there from X by running "virsh shutdown
> > > i guess it emulates what pressing actual power button would(acpi?).
> > >
> > > > > ?? so i had no option but "virsh reset "...
> > > >
> > > > Did you try top(1)? What were the kernel processes doing?
> > >
> > > see below, if "top -bCHS -d 1 999" should do.
> > > anything else i could do? anyway, thanks in advance:)
> >
> > This is where the problems comes from:
> >
> > > 33315 443734 -60 141M 102M idle viowait 0:00 0.00%
> > > chrome:
> >
> > I don't understand how chrome can end up sleeping in vio_ioctl
Re: uvideo0: could not open VS pipe: INVAL
On 26/02/18(Mon) 15:20, C. wrote:
> Category:
> Webcam / Video
>
> Environment:
> System : OpenBSD 6.2
> Details : OpenBSD 6.2 (GENERIC.MP) #5: Fri Feb 2 23:02:19 CET
> 2018
>
> r...@syspatch-62-amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> Architecture: OpenBSD.amd64
> Machine : amd64
>
> Description:
> The integrated webcam (Lenovo Thinkpad T470, AzureWave Integrated
> Webcam) does not work.
> It's neither working in Firefox, nor in Chromium, nor in VLC, nor via
> fswebcam or luvcview.
> The firmware for the uvideo driver has been installed.

There's currently no support for isochronous transfers on xhci(4). Some
code is there but it has to be debugged and enabled.
Re: gdb hangs on exiting a running program
Thanks for the report.

On 19/03/18(Mon) 09:49, Theo Buehler wrote:
> This is a regression that came with the TOCTOU race fix in kern_sig.c 1.216:
> https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/kern/kern_sig.c#rev1.216
> [...]
> Now gdb just hangs there and does nothing instead of exiting as
> expected. It doesn't react to ^C but one can easily kill it with
> ^Z and then kill %%.

What happens is that the program stays stopped. Or, to be more precise,
it re-enters the SSTOP'd state after ptrace(PT_KILL...) has been issued
by gdb(1).

The problem comes from the fact that CURSIG() is now called twice in
userret(). That means that issignal() is also called twice. The fix
is to treat SIGKILL as special if the process is currently traced.

That's also what NetBSD is doing, so I synced our comment with theirs,
without a typo.

ok?

Index: kern/kern_sig.c
===
RCS file: /cvs/src/sys/kern/kern_sig.c,v
retrieving revision 1.216
diff -u -p -r1.216 kern_sig.c
--- kern/kern_sig.c	26 Feb 2018 13:33:25 -	1.216
+++ kern/kern_sig.c	19 Mar 2018 11:25:34 -
@@ -1167,11 +1167,13 @@ issignal(struct proc *p)
 		    (pr->ps_flags & PS_TRACED) == 0)
 			continue;
 
-		if ((pr->ps_flags & (PS_TRACED | PS_PPWAIT)) == PS_TRACED) {
-			/*
-			 * If traced, always stop, and stay
-			 * stopped until released by the debugger.
-			 */
+		/*
+		 * If traced, always stop, and stay stopped until released
+		 * by the debugger.  If our parent process is waiting for
+		 * us, don't hang as we could deadlock.
+		 */
+		if (((pr->ps_flags & (PS_TRACED | PS_PPWAIT)) == PS_TRACED) &&
+		    signum != SIGKILL) {
 			p->p_xstat = signum;
 
 			if (dolock)
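
[Editor's sketch] The condition changed by the diff above can be read in
isolation as a small predicate. The following is a minimal user-space
model, not kernel code: the flag values and the function name
stops_for_tracer() are made up for illustration, and only the boolean
logic mirrors the patched test in issignal().

```c
#include <signal.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's ps_flags bits. */
#define PS_TRACED	0x1
#define PS_PPWAIT	0x2

/*
 * Return true if a process with the given flags should stop and wait
 * for its debugger when `signum' is about to be delivered.
 */
static bool
stops_for_tracer(unsigned int ps_flags, int signum)
{
	/* Traced, and the parent is not waiting for us. */
	bool traced = (ps_flags & (PS_TRACED | PS_PPWAIT)) == PS_TRACED;

	/* SIGKILL is special: it must not park the process again. */
	return traced && signum != SIGKILL;
}
```

With this shape, a second evaluation for the same SIGKILL gives the same
answer as the first one, so calling the check twice no longer puts the
process back to sleep.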
Re: vmctl stop + tcpdump results in netlock panic
On 19/03/18(Mon) 15:58, Stefan Sperling wrote:
> The following will trigger "panic: rw_enter: netlock locking against myself":

The solution is to call bpfdetach() outside of the NET_LOCK(); it should
not need it. The diff below does that. Does it work for you?

Index: net/if.c
===
RCS file: /cvs/src/sys/net/if.c,v
retrieving revision 1.548
diff -u -p -r1.548 if.c
--- net/if.c	2 Mar 2018 15:52:11 -	1.548
+++ net/if.c	19 Mar 2018 15:22:17 -
@@ -1028,6 +1028,10 @@ if_detach(struct ifnet *ifp)
 	/* Other CPUs must not have a reference before we start destroying. */
 	if_idxmap_remove(ifp);
 
+#if NBPFILTER > 0
+	bpfdetach(ifp);
+#endif
+
 	NET_LOCK();
 	s = splnet();
 	ifp->if_qstart = if_detached_qstart;
@@ -1041,9 +1045,6 @@ if_detach(struct ifnet *ifp)
 	/* Remove the link state task */
 	task_del(net_tq(ifp->if_index), &ifp->if_linkstatetask);
 
-#if NBPFILTER > 0
-	bpfdetach(ifp);
-#endif
 	rti_delete(ifp);
 #if NETHER > 0 && defined(NFSCLIENT)
 	if (ifp->if_index == revarp_ifidx)
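
[Editor's sketch] Why moving one call fixes a "locking against myself"
panic can be shown with a toy non-recursive lock. Everything below is
hypothetical (toy_lock, toy_bpfdetach() and both detach variants); it
only assumes that the helper ends up taking the same lock somewhere down
its call chain, which is what the panic message suggests.

```c
#include <stdbool.h>

/* Toy non-recursive lock: it only remembers whether it is held. */
struct toy_lock {
	bool held;
};

/* Returns false where a real kernel would panic on re-entry. */
static bool
toy_enter(struct toy_lock *l)
{
	if (l->held)
		return false;
	l->held = true;
	return true;
}

static void
toy_exit(struct toy_lock *l)
{
	l->held = false;
}

/* Stand-in for a teardown helper that takes the lock itself. */
static bool
toy_bpfdetach(struct toy_lock *l)
{
	if (!toy_enter(l))
		return false;
	toy_exit(l);
	return true;
}

/* Old ordering: the helper runs with the lock already held. */
static bool
detach_inside(struct toy_lock *l)
{
	bool ok;

	toy_enter(l);
	ok = toy_bpfdetach(l);	/* re-enters the lock and fails */
	toy_exit(l);
	return ok;
}

/* New ordering: the helper runs first, then the locked section. */
static bool
detach_outside(struct toy_lock *l)
{
	bool ok = toy_bpfdetach(l);

	toy_enter(l);
	/* ... the rest of the teardown would run here ... */
	toy_exit(l);
	return ok;
}
```

Only the ordering differs between the two variants, which is also all
that the diff above changes.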
Re: gdb hangs on exiting a running program
On 19/03/18(Mon) 15:38, Visa Hankala wrote:
> On Mon, Mar 19, 2018 at 12:27:10PM +0100, Martin Pieuchot wrote:
> > Thanks for the report.
> >
> > On 19/03/18(Mon) 09:49, Theo Buehler wrote:
> > > This is a regression that came with the TOCTOU race fix in kern_sig.c
> > > 1.216:
> > > https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/kern/kern_sig.c#rev1.216
> > > [...]
> > > Now gdb just hangs there and does nothing instead of exiting as
> > > expected. It doesn't react to ^C but one can easily kill it with
> > > ^Z and then kill %%.
> >
> > What happens is that the programs stays stopped. Or to be more precise
> > re-enter the SSTOP'd state after ptrace(PT_KILL...) has been issued by
> > gdb(1).
> > The problem comes from the fact that CURSIG() is now called twice in
> > userret(). That means that issignal() is also called twice. The fix
> > is to treat SIGKILL as special if the process is currently traced.
>
> As an alternative, the double call of issignal() could be avoided.

I like this. But I still think that we should handle SIGKILL correctly
in CURSIG(). However, your fix seems safer for release.

> CURSIG(p) evaluates to zero if p->p_siglist is zero, or eventually
> issignal(p) returns zero if there are no unmasked signals (that is,
> if (p->p_siglist & ~p->p_sigmask) == 0).

But if the process is being traced, issignal() is always called. Does
that mean that the `PS_TRACED' check is useless because issignal() also
starts with if (p->p_siglist & ~p->p_sigmask) == 0?

I'd prefer if you could use an (inline) function with an explicit name
like hassignal() or unmaskedsignal().

> Index: kern/kern_sig.c
> ===
> RCS file: src/sys/kern/kern_sig.c,v
> retrieving revision 1.216
> diff -u -p -r1.216 kern_sig.c
> --- kern/kern_sig.c	26 Feb 2018 13:33:25 -	1.216
> +++ kern/kern_sig.c	19 Mar 2018 15:28:33 -
> @@ -1833,7 +1833,7 @@ userret(struct proc *p)
>  		KERNEL_UNLOCK();
>  	}
>  
> -	if (CURSIG(p) != 0) {
> +	if ((p->p_siglist & ~p->p_sigmask) != 0) {
>  		KERNEL_LOCK();
>  		while ((signum = CURSIG(p)) != 0)
>  			postsig(p, signum);
Re: NFS socket use after free during reboot
On 08/03/18(Thu) 23:16, Alexander Bluhm wrote:
> Hi,
>
> When rebooting the NFS client while the NFS file system is actively
> used, the kernel crashes. The socket at 0xd73c2d9c is filled with
> dead beef, so it is a use after free. It is an i386 kernel built
> today.

There are multiple known issues with unmounting a busy NFS client. These
issues were previously masked by the "remount read-only" logic at
shutdown.

> root@ot2:.../~# find /mount >/dev/null & sleep 5; reboot -q
> [1] 9698
> syncing disks... uvm_fault(0xd72afc7c, 0x1ff11000, 0, 1) -> e
> kernel: page fault trap, code=0
> Stopped at	sblock+0x12:	movl	0x4(%eax),%eax
> ddb{0}> trace
> sblock(d73c2d9c,d73c2df0,1) at sblock+0x12
> soreceive(d73c2d9c,0,f548d818,f548d884,0,f548d804,0) at soreceive+0x271
> nfs_receive(d7471f7c,f548d87c,f548d884) at nfs_receive+0xb1
> nfs_reply(d7471f7c) at nfs_reply+0x62
> nfs_request(d6d1f3c4,10,f548d970) at nfs_request+0x24d
> nfs_readdirrpc(d6d1f3c4,f548d9f8,d7499120,f548d9ec) at nfs_readdirrpc+0x1dc
> nfs_readdir(f548dab0) at nfs_readdir+0x227
> VOP_READDIR(d6d1f3c4,f548daf8,d7499120,f548daec) at VOP_READDIR+0x42
> sys_getdents(d71372dc,f548db68,f548db60) at sys_getdents+0x118
> syscall() at syscall+0x204
> --- syscall (number 0) ---

Your trace shows two things. First, the userland thread doing
getdents(2) is getting scheduled after nfs_unmount() has freed the
socket. Second, it shows that such a thread has no way to know that
the socket is no longer valid.

My previous attempt to fix this problem, by preventing all reconnects as
soon as nfs_unmount() has been called, only moved the panic to a
different layer because NFS nodes don't have proper locking.

So here's a diff to add locking to NFS nodes. I couldn't reproduce the
panic above with it, so I'd be interested if you could try it. Note
that I didn't do much testing in write mode, so I'd suggest exporting
your "/mount" as 'ro' at first. Diskless setups are also probably
broken.

Index: nfs/nfs_node.c
===
RCS file: /cvs/src/sys/nfs/nfs_node.c,v
retrieving revision 1.65
diff -u -p -r1.65 nfs_node.c
--- nfs/nfs_node.c	27 Sep 2016 01:37:38 -	1.65
+++ nfs/nfs_node.c	20 Mar 2018 12:31:40 -
@@ -58,8 +58,6 @@ struct pool nfs_node_pool;
 
 extern int prtactive;
 
-struct rwlock nfs_hashlock = RWLOCK_INITIALIZER("nfshshlk");
-
 /* XXX */
 extern struct vops nfs_vops;
 
@@ -98,12 +96,10 @@ nfs_nget(struct mount *mnt, nfsfh_t *fh,
 	nmp = VFSTONFS(mnt);
 
 loop:
-	rw_enter_write(&nfs_hashlock);
 	find.n_fhp = fh;
 	find.n_fhsize = fhsize;
 	np = RBT_FIND(nfs_nodetree, &nmp->nm_ntree, &find);
 	if (np != NULL) {
-		rw_exit_write(&nfs_hashlock);
 		vp = NFSTOV(np);
 		error = vget(vp, LK_EXCLUSIVE, p);
 		if (error)
@@ -120,25 +116,28 @@ loop:
 	 * to see if this nfsnode has been added while we did not hold
 	 * the lock.
 	 */
-	rw_exit_write(&nfs_hashlock);
 	error = getnewvnode(VT_NFS, mnt, &nfs_vops, &nvp);
 	/* note that we don't have this vnode set up completely yet */
-	rw_enter_write(&nfs_hashlock);
 	if (error) {
 		*npp = NULL;
-		rw_exit_write(&nfs_hashlock);
 		return (error);
 	}
 	nvp->v_flag |= VLARVAL;
-	np = RBT_FIND(nfs_nodetree, &nmp->nm_ntree, &find);
-	if (np != NULL) {
+	np = pool_get(&nfs_node_pool, PR_WAITOK | PR_ZERO);
+	/*
+	 * getnewvnode() and pool_get() can sleep, check for race.
+	 */
+	if (RBT_FIND(nfs_nodetree, &nmp->nm_ntree, &find) != NULL) {
+		pool_put(&nfs_node_pool, np);
 		vgone(nvp);
-		rw_exit_write(&nfs_hashlock);
 		goto loop;
 	}
 
 	vp = nvp;
-	np = pool_get(&nfs_node_pool, PR_WAITOK | PR_ZERO);
+#ifdef VFSLCKDEBUG
+	vp->v_flag |= VLOCKSWORK;
+#endif
+	rrw_init_flags(&np->n_lock, "nfsnode", RWL_DUPOK | RWL_IS_VNODE);
 	vp->v_data = np;
 	/* we now have an nfsnode on this vnode */
 	vp->v_flag &= ~VLARVAL;
@@ -159,10 +158,11 @@ loop:
 	np->n_fhp = &np->n_fh;
 	bcopy(fh, np->n_fhp, fhsize);
 	np->n_fhsize = fhsize;
+	/* lock the nfsnode, then put it on the rbtree */
+	rrw_enter(&np->n_lock, RW_WRITE);
 	np2 = RBT_INSERT(nfs_nodetree, &nmp->nm_ntree, np);
 	KASSERT(np2 == NULL);
 	np->n_accstamp = -1;
-	rw_exit(&nfs_hashlock);
 
 	*npp = np;
 	return (0);
@@ -201,9 +201,10 @@ nfs_inactive(void *v)
 		 * Remove the silly file that was rename'd earlier
 		 */
 		nfs_vinvalbuf(ap->a_vp, 0, sp->s_cred, curproc);
+		vn_lock(sp->s_dvp, LK_EXCLUSIVE | LK_RETRY, curproc);
 		nfs_removeit(sp);
 		crfree(sp->s_cred);
-		vrele(sp->s_dvp);
+
Re: gdb hangs on exiting a running program
On 20/03/18(Tue) 17:04, Visa Hankala wrote:
> On Tue, Mar 20, 2018 at 10:45:56AM +0100, Martin Pieuchot wrote:
> > On 19/03/18(Mon) 15:38, Visa Hankala wrote:
> > > On Mon, Mar 19, 2018 at 12:27:10PM +0100, Martin Pieuchot wrote:
> > > > Thanks for the report.
> > > >
> > > > On 19/03/18(Mon) 09:49, Theo Buehler wrote:
> > > > > This is a regression that came with the TOCTOU race fix in kern_sig.c
> > > > > 1.216:
> > > > > https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/kern/kern_sig.c#rev1.216
> > > > > [...]
> > > > > Now gdb just hangs there and does nothing instead of exiting as
> > > > > expected. It doesn't react to ^C but one can easily kill it with
> > > > > ^Z and then kill %%.
> > > >
> > > > What happens is that the programs stays stopped. Or to be more precise
> > > > re-enter the SSTOP'd state after ptrace(PT_KILL...) has been issued by
> > > > gdb(1).
> > > > The problem comes from the fact that CURSIG() is now called twice in
> > > > userret(). That means that issignal() is also called twice. The fix
> > > > is to treat SIGKILL as special if the process is currently traced.
> > >
> > > As an alternative, the double call of issignal() could be avoided.
> >
> > I like this. But I still think that we should handle SIGKILL correctly
> > in CURSIG(). However your fix seems safer for release.
> >
> > > CURSIG(p) evaluates to zero if p->p_siglist is zero, or eventually
> > > issignal(p) returns zero if there are no unmasked signals (that is,
> > > if (p->p_siglist & ~p->p_sigmask) == 0).
> >
> > But if the process is being traced issignal() is always called. Does
> > that mean that the `PS_TRACED' check is useless because issignal() also
> > starts with if (p->p_siglist & ~p->p_sigmask) == 0?
>
> So it seems. The trace point is taken only if the signal mask allows
> signal delivery.
>
> > I'd prefer if you could used a function (inline) with an explicit name
> > like hassignal() or unmaskedsignal()?
>
> Updated diff:

I like it.
If you don't return a boolean but the mask of pending signals in the
macro, we could use it in issignal(). But that can be for a later
change.

ok mpi@

> Index: kern/kern_sig.c
> ===
> RCS file: src/sys/kern/kern_sig.c,v
> retrieving revision 1.216
> diff -u -p -r1.216 kern_sig.c
> --- kern/kern_sig.c	26 Feb 2018 13:33:25 -	1.216
> +++ kern/kern_sig.c	20 Mar 2018 16:53:25 -
> @@ -1833,7 +1833,7 @@ userret(struct proc *p)
>  		KERNEL_UNLOCK();
>  	}
>  
> -	if (CURSIG(p) != 0) {
> +	if (SIGPENDING(p)) {
>  		KERNEL_LOCK();
>  		while ((signum = CURSIG(p)) != 0)
>  			postsig(p, signum);
> Index: sys/signalvar.h
> ===
> RCS file: src/sys/sys/signalvar.h,v
> retrieving revision 1.29
> diff -u -p -r1.29 signalvar.h
> --- sys/signalvar.h	26 Feb 2018 13:33:25 -	1.29
> +++ sys/signalvar.h	20 Mar 2018 16:53:25 -
> @@ -66,6 +66,11 @@ struct sigacts {
>  #define	SIG_HOLD	(void (*)(int))3
>  
>  /*
> + * Check if process p has an unmasked signal pending.
> + */
> +#define SIGPENDING(p)	(((p)->p_siglist & ~(p)->p_sigmask) != 0)
> +
> +/*
>   * Determine signal that should be delivered to process p, the current
>   * process, 0 if none.  If there is a pending stop signal with default
>   * action, the process stops in issignal().
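
[Editor's sketch] The SIGPENDING() macro from the quoted diff, and the
mask-returning variant suggested in the reply, can be tried outside the
kernel. This is a user-space model under stated assumptions: struct
toy_proc, TOY_SIGBIT() and sigpending_mask() are invented names, and
only the macro body is copied from the diff.

```c
#include <stdint.h>

/* Minimal stand-in for the two struct proc fields the macro reads. */
struct toy_proc {
	uint32_t p_siglist;	/* signals that have arrived */
	uint32_t p_sigmask;	/* signals currently blocked */
};

/* Bit for signal number n (1-based), in the style of BSD's sigmask(). */
#define TOY_SIGBIT(n)	(UINT32_C(1) << ((n) - 1))

/* Same body as the macro added to sys/signalvar.h in the diff. */
#define SIGPENDING(p)	(((p)->p_siglist & ~(p)->p_sigmask) != 0)

/* Hypothetical variant returning the set of deliverable signals. */
static inline uint32_t
sigpending_mask(const struct toy_proc *p)
{
	return p->p_siglist & ~p->p_sigmask;
}
```

The boolean form is enough for the cheap test in userret(); the mask
form would additionally tell a caller which signals are deliverable.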
Re: NFS socket use after free during reboot
On 20/03/18(Tue) 20:09, Alexander Bluhm wrote:
> On Tue, Mar 20, 2018 at 02:24:40PM +0100, Martin Pieuchot wrote:
> > So here's a diff to add locking to NFS nodes. I couldn't reproduce the
> > panic above with it. So I'd be interested if you could try it. Note
> > that I didn't do much tests in write mode, so I'd suggest exporting your
> > "/mount" as 'ro' in a first time. Diskless setups are also probably
> > broken.
>
> This diff fixes my reboot test case. I was only using a read-only
> mount when I reported the panic.
>
> But now the /usr/src/regress/sys/ffs/nfs test hangs in "nfsnode".

Because I forgot to unlock the parent's vnode in nfs_remove(), diff
below fixes that.

Index: nfs/nfs_node.c
===
RCS file: /cvs/src/sys/nfs/nfs_node.c,v
retrieving revision 1.65
diff -u -p -r1.65 nfs_node.c
--- nfs/nfs_node.c	27 Sep 2016 01:37:38 -	1.65
+++ nfs/nfs_node.c	20 Mar 2018 12:31:40 -
@@ -58,8 +58,6 @@ struct pool nfs_node_pool;
 
 extern int prtactive;
 
-struct rwlock nfs_hashlock = RWLOCK_INITIALIZER("nfshshlk");
-
 /* XXX */
 extern struct vops nfs_vops;
 
@@ -98,12 +96,10 @@ nfs_nget(struct mount *mnt, nfsfh_t *fh,
 	nmp = VFSTONFS(mnt);
 
 loop:
-	rw_enter_write(&nfs_hashlock);
 	find.n_fhp = fh;
 	find.n_fhsize = fhsize;
 	np = RBT_FIND(nfs_nodetree, &nmp->nm_ntree, &find);
 	if (np != NULL) {
-		rw_exit_write(&nfs_hashlock);
 		vp = NFSTOV(np);
 		error = vget(vp, LK_EXCLUSIVE, p);
 		if (error)
@@ -120,25 +116,28 @@ loop:
 	 * to see if this nfsnode has been added while we did not hold
 	 * the lock.
 	 */
-	rw_exit_write(&nfs_hashlock);
 	error = getnewvnode(VT_NFS, mnt, &nfs_vops, &nvp);
 	/* note that we don't have this vnode set up completely yet */
-	rw_enter_write(&nfs_hashlock);
 	if (error) {
 		*npp = NULL;
-		rw_exit_write(&nfs_hashlock);
 		return (error);
 	}
 	nvp->v_flag |= VLARVAL;
-	np = RBT_FIND(nfs_nodetree, &nmp->nm_ntree, &find);
-	if (np != NULL) {
+	np = pool_get(&nfs_node_pool, PR_WAITOK | PR_ZERO);
+	/*
+	 * getnewvnode() and pool_get() can sleep, check for race.
+	 */
+	if (RBT_FIND(nfs_nodetree, &nmp->nm_ntree, &find) != NULL) {
+		pool_put(&nfs_node_pool, np);
 		vgone(nvp);
-		rw_exit_write(&nfs_hashlock);
 		goto loop;
 	}
 
 	vp = nvp;
-	np = pool_get(&nfs_node_pool, PR_WAITOK | PR_ZERO);
+#ifdef VFSLCKDEBUG
+	vp->v_flag |= VLOCKSWORK;
+#endif
+	rrw_init_flags(&np->n_lock, "nfsnode", RWL_DUPOK | RWL_IS_VNODE);
 	vp->v_data = np;
 	/* we now have an nfsnode on this vnode */
 	vp->v_flag &= ~VLARVAL;
@@ -159,10 +158,11 @@ loop:
 	np->n_fhp = &np->n_fh;
 	bcopy(fh, np->n_fhp, fhsize);
 	np->n_fhsize = fhsize;
+	/* lock the nfsnode, then put it on the rbtree */
+	rrw_enter(&np->n_lock, RW_WRITE);
 	np2 = RBT_INSERT(nfs_nodetree, &nmp->nm_ntree, np);
 	KASSERT(np2 == NULL);
 	np->n_accstamp = -1;
-	rw_exit(&nfs_hashlock);
 
 	*npp = np;
 	return (0);
@@ -201,9 +201,10 @@ nfs_inactive(void *v)
 		 * Remove the silly file that was rename'd earlier
 		 */
 		nfs_vinvalbuf(ap->a_vp, 0, sp->s_cred, curproc);
+		vn_lock(sp->s_dvp, LK_EXCLUSIVE | LK_RETRY, curproc);
 		nfs_removeit(sp);
 		crfree(sp->s_cred);
-		vrele(sp->s_dvp);
+		vput(sp->s_dvp);
 		free(sp, M_NFSREQ, sizeof(*sp));
 	}
 	np->n_flag &= (NMODIFIED | NFLUSHINPROG | NFLUSHWANT);
@@ -239,9 +240,7 @@ nfs_reclaim(void *v)
 		    ap->a_vp);
 #endif
 	nmp = VFSTONFS(vp->v_mount);
-	rw_enter_write(&nfs_hashlock);
 	RBT_REMOVE(nfs_nodetree, &nmp->nm_ntree, np);
-	rw_exit_write(&nfs_hashlock);
 
 	if (np->n_rcred)
 		crfree(np->n_rcred);
Index: nfs/nfs_vfsops.c
===
RCS file: /cvs/src/sys/nfs/nfs_vfsops.c,v
retrieving revision 1.116
diff -u -p -r1.116 nfs_vfsops.c
--- nfs/nfs_vfsops.c	10 Feb 2018 05:24:23 -	1.116
+++ nfs/nfs_vfsops.c	20 Mar 2018 10:27:24 -
@@ -178,7 +178,7 @@ nfs_statfs(struct mount *mp, struct stat
 	copy_statfs_info(sbp, mp);
 	m_freem(info.nmi_mrep);
 nfsmout: 
-	vrele(vp);
+	vput(vp);
 	crfree(cred);
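
[Editor's sketch] The nfs_nget() part of the diff follows an
"allocate first, then look again" pattern: both allocations may sleep,
so the lookup is repeated afterwards and the fresh object is thrown away
if somebody else inserted the same key in the meantime. Below is a toy
user-space model of that pattern. All names (toy_table, toy_get(),
toy_racer(), the on_sleep hook) are hypothetical, and a flat array
stands in for the red-black tree.

```c
#include <stdlib.h>

#define TOY_MAX	8

struct toy_node {
	int key;
};

struct toy_table {
	struct toy_node *slot[TOY_MAX];
	int n;
	int race_key;				/* key the racer inserts */
	void (*on_sleep)(struct toy_table *);	/* runs while we "sleep" */
};

static struct toy_node *
toy_find(struct toy_table *t, int key)
{
	for (int i = 0; i < t->n; i++)
		if (t->slot[i]->key == key)
			return t->slot[i];
	return NULL;
}

/* Pretend another thread inserted race_key while we slept. */
static void
toy_racer(struct toy_table *t)
{
	struct toy_node *np = malloc(sizeof(*np));

	if (np == NULL || t->n == TOY_MAX) {
		free(np);
		return;
	}
	np->key = t->race_key;
	t->slot[t->n++] = np;
}

/* Allocation that may "sleep": the hook fires once, if set. */
static struct toy_node *
toy_alloc(struct toy_table *t, int key)
{
	struct toy_node *np = malloc(sizeof(*np));
	void (*hook)(struct toy_table *) = t->on_sleep;

	if (np != NULL)
		np->key = key;
	if (hook != NULL) {
		t->on_sleep = NULL;
		hook(t);
	}
	return np;
}

static struct toy_node *
toy_get(struct toy_table *t, int key)
{
	struct toy_node *np;

	for (;;) {
		if ((np = toy_find(t, key)) != NULL)
			return np;		/* already present */
		np = toy_alloc(t, key);		/* may sleep */
		if (np == NULL || t->n == TOY_MAX) {
			free(np);
			return NULL;
		}
		if (toy_find(t, key) != NULL) {
			free(np);		/* lost the race, retry */
			continue;
		}
		t->slot[t->n++] = np;
		return np;
	}
}
```

The second toy_find() plays the role of the second RBT_FIND() in the
diff, and free() followed by the retry corresponds to pool_put(),
vgone() and "goto loop".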
modesetting driver broke video(1)
Since we switched to the modesetting driver by default, the supported
XvImage formats no longer include YUY2 or UYVY, which video(1) expects.

Using the following Xorg.conf makes video(1) work again.

Section "Device"
	Identifier "Device0"
	Driver "intel"
EndSection

Attached are the outputs of xvinfo(1) with the modesetting driver and
the intel driver.

X-Video Extension version 2.2
screen #0
  Adaptor #0: "GLAMOR Textured Video"
    number of ports: 16
    port base: 96
    operations supported: PutImage
    supported visuals:
      depth 24, visualID 0x21
    number of attributes: 5
      "XV_BRIGHTNESS" (range -1000 to 1000)
              client settable attribute
              client gettable attribute (current value is 0)
      "XV_CONTRAST" (range -1000 to 1000)
              client settable attribute
              client gettable attribute (current value is 0)
      "XV_SATURATION" (range -1000 to 1000)
              client settable attribute
              client gettable attribute (current value is 0)
      "XV_HUE" (range -1000 to 1000)
              client settable attribute
              client gettable attribute (current value is 0)
      "XV_COLORSPACE" (range 0 to 1)
              client settable attribute
              client gettable attribute (current value is 0)
    maximum XvImage size: 8192 x 8192
    Number of image formats: 2
      id: 0x32315659 (YV12)
        guid: 59563132--0010-8000-00aa00389b71
        bits per pixel: 12
        number of planes: 3
        type: YUV (planar)
      id: 0x30323449 (I420)
        guid: 49343230--0010-8000-00aa00389b71
        bits per pixel: 12
        number of planes: 3
        type: YUV (planar)

X-Video Extension version 2.2
screen #0
  Adaptor #0: "Intel(R) Textured Video"
    number of ports: 16
    port base: 75
    operations supported: PutImage
    supported visuals:
      depth 24, visualID 0x20
    number of attributes: 1
      "XV_SYNC_TO_VBLANK" (range -1 to 1)
              client settable attribute
              client gettable attribute (current value is 1)
    maximum XvImage size: 16384 x 16384
    Number of image formats: 5
      id: 0x32595559 (YUY2)
        guid: 59555932--0010-8000-00aa00389b71
        bits per pixel: 16
        number of planes: 1
        type: YUV (packed)
      id: 0x32315659 (YV12)
        guid: 59563132--0010-8000-00aa00389b71
        bits per pixel: 12
        number of planes: 3
        type: YUV (planar)
      id: 0x30323449 (I420)
        guid: 49343230--0010-8000-00aa00389b71
        bits per pixel: 12
        number of planes: 3
        type: YUV (planar)
      id: 0x59565955 (UYVY)
        guid: 55595659--0010-8000-00aa00389b71
        bits per pixel: 16
        number of planes: 1
        type: YUV (packed)
      id: 0x434d5658 (XVMC)
        guid: 58564d43--0010-8000-00aa00389b71
        bits per pixel: 12
        number of planes: 3
        type: YUV (planar)
  Adaptor #1: "Intel(R) Video Sprite"
    number of ports: 1
    port base: 91
    operations supported: PutImage
    supported visuals:
      depth 24, visualID 0x20
    number of attributes: 2
      "XV_COLORKEY" (range 0 to 16777215)
              client settable attribute
              client gettable attribute (current value is 66046)
      "XV_ALWAYS_ON_TOP" (range 0 to 1)
              client settable attribute
              client gettable attribute (current value is 0)
    maximum XvImage size: 8192 x 8192
    Number of image formats: 3
      id: 0x32595559 (YUY2)
        guid: 59555932--0010-8000-00aa00389b71
        bits per pixel: 16
        number of planes: 1
        type: YUV (packed)
      id: 0x59565955 (UYVY)
        guid: 55595659--0010-8000-00aa00389b71
        bits per pixel: 16
        number of planes: 1
        type: YUV (packed)
      id: 0x18424752
        guid: 50415353-5448-524f-5547-485247423234
        bits per pixel: 32
        number of planes: 1
        type: RGB (packed)
        depth: 24
        red, green, blue masks: 0xff, 0xff00, 0xff
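
[Editor's sketch] The image format ids printed by xvinfo(1) above are
four-character codes packed into a 32-bit value with the first character
in the least significant byte, which is why 0x32595559 is shown as YUY2.
A small helper makes it easy to compare the two listings. The function
name fourcc_to_str() is made up for this note and is not part of any X
library.

```c
#include <stdint.h>

/*
 * Decode an XvImage format id into its four-character code.
 * `out' must have room for four characters plus the terminating NUL.
 */
static void
fourcc_to_str(uint32_t id, char out[5])
{
	for (int i = 0; i < 4; i++)
		out[i] = (char)((id >> (8 * i)) & 0xff);
	out[4] = '\0';
}
```

Decoding the ids from the two listings gives YV12 and I420 for the
modesetting driver, and additionally YUY2 and UYVY for the intel driver.
The fourth byte is not always printable: for 0x18424752 it is 0x18,
which may be why that entry has no name next to it.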
Re: Kernel Panic on 6.2 amd64 when run0 RT3070 based device is attached during boot
gh-speed, mmc high-speed, dma
> pchb2 at pci0 dev 24 function 0 "AMD AMD64 16h Link Cfg" rev 0x00
> pchb3 at pci0 dev 24 function 1 "AMD AMD64 16h Address Map" rev 0x00
> pchb4 at pci0 dev 24 function 2 "AMD AMD64 16h DRAM Cfg" rev 0x00
> km0 at pci0 dev 24 function 3 "AMD AMD64 16h Misc Cfg" rev 0x00
> pchb5 at pci0 dev 24 function 4 "AMD AMD64 16h CPU Power" rev 0x00
> pchb6 at pci0 dev 24 function 5 vendor "AMD", unknown product 0x1535 rev
> 0x00
> usb3 at ohci0: USB revision 1.0
> uhub3 at usb3 configuration 1 interface 0 "AMD OHCI root hub" rev
> 1.00/1.00 addr 1
> usb4 at ohci1: USB revision 1.0
> uhub4 at usb4 configuration 1 interface 0 "AMD OHCI root hub" rev
> 1.00/1.00 addr 1
> isa0 at pcib0
> isadma0 at isa0
> com0 at isa0 port 0x3f8/8 irq 4: ns16550a, 16 byte fifo
> pckbc0 at isa0 port 0x60/5 irq 1 irq 12
> pcppi0 at isa0 port 0x61
> spkr0 at pcppi0
> vmm0 at mainbus0: SVM/RVI
> sdmmc0: can't enable card
> axen0 at uhub0 port 1 configuration 1 interface 0 "ASIX Elec. Corp.
> AX88179" rev 3.00/1.00 addr 2
> axen0: AX88179, address xx:xx:xx:xx:xx:xx
> rgephy0 at axen0 phy 3: RTL8169S/8110S/8211 PHY, rev. 5
> uhidev0 at uhub3 port 1 configuration 1 interface 0 "Dell Dell Smart
> Card Reader Keyboard" rev 2.00/1.00 addr 2
> uhidev0: iclass 3/1
> ukbd0 at uhidev0: 8 variable keys, 6 key codes
> wskbd0 at ukbd0: console keyboard, using wsdisplay0
> ugen0 at uhub3 port 1 configuration 1 "Dell Dell Smart Card Reader
> Keyboard" rev 2.00/1.00 addr 2
> vscsi0 at root
> scsibus2 at vscsi0: 256 targets
> softraid0 at root
> scsibus3 at softraid0: 256 targets
> softraid0: sd1 was not shutdown properly
> sd1 at scsibus3 targ 1 lun 0: SCSI2 0/direct fixed
> sd1: 476937MB, 512 bytes/sector, 976767473 sectors
> root on sd1a (9990ff6713f15d12.a) swap on sd1b dump on sd1b
> WARNING: / was not properly unmounted
> --
>
> Denis
>
> On 1/25/2018 5:34 PM, Martin Pieuchot wrote:
> > Hello Denis,
> >
> > On 25/01/18(Thu) 17:16, Denis wrote:
> >> Finally catch kernel panic in the middle of run adapter work.
> >
> > Could you please set ddb.panic to 1?
> >
> > It's hard to figure out what's wrong in your reports because as soon as
> > your machine tries to reboot it panics, panics and panics again. So we
> > can't tell what is the first (real) problem.
> >
> > And please stop cross posting. bugs@ is enough for such problems :)
> >
> > Thanks,
> > Martin
> >
Re: amd64/machdep knob: forceukb forcing wrong encoding.
On 05/02/18(Mon) 18:31, Artturi Alm wrote:
> On Mon, Feb 05, 2018 at 02:51:48PM +0100, Martin Pieuchot wrote:
> > On 04/02/18(Sun) 11:28, Artturi Alm wrote:
> > > Hi,
> > >
> > > machdep.forceukbd=1 feels broken to me, as i use "sv", and it doesn't
> > > respect
> > > /etc/kbdtype.
> >
> > If you unplug/replug your USB keyboard after having booted does it
> > respect /etc/kbdtype?
>
> Yes, no issues when machdep.forceukbd=0, and i do that unplug/replug-dance
> "in software" several times a day, as i use the same mouse+keyboard
> on my VM for games.

Diff below fixes the problem. It turns out that the layout configured
with kbd(8) is stored in the mux, but the value of the mux wasn't read
for the console keyboard since it is supposed to attach first.

Index: dev/wscons/wskbd.c
===
RCS file: /cvs/src/sys/dev/wscons/wskbd.c,v
retrieving revision 1.90
diff -u -p -r1.90 wskbd.c
--- dev/wscons/wskbd.c	19 Feb 2018 08:59:52 -	1.90
+++ dev/wscons/wskbd.c	27 Mar 2018 11:35:51 -
@@ -373,21 +373,11 @@ wskbd_attach(struct device *parent, stru
 #endif
 #if NWSMUX > 0
 	mux = sc->sc_base.me_dv.dv_cfdata->wskbddevcf_mux;
-	if (ap->console) {
-		/* Ignore mux for console; it always goes to the console mux. */
-		/* printf(" (mux %d ignored for console)", mux); */
-		mux = -1;
-	}
 	if (mux >= 0) {
 		printf(" mux %d", mux);
 		wsmux_sc = wsmux_getmux(mux);
 	} else
 		wsmux_sc = NULL;
-#else
-#if 0	/* not worth keeping, especially since the default value is not -1... */
-	if (sc->sc_base.me_dv.dv_cfdata->wskbddevcf_mux >= 0)
-		printf(" (mux ignored)");
-#endif
 #endif	/* NWSMUX > 0 */
 
 	if (ap->console) {
@@ -462,7 +452,8 @@ wskbd_attach(struct device *parent, stru
 	printf("\n");
 
 #if NWSMUX > 0
-	if (wsmux_sc != NULL) {
+	/* Ignore mux for console; it always goes to the console mux. */
+	if (wsmux_sc != NULL && ap->console == 0) {
 		error = wsmux_attach_sc(wsmux_sc, &sc->sc_base);
 		if (error)
 			printf("%s: attach error=%d\n",
Re: amd64/machdep knob: forceukb forcing wrong encoding.
On 10/04/18(Tue) 11:57, Mark Kettenis wrote:
> > Date: Tue, 27 Mar 2018 13:40:02 +0200
> > From: Martin Pieuchot
> >
> > On 05/02/18(Mon) 18:31, Artturi Alm wrote:
> > > On Mon, Feb 05, 2018 at 02:51:48PM +0100, Martin Pieuchot wrote:
> > > > On 04/02/18(Sun) 11:28, Artturi Alm wrote:
> > > > > Hi,
> > > > >
> > > > > machdep.forceukbd=1 feels broken to me, as i use "sv", and it doesn't
> > > > > respect
> > > > > /etc/kbdtype.
> > > >
> > > > If you unplug/replug your USB keyboard after having booted does it
> > > > respect /etc/kbdtype?
> > >
> > > Yes, no issues when machdep.forceukbd=0, and i do that unplug/replug-dance
> > > "in software" several times a day, as i use the same mouse+keyboard
> > > on my VM for games.
> >
> > Diff below fixes the problem. Turns out that the layout configured with
> > kbd(8) is stored in the mux. But the value of the mux wasn't read for
> > console keyboard since it is supposed to attach first.
> >
> > Index: dev/wscons/wskbd.c
> > ===
> > RCS file: /cvs/src/sys/dev/wscons/wskbd.c,v
> > retrieving revision 1.90
> > diff -u -p -r1.90 wskbd.c
> > --- dev/wscons/wskbd.c	19 Feb 2018 08:59:52 -	1.90
> > +++ dev/wscons/wskbd.c	27 Mar 2018 11:35:51 -
> > @@ -373,21 +373,11 @@ wskbd_attach(struct device *parent, stru
> >  #endif
> >  #if NWSMUX > 0
> >  	mux = sc->sc_base.me_dv.dv_cfdata->wskbddevcf_mux;
> > -	if (ap->console) {
> > -		/* Ignore mux for console; it always goes to the console mux. */
> > -		/* printf(" (mux %d ignored for console)", mux); */
> > -		mux = -1;
> > -	}
> >  	if (mux >= 0) {
> >  		printf(" mux %d", mux);
>
> Should this printf be skipped for the console?

I don't mind. If we go this way, here's a diff.

Index: dev/wscons/wskbd.c
===
RCS file: /cvs/src/sys/dev/wscons/wskbd.c,v
retrieving revision 1.90
diff -u -p -r1.90 wskbd.c
--- dev/wscons/wskbd.c	19 Feb 2018 08:59:52 -	1.90
+++ dev/wscons/wskbd.c	10 Apr 2018 10:37:53 -
@@ -362,7 +362,7 @@ wskbd_attach(struct device *parent, stru
 	struct wskbddev_attach_args *ap = aux;
 	kbd_t layout;
 #if NWSMUX > 0
-	struct wsmux_softc *wsmux_sc;
+	struct wsmux_softc *wsmux_sc = NULL;
 	int mux, error;
 #endif
 
@@ -373,21 +373,8 @@ wskbd_attach(struct device *parent, stru
 #endif
 #if NWSMUX > 0
 	mux = sc->sc_base.me_dv.dv_cfdata->wskbddevcf_mux;
-	if (ap->console) {
-		/* Ignore mux for console; it always goes to the console mux. */
-		/* printf(" (mux %d ignored for console)", mux); */
-		mux = -1;
-	}
-	if (mux >= 0) {
-		printf(" mux %d", mux);
+	if (mux >= 0)
 		wsmux_sc = wsmux_getmux(mux);
-	} else
-		wsmux_sc = NULL;
-#else
-#if 0	/* not worth keeping, especially since the default value is not -1... */
-	if (sc->sc_base.me_dv.dv_cfdata->wskbddevcf_mux >= 0)
-		printf(" (mux ignored)");
-#endif
 #endif	/* NWSMUX > 0 */
 
 	if (ap->console) {
@@ -459,14 +446,14 @@ wskbd_attach(struct device *parent, stru
 		printf(", using %s", sc->sc_displaydv->dv_xname);
 #endif
 	}
-	printf("\n");
 
 #if NWSMUX > 0
-	if (wsmux_sc != NULL) {
+	/* Ignore mux for console; it always goes to the console mux. */
+	if (wsmux_sc != NULL && ap->console == 0) {
+		printf(" mux %d", mux);
 		error = wsmux_attach_sc(wsmux_sc, &sc->sc_base);
 		if (error)
-			printf("%s: attach error=%d\n",
-			    sc->sc_base.me_dv.dv_xname, error);
+			printf(": attach error=%d", error);
 
 		/*
 		 * Try and set this encoding as the mux default if it
@@ -479,6 +466,7 @@ wskbd_attach(struct device *parent, stru
 			wsmux_set_layout(wsmux_sc, layout);
 	}
 #endif
+	printf("\n");
 
 #if NWSDISPLAY > 0 && NWSMUX == 0
 	if (ap->console == 0) {
Re: Thunar dies and dumps core
On 10/04/18(Tue) 19:49, sudhir kumar lal wrote:
> Hi,
>
> I use CWM and snapshot of OpenBSD 6.3 and thunar crashes a lot on my
> system too. But it opens files on my system nicely, it only crashes when i
> use Shift+Delete to delete a file. then it core dumps and dies almost every
> time!

It's due to a race in the kqueue(2) backend. Here's a diff for
devel/glib2 that should improve the situation.

I'm going to submit the diff below, commit 1124732 upstream.

Index: Makefile
===
RCS file: /cvs/ports/devel/glib2/Makefile,v
retrieving revision 1.270
diff -u -p -r1.270 Makefile
--- Makefile	20 Feb 2018 16:59:19 -	1.270
+++ Makefile	11 Apr 2018 14:21:00 -
@@ -9,7 +9,7 @@ COMMENT=	general-purpose utility librar
 GNOME_PROJECT=	glib
 GNOME_VERSION=	2.54.3
 PKGNAME=	${DISTNAME:S/glib/glib2/}
-REVISION=	1
+REVISION=	2
 
 CATEGORIES=	devel
 
Index: patches/patch-00_kqueue_fix
===
RCS file: patches/patch-00_kqueue_fix
diff -N patches/patch-00_kqueue_fix
--- /dev/null	1 Jan 1970 00:00:00 -
+++ patches/patch-00_kqueue_fix	11 Apr 2018 14:26:44 -
@@ -0,0 +1,2060 @@
+commit aa39a0557c679fc345b0ba72a87c33152eb8ebcd
+Author: Martin Pieuchot
+Date:   Tue Feb 20 16:57:00 2018 +
+
+    kqueue: Multiple fixes and simplifications
+
+     - Stop using a custom thread for listening to kqueue(2) events. Instead
+       call kevent(2) in non blocking mode in a monitor callback. Under the
+       hood poll(2) is used to figure out if new events are available.
+
+     - Do not use a socketpair with a custom protocol requiring 2 supplementary
+       context switches per event to communicate between multiple threads. Calling
+       kevent(2), in non blocking mode, to add/remove events is fine from any
+       context.
+
+     - Add kqueue(2) events without the EV_ONESHOT flag. This removes a race
+       where some notifications were lost because events had to be re-added for
+       every new notification.
+
+     - Get rid of the global hash table and its associated lock and races. Use
+       the 'cookie' argument of kevent(2) to pass the associated descriptor when
+       registering an event.
+
+     - Fix _kh_file_appeared_cb() by properly passing a monitor instead of a
+       source to g_file_monitor_emit_event().
+
+     - Properly refcount sources.
+
+     - Remove a lot of abstraction making it harder to fix the remaining issues.
+
+    https://bugzilla.gnome.org/show_bug.cgi?id=739424
+
+diff --git gio/kqueue/Makefile.am gio/kqueue/Makefile.am
+index d5657d7e4..24e9724e5 100644
+--- gio/kqueue/Makefile.am
++++ gio/kqueue/Makefile.am
+@@ -4,19 +4,9 @@ noinst_LTLIBRARIES += libkqueue.la
+ 
+ libkqueue_la_SOURCES = \
+ 	gkqueuefilemonitor.c \
+-	gkqueuefilemonitor.h \
+ 	kqueue-helper.c \
+ 	kqueue-helper.h \
+-	kqueue-thread.c \
+-	kqueue-thread.h \
+-	kqueue-sub.c \
+-	kqueue-sub.h \
+ 	kqueue-missing.c \
+-	kqueue-missing.h \
+-	kqueue-utils.c \
+-	kqueue-utils.h \
+-	kqueue-exclusions.c \
+-	kqueue-exclusions.h \
+ 	dep-list.c \
+ 	dep-list.h \
+ 	$(NULL)
+diff --git gio/kqueue/gkqueuefilemonitor.c gio/kqueue/gkqueuefilemonitor.c
+index 78b749637..deed8b1e1 100644
+--- gio/kqueue/gkqueuefilemonitor.c
++++ gio/kqueue/gkqueuefilemonitor.c
+@@ -22,33 +22,73 @@
+ 
+ #include "config.h"
+ 
+-#include "gkqueuefilemonitor.h"
+-#include "kqueue-helper.h"
+-#include "kqueue-exclusions.h"
++#include
++#include
++#include
++#include
++#include
++
++#include
++#include
++#include
++
++#include
++#include
++#include
++#include
+ #include
+ #include
+-#include
++#include
++#include "glib-private.h"
++
++#include "kqueue-helper.h"
++#include "dep-list.h"
++
++G_LOCK_DEFINE_STATIC (kq_lock);
++static GSource *kq_source;
++static int kq_queue = -1;
++
++#define G_TYPE_KQUEUE_FILE_MONITOR (g_kqueue_file_monitor_get_type ())
++#define G_KQUEUE_FILE_MONITOR(inst) (G_TYPE_CHECK_INSTANCE_CAST ((inst), \
++  G_TYPE_KQUEUE_FILE_MONITOR, GKqueueFileMonitor))
+ 
++typedef GLocalFileMonitorClass GKqueueFileMonitorClass;
+ 
+-struct _GKqueueFileMonitor
++typedef struct
+ {
+   GLocalFileMonitor parent_instance;
+ 
+   kqueue_sub *sub;
+-
++#ifndef O_EVTONLY
+   GFileMonitor *fallback;
+   GFile *fbfile;
+-};
++#endif
++} GKqueueFileMonitor;
++
++GType g_kqueue_file_monitor_get_type (void);
++G_DEFINE_TYPE_WITH_CODE (GKqueueFileMonitor, g_kqueue_file_monitor, G_TYPE_LOCAL_FILE_MONITOR,
++  g_io_extension_point_implement (G_LOCAL_FILE_MONITOR_EXTENSION_POINT_NAME,
++
Re: ddb(4): p[rint] man page example vs. result.
On 09/05/18(Wed) 07:48, Artturi Alm wrote:
> On Tue, May 08, 2018 at 01:44:39AM +0300, Artturi Alm wrote:

No bug is irrelevant to fix. But working with you is hard, really
hard. You never explain what the problem is. Reading your email is
an exercise in frustration because you can do some good work but you
fail to communicate.

> > (manual "copypaste"):
> > nc2k4hp# sysctl ddb.trigger=1
> > Stopped at db_enter+0x4: popl %ebp
> > ddb{0}> print/x "eax = " $eax "\necx = " $ecx "\n"
> > 3
> > ddb{0}> c
> > ddb.trigger: 0 -> 1
> >
> > so, for reasons yet unknown to me, p[rint] doesn't seem to work at all
> > like described in the man page, tested on i386.

What does not work? What does the man page describe? Do you expect us to
read the man page, then look at your mail again, then try to understand
what is not working?

> > Should it work? I hope it would.

What should work? Why do you hope? Maybe the manpage should be fixed?

> Does feel like waste of time to go any further fixing this, if this is
> yet another bug too irrelevant for anyone to ack for, so _any_ input
> here would be great.

Like I said, no bug is irrelevant, but if the one finding the bug, you
in this case, is not willing to properly explain the problem, then
better not send an email at all ;)
Re: ddb(4): p[rint] man page example vs. result.
On 09/05/18(Wed) 12:13, Artturi Alm wrote:
> On Wed, May 09, 2018 at 10:23:41AM +0200, Martin Pieuchot wrote:
> > On 09/05/18(Wed) 07:48, Artturi Alm wrote:
> > > On Tue, May 08, 2018 at 01:44:39AM +0300, Artturi Alm wrote:
> >
> > No bug is irrelevant to fix. But working with you is hard, really
> > hard. You never explain what the problem is. Reading your email is
> > an exercise in frustration because you can do some good work but you
> > fail to communicate.
> >
> > > > (manual "copypaste"):
> > > > nc2k4hp# sysctl ddb.trigger=1
> > > > Stopped at db_enter+0x4: popl %ebp
> > > > ddb{0}> print/x "eax = " $eax "\necx = " $ecx "\n"
> > > > 3
> > > > ddb{0}> c
> > > > ddb.trigger: 0 -> 1
> > > >
> > > > so, for reasons yet unknown to me, p[rint] doesn't seem to work at all
> > > > like described in the man page, tested on i386.
> >
> > What does not work? What does the man page describe? Do you expect us to
> > read the man page, then look at your mail again, then try to understand
> > what is not working?
> >
>
> For example,
>
> print/x "eax = " $eax "\necx = " $ecx "\n"
>
> will print something like this:
>
> eax = xx
> ecx = yy
>
> Now I did install 5.0 into a VM, and there the result for above example
> would have been just "Ambiguous", and I'm guessing now that this
> has not been working as in the example since import.
> My fix is limited to producing output just like in the example, but
> input requires more, as it needs escapes for everything not a-z,A-Z,0-9.
>
> > > > Should it work? I hope it would.
> >
> > What should work? Why do you hope? Maybe the manpage should be fixed?
> >
>
> Multiple [addr] arguments to p[rint], including support for strings,
> and i hope so because i would find it useful while testing/writing/porting
> drivers. Maybe, I do like "show struct", and have more than just
> the filtering diff for it, but it doesn't really work for the ad hoc
> usecases p[rint] seems so excellent for.
>
> > > Does feel like waste of time to go any further fixing this, if this is
> > > yet another bug too irrelevant for anyone to ack for, so _any_ input
> > > here would be great.
> >
> > Like I said, no bug is irrelevant, but if the one finding the bug, you
> > in this case, is not willing to properly explain the problem, then
> > better not send an email at all ;)
>
> Will try in the future.

Thanks for the explanation!

> haven't tested the diff below yet, but compared to previous, it should
> have working /modifierS.

IMHO we should just amend the man page and keep ddb(4) code simple.
Re: 6.3 amd64 panic: kernel diagnostic assertion in nd6.c
On 08/05/18(Tue) 22:26, Michael-John Turner wrote:
> [...]
> ndp info overwritten for fe80:d::b408:97aa:a658:760e by 40:85:1b:ab:69:d5 on
> vlan41
> ndp info overwritten for fe80:c::b408:97aa:a658:760e by c8:00:44:93:05:62 on
> vlan40

Could you post your routing table so we can understand which ND entries
are overwritten and if it is normal?
Re: (bug || timewaste)usr.bin/ctfconv: should vlen be 0 for CTF_K_ARRAYs ?
On 13/05/18(Sun) 05:36, Artturi Alm wrote:
> Hi,
>
>
> I was looking at fixing my code for ctf pprinting arrays in ddb(4),
> and came across ctf in section 5 man pages for freebsd with google,
> which lead me to wondering about this, and even think about possibility
> of an bug here, since the ctf(5)[0] mostly matches what i've seen so
> far in OpenBSD otherwise(didn't see direct asserts/ifs yet to make
> sure CTF_K_ARRAY is always handled in the ctf_stype short form thought).
>
> In it, under "Type Encoding" vlen is described like:
> +o The length of the variable data
>
> and under "Encoding of Arrays" has this:
> "Arrays, which are of type CTF_K_ARRAY, have no variable length arguments."
>
> so the above doesn't hold currently, should it?

You can check yourself by comparing the generated CTF from
devel/ctftools. If you find out we do not generate the same data as the
reference, then it's a bug.

>
> While nearly on-topic, is there any definitive docs for CTF?
> + typofix for making up the use of bugs@; sorry:)
>
> -Artturi
>
> [0] https://www.freebsd.org/cgi/man.cgi?query=ctf
>
>
> diff --git usr.bin/ctfconv/generate.c usr.bin/ctfconv/generate.c
> index e19094fe231..299c0d12eb6 100644
> --- usr.bin/ctfconv/generate.c
> +++ usr.bin/ctfconv/generate.c
> @@ -183,7 +183,7 @@ imcs_add_type(struct imcs *imcs, struct itype *it)
>  
>  	assert(it->it_type != CTF_K_UNKNOWN && it->it_type != CTF_K_FORWARD);
>  
> -	vlen = it->it_nelems;
> +	vlen = it->it_type != CTF_K_ARRAY ? it->it_nelems : 0;
>  	size = it->it_size;
>  	kind = it->it_type;
>  	root = 0;
> diff --git usr.bin/ctfconv/itype.h usr.bin/ctfconv/itype.h
> index 408a2140558..c4878f2783e 100644
> --- usr.bin/ctfconv/itype.h
> +++ usr.bin/ctfconv/itype.h
> @@ -36,7 +36,7 @@ struct itype {
>  	TAILQ_ENTRY(itype)	 it_symb;	/* itype: global queue of symbol */
>  	RB_ENTRY(itype)		 it_node;	/* itype: per-type tree of types */
>  
> -	SIMPLEQ_HEAD(, itref)	 it_refs;	/* itpye: backpointing refs */
> +	SIMPLEQ_HEAD(, itref)	 it_refs;	/* itype: backpointing refs */
>  
>  	TAILQ_HEAD(, imember)	 it_members;	/* itype: members of struct/union */
>  
>
Re: 6.3 amd64 panic: kernel diagnostic assertion in nd6.c
On 13/05/18(Sun) 23:16, Michael-John Turner wrote:
> Hi,
>
> On Thu, May 10, 2018 at 05:13:17PM +0200, Alexander Bluhm wrote:
> > When an IPv6 neighbor discovery timeout occurs, the kernel tries to
> > remove the NDP entry. It is stored in the routing table. The
> > problem is that this NDP route suddenly has a locally configured
> > address.
>
> Did you perhaps spot anything in the files I made available? The crashes
> have continued daily, I'm guessing when the problematic entry in the NDP
> table expires. I've tried tweaking various settings and have removed some of
> the unusual parts of my setup (moving some of the subnets which shared an
> interface onto their own VLANs, for example), but nothing has helped :(
> Same panic in the same location.
>
> Happy to provide any further information that you think may help diagnose
> the problem.
>
> Thanks in advance :)

Could you try the diff below and as soon as you see the message in the
dmesg, get the output of 'route -n show -inet6' and send us both?

Index: nd6.c
===================================================================
RCS file: /cvs/src/sys/netinet6/nd6.c,v
retrieving revision 1.224
diff -u -p -r1.224 nd6.c
--- nd6.c	2 May 2018 07:19:45 -0000	1.224
+++ nd6.c	14 May 2018 13:12:03 -0000
@@ -722,7 +722,16 @@ nd6_free(struct rtentry *rt)
 		}
 	}
 
-	KASSERT(!ISSET(rt->rt_flags, RTF_LOCAL));
+	if (ISSET(rt->rt_flags, RTF_LOCAL)) {
+		char ip[INET6_ADDRSTRLEN];
+
+		printf("%s: called for %s on %s\n", __func__,
+		    inet_ntop(AF_INET6, &satosin6(rt_key(rt))->sin6_addr, ip,
+		    sizeof(ip)),
+		    ifp->if_xname);
+		if_put(ifp);
+		return;
+	}
 
 	nd6_invalidate(rt);
 	/*
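
For anyone who wants to check what the address in that debug message will
look like, the formatting step can be tried from userland. This is only a
sketch: fmt_in6() is a made-up helper name, not something from the diff, and
it uses the libc inet_pton(3)/inet_ntop(3) rather than the kernel versions.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * fmt_in6() is a hypothetical helper for illustration only.  It parses a
 * textual IPv6 address and formats it back into buf, similar to how the
 * diff formats the route key before printing it.
 * Returns buf on success, NULL if the text is not a valid IPv6 address.
 */
static const char *
fmt_in6(const char *text, char *buf, socklen_t len)
{
	struct in6_addr addr;

	if (inet_pton(AF_INET6, text, &addr) != 1)
		return (NULL);
	return (inet_ntop(AF_INET6, &addr, buf, len));
}
```

A buffer of INET6_ADDRSTRLEN bytes, as declared in the diff, is large enough
for any address inet_ntop(3) can produce for AF_INET6.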
firefox 60.0 / "modesetting" / pledge
After upgrading to the latest packaged version of firefox, browsing became
unusable once again. This time the problem seems to be due to rendering, as
switching back to the "intel" driver made the rendering of the pages
normal again. With the "modesetting" driver it takes multiple seconds
and scrolling is not smooth.

$ pkg_info -q|grep firefox
firefox-60.0

On top of that, trying to download a file using the "save as" menu
results in a pledge problem:

firefox[35328]: pledge "getpw", syscall 33
firefox[35328]: pledge "stdio", syscall 87
firefox[91987]: pledge "getpw", syscall 33

Workaround:

$ cat /etc/X11/xorg.conf
Section "Device"
	Identifier "Device0"
	Driver "intel"
EndSection

OpenBSD 6.3-current (GENERIC.MP) #38: Wed May 9 17:38:06 MDT 2018
    dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 8238301184 (7856MB)
avail mem = 7980589056 (7610MB)
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 2.7 @ 0xccbfd000 (65 entries)
bios0: vendor LENOVO version "N14ET26W (1.04 )" date 01/23/2015
bios0: LENOVO 20BS006BGE
acpi0 at bios0: rev 2
acpi0: sleep states S0 S3 S4 S5
acpi0: tables DSDT FACP SLIC ASF!
HPET ECDT APIC MCFG SSDT SSDT SSDT SSDT SSDT SSDT SSDT SSDT SSDT SSDT PCCT SSDT UEFI MSDM BATB FPDT UEFI DMAR acpi0: wakeup devices LID_(S4) SLPB(S3) IGBE(S4) EXP2(S4) XHCI(S3) EHC1(S3) acpitimer0 at acpi0: 3579545 Hz, 24 bits acpihpet0 at acpi0: 14318179 Hz acpiec0 at acpi0 acpimadt0 at acpi0 addr 0xfee0: PC-AT compat cpu0 at mainbus0: apid 0 (boot processor) cpu0: Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz, 2295.09 MHz cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,RDSEED,ADX,SMAP,PT,IBRS,IBPB,STIBP,SENSOR,ARAT,MELTDOWN cpu0: 256KB 64b/line 8-way L2 cache cpu0: smt 0, core 0, package 0 mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges cpu0: apic clock running at 99MHz cpu0: mwait min=64, max=64, C-substates=0.2.1.2.4.1.1.1, IBE cpu1 at mainbus0: apid 1 (application processor) cpu1: Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz, 2294.70 MHz cpu1: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,RDSEED,ADX,SMAP,PT,IBRS,IBPB,STIBP,SENSOR,ARAT,MELTDOWN cpu1: 256KB 64b/line 8-way L2 cache cpu1: smt 1, core 0, package 0 cpu2 at mainbus0: apid 2 (application processor) cpu2: Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz, 2294.70 MHz cpu2: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,RDSEED,ADX,SMAP,PT,IBRS,IBPB,STIBP,SENSOR,ARAT,MELTDOWN cpu2: 256KB 64b/line 8-way L2 cache cpu2: smt 0, core 1, package 0 cpu3 at mainbus0: apid 3 (application processor) cpu3: Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz, 2294.70 MHz cpu3: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,RDSEED,ADX,SMAP,PT,IBRS,IBPB,STIBP,SENSOR,ARAT,MELTDOWN cpu3: 256KB 64b/line 8-way L2 cache cpu3: smt 1, core 1, package 0 ioapic0 at mainbus0: apid 2 pa 0xfec0, version 20, 40 pins acpimcfg0 at acpi0 addr 0xf800, bus 0-63 acpiprt0 at acpi0: bus 0 (PCI0) acpiprt1 at acpi0: bus -1 (PEG_) acpiprt2 at acpi0: bus 3 (EXP1) acpiprt3 at acpi0: bus 4 (EXP2) acpiprt4 at acpi0: bus -1 (EXP3) acpiprt5 at acpi0: bus -1 (EXP6) acpicpu0 at acpi0: C3(200@233 mwait.1@0x40), C2(200@148 mwait.1@0x33), C1(1000@1 mwait.1), PSS acpicpu1 at acpi0: C3(200@233 mwait.1@0x40), C2(200@148 mwait.1@0x33), C1(1000@1 mwait.1), PSS acpicpu2 at acpi0: C3(200@233 mwait.1@0x40), C2(200@148 mwait.1@0x33), C1(1000@1 mwait.1), PSS acpicpu3 at acpi0: C3(200@233 mwait.1@0x40), C2(200@148 mwait.1@0x33), C1(1000@1 mwait.1), PSS acpipwrres0 at acpi0: PUBS, resource for XHCI, EHC1 acpipwrres1 at acpi0: NVP3, resource for PEG_ acpipwrres2 at acpi0: NVP2, resource for PEG_ acpitz0 at acpi0: critical temperature is 128 degC acpib
Re: 6.3 just died (not for the first time)
On 16/05/18(Wed) 08:06, Harald Dunkel wrote: > Hi folks, Thanks for the report. > hopefully its allowed to repost this message here: > > One gateway running 6.3 ran into the debugger last night. Last words: > > login: kernel: protection fault trap, code=0 > Stopped at export_sa+0x5c: movl0(%rcx),%ecx > ddb{0}> show panic > the kernel did not panic > ddb{0}> trace > export_sa(10,800033445e70) at export_sa+0x5c > pfkeyv2_expire(813d4c00,813d4c00) at pfkeyv2_expire+0x14e > tdb_timeout(800033446020) at tdb_timeout+0x39 > softclock_thread(0) at softclock_thread+0xc6 > end trace frame: 0x0, count: -4 > ddb{0}> show registers > rdi 0x800033445e98 > rsi 0x813d4c00 > rbp 0x800033445e70 > rbx 0x800033445e98 > rdx 0x81abdff0cpu_info_full_primary+0x1ff0 > rcx 0xdeadbeefdeadbeef ^^ That means that the TDB has already been freed. This is possible because the timeout sleeps on the NET_LOCK(). Diff below should prevent that by introducing a tdb_reaper() function like we do in other parts of the stack. Index: netinet/ip_ipsp.c === RCS file: /cvs/src/sys/netinet/ip_ipsp.c,v retrieving revision 1.229 diff -u -p -r1.229 ip_ipsp.c --- netinet/ip_ipsp.c 6 Nov 2017 15:12:43 - 1.229 +++ netinet/ip_ipsp.c 16 May 2018 08:17:59 - @@ -79,10 +79,11 @@ void tdb_hashstats(void); #endif void tdb_rehash(void); -void tdb_timeout(void *v); -void tdb_firstuse(void *v); -void tdb_soft_timeout(void *v); -void tdb_soft_firstuse(void *v); +void tdb_reaper(void *); +void tdb_timeout(void *); +void tdb_firstuse(void *); +void tdb_soft_timeout(void *); +void tdb_soft_firstuse(void *); inttdb_hash(u_int, u_int32_t, union sockaddr_union *, u_int8_t); int ipsec_in_use = 0; @@ -541,14 +542,13 @@ tdb_timeout(void *v) { struct tdb *tdb = v; - if (!(tdb->tdb_flags & TDBF_TIMER)) - return; - NET_LOCK(); - /* If it's an "invalid" TDB do a silent expiration. 
*/ - if (!(tdb->tdb_flags & TDBF_INVALID)) - pfkeyv2_expire(tdb, SADB_EXT_LIFETIME_HARD); - tdb_delete(tdb); + if (tdb->tdb_flags & TDBF_TIMER) { + /* If it's an "invalid" TDB do a silent expiration. */ + if (!(tdb->tdb_flags & TDBF_INVALID)) + pfkeyv2_expire(tdb, SADB_EXT_LIFETIME_HARD); + tdb_delete(tdb); + } NET_UNLOCK(); } @@ -557,14 +557,13 @@ tdb_firstuse(void *v) { struct tdb *tdb = v; - if (!(tdb->tdb_flags & TDBF_SOFT_FIRSTUSE)) - return; - NET_LOCK(); - /* If the TDB hasn't been used, don't renew it. */ - if (tdb->tdb_first_use != 0) - pfkeyv2_expire(tdb, SADB_EXT_LIFETIME_HARD); - tdb_delete(tdb); + if (tdb->tdb_flags & TDBF_SOFT_FIRSTUSE) { + /* If the TDB hasn't been used, don't renew it. */ + if (tdb->tdb_first_use != 0) + pfkeyv2_expire(tdb, SADB_EXT_LIFETIME_HARD); + tdb_delete(tdb); + } NET_UNLOCK(); } @@ -573,13 +572,12 @@ tdb_soft_timeout(void *v) { struct tdb *tdb = v; - if (!(tdb->tdb_flags & TDBF_SOFT_TIMER)) - return; - NET_LOCK(); - /* Soft expirations. */ - pfkeyv2_expire(tdb, SADB_EXT_LIFETIME_SOFT); - tdb->tdb_flags &= ~TDBF_SOFT_TIMER; + if (tdb->tdb_flags & TDBF_SOFT_TIMER) { + /* Soft expirations. */ + pfkeyv2_expire(tdb, SADB_EXT_LIFETIME_SOFT); + tdb->tdb_flags &= ~TDBF_SOFT_TIMER; + } NET_UNLOCK(); } @@ -588,14 +586,13 @@ tdb_soft_firstuse(void *v) { struct tdb *tdb = v; - if (!(tdb->tdb_flags & TDBF_SOFT_FIRSTUSE)) - return; - NET_LOCK(); - /* If the TDB hasn't been used, don't renew it. */ - if (tdb->tdb_first_use != 0) - pfkeyv2_expire(tdb, SADB_EXT_LIFETIME_SOFT); - tdb->tdb_flags &= ~TDBF_SOFT_FIRSTUSE; + if (tdb->tdb_flags & TDBF_SOFT_FIRSTUSE) { + /* If the TDB hasn't been used, don't renew it. */ + if (tdb->tdb_first_use != 0) + pfkeyv2_expire(tdb, SADB_EXT_LIFETIME_SOFT); + tdb->tdb_flags &= ~TDBF_SOFT_FIRSTUSE; + } NET_UNLOCK(); } @@ -841,14 +838,6 @@ tdb_free(struct tdb *tdbp) ipo->ipo_last_searched = 0; /* Force a re-search. */ } - /* Remove expiration timeouts. 
*/ - tdbp->tdb_flags &= ~(TDBF_FIRSTUSE | TDBF_SOFT_FIRSTUSE | TDBF_TIMER | - TDBF_SOFT_TIMER); - timeout_del(&tdbp->tdb_timer_tmo); - timeout_del(&tdbp->tdb_first_tmo);
Re: protection fault after fatfingering address
On 20/05/18(Sun) 21:10, Alexander Bluhm wrote:
> On Sun, May 20, 2018 at 07:24:05AM +0200, p...@centroid.eu wrote:
> > http://centroid.eu/private/p523.jpg
>
> ml_enqueue+0x11
> /usr/src/sys/kern/uipc_mbuf.c:1498
> *  33a1: 48 89 71 08    mov %rsi,0x8(%rcx)
>    33a5: eb 07          jmp 33ae
>
> 1492 void
> 1493 ml_enqueue(struct mbuf_list *ml, struct mbuf *m)
> 1494 {
> 1495         if (ml->ml_tail == NULL)
> 1496                 ml->ml_head = ml->ml_tail = m;
> 1497         else {
> * 1498               ml->ml_tail->m_nextpkt = m;
> 1499                 ml->ml_tail = m;
> 1500         }
> 1501
> 1502         m->m_nextpkt = NULL;
> 1503         ml->ml_len++;
> 1504 }
>
> arpresolve+0x1bf
> /usr/src/sys/netinet/if_ether.c:383
>    954: 4c 89 ff                mov %r15,%rdi
>    957: 4c 89 e6                mov %r12,%rsi
>    95a: e8 00 00 00 00          callq 95f
> /usr/src/sys/netinet/if_ether.c:384
> *  95f: 83 04 25 00 00 00 00    addl $0x1,0x0
>
>   373         la = (struct llinfo_arp *)rt->rt_llinfo;
>   374         KASSERT(la != NULL);
>   375         if (la_hold_total < LA_HOLD_TOTAL && la_hold_total < nmbclust / 64) {
>   376                 struct mbuf *mh;
>   377
>   378                 if (ml_len(&la->la_ml) >= LA_HOLD_QUEUE) {
>   379                         mh = ml_dequeue(&la->la_ml);
>   380                         la_hold_total--;
>   381                         m_freem(mh);
>   382                 }
> * 383                 ml_enqueue(&la->la_ml, m);
>   384                 la_hold_total++;
>   385         } else {
>   386                 la_hold_total -= ml_purge(&la->la_ml);
>   387                 m_freem(m);
>   388         }
>
> So the kernel crashes when it accesses the mbuf_list in the struct
> llinfo_arp.
>
> > route change default -inet6 2001:db8:0:40::300
>
> As the address families of the route is messed up, I guess that the
> cast in line 373 is wrong. The data structure is a llinfo_nd6 and
> not a llinfo_arp.
>
> I could not reproduce the crash, but my kernel accepts an IPv6
> gateway for the IPv4 default route. This kernel diff prevents that
> user land can add or change such routes.
>
> root@v74:.../~# route change default -inet6 fdd7:e83e:66bc:74::1234
> change net default: gateway fdd7:e83e:66bc:74::1234: Address family not
> supported by protocol family

Are you sure this change won't introduce a regression with L2 route
entries? These entries generally have an Ethernet address as gateway.

In any case it would be nice to add this problem to the route regression
test.

> Index: net/rtsock.c
> ===================================================================
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/rtsock.c,v
> retrieving revision 1.265
> diff -u -p -r1.265 rtsock.c
> --- net/rtsock.c	14 May 2018 07:33:59 -0000	1.265
> +++ net/rtsock.c	20 May 2018 19:02:08 -0000
> @@ -718,6 +718,14 @@ route_output(struct mbuf *m, struct sock
>  			info.rti_flags |= RTF_LLINFO;
>  	}
>  
> +	if (info.rti_info[RTAX_DST] != NULL &&
> +	    info.rti_info[RTAX_GATEWAY] != NULL &&
> +	    info.rti_info[RTAX_DST]->sa_family !=
> +	    info.rti_info[RTAX_GATEWAY]->sa_family) {
> +		error = EAFNOSUPPORT;
> +		goto fail;
> +	}
> +
>  	/*
>  	 * Validate RTM_PROPOSAL and pass it along or error out.
>  	 */
>
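
The check in the quoted diff can be looked at in isolation. The following is
only a userland sketch of the same comparison; check_route_families() is a
made-up name and nothing here is the actual rtsock.c code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

/*
 * Hypothetical helper, not the kernel code: return 0 when destination and
 * gateway share an address family, or when either one is absent, and
 * EAFNOSUPPORT when both are present and their families differ.
 */
static int
check_route_families(const struct sockaddr *dst, const struct sockaddr *gw)
{
	if (dst != NULL && gw != NULL && dst->sa_family != gw->sa_family)
		return (EAFNOSUPPORT);
	return (0);
}
```

Because the comparison is strict, a link-layer gateway attached to an inet
destination would presumably be rejected by it as well, which is the L2
regression asked about above.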
Re: 6.3 amd64 panic: kernel diagnostic assertion in nd6.c
On 17/05/18(Thu) 21:30, Michael-John Turner wrote:
> On Mon, May 14, 2018 at 03:13:12PM +0200, Martin Pieuchot wrote:
> > Could you try the diff below and as soon as you see the message in the
> > dmesg, get the output of 'route -n show -inet6' and send us both?
>
> It's happened a few times since applying the patch but I've finally managed
> to get the route output at the right moment, as requested. The messages in
> dmesg_ndp_issue.txt have been flooding the message buffer for the last ~40
> minutes or so.
>
> As the text may wrap a bit oddly if posted to the list, I've placed the
> files here:
> http://dl.rsx11.net/misc/dmesg_ndp_issue.txt
> http://dl.rsx11.net/misc/ndp_ndp_issue.txt
> http://dl.rsx11.net/misc/netstat_ndp_issue.txt
> http://dl.rsx11.net/misc/route_ndp_issue.txt
>
> Any ideas what could be causing the problem?

No, because you didn't send your dmesg. I need the full dmesg; the
important part from your original message was:

ndp info overwritten for fe80:d::b408:97aa:a658:760e by 40:85:1b:ab:69:d5 on vlan41
ndp info overwritten for fe80:c::b408:97aa:a658:760e by c8:00:44:93:05:62 on vlan40

I need the corresponding information for the output you provided above.

I'm guessing the in-kernel state machine tries to overwrite a RTF_LOCAL
address and that should not happen.
Race between dup2(2) and accept(2)
If a process exit(3)s while one of its threads is blocking in accept(2)
and the half-opened descriptor has already been dup'ed, we get the
following panic:

panic: closef: count (1) < 2
Stopped at db_enter+0x5: popq %rbp
TID PID UID PRFLAGS PFLAGS CPU COMMAND
*204115 80020 0 0x10030x80K dup2_accept
db_enter() at db_enter+0x5
panic() at panic+0x120
closef(ff000583d948,8000a020) at closef+0x145
doaccept(1e0,8000a020,1e,839a03c0,7f7c9d58,bc7efe80dd509fa1) at doaccept+0x2a3
syscall(1) at syscall+0x31d
Xsyscall_untramp(6,0,0,0,0,1e) at Xsyscall_untramp+0xc0
end of kernel

A test for this problem can be found here:
https://marc.info/?l=openbsd-tech&m=152637351632752&w=2

Diff below prevents the problem by returning EBUSY in dup2(2) & friends,
like Linux does when trying to dup onto a half-opened file.

I'd like to reuse this logic to keep the future locking simple, ok?

Index: sys/kern/kern_descrip.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_descrip.c,v
retrieving revision 1.158
diff -u -p -r1.158 kern_descrip.c
--- sys/kern/kern_descrip.c	8 May 2018 09:03:58 -0000	1.158
+++ sys/kern/kern_descrip.c	21 May 2018 12:12:50 -0000
@@ -634,13 +634,14 @@ finishdup(struct proc *p, struct file *f
 		return (EDEADLK);
 	}
 
-	/*
-	 * Don't fd_getfile here. We want to closef LARVAL files and
-	 * closef can deal with that.
-	 */
 	oldfp = fdp->fd_ofiles[new];
-	if (oldfp != NULL)
+	if (oldfp != NULL) {
+		if (!FILE_IS_USABLE(oldfp)) {
+			FRELE(fp, p);
+			return (EBUSY);
+		}
 		FREF(oldfp);
+	}
 
 	fdp->fd_ofiles[new] = fp;
 	fdp->fd_ofileflags[new] = fdp->fd_ofileflags[old] & ~UF_EXCLOSE;
Index: lib/libc/sys//dup.2
===================================================================
RCS file: /cvs/src/lib/libc/sys/dup.2,v
retrieving revision 1.18
diff -u -p -r1.18 dup.2
--- lib/libc/sys//dup.2	10 Dec 2014 19:46:48 -0000	1.18
+++ lib/libc/sys//dup.2	21 May 2018 12:12:38 -0000
@@ -157,6 +157,10 @@ is not a valid active descriptor or
 is negative or greater than or equal to the process's
 .Dv RLIMIT_NOFILE
 limit.
+.It Bq Er EBUSY
+A race condition with
+.Xr accept 2
+has been detected.
 .It Bq Er EINTR
 An interrupt was received.
 .It Bq Er EIO
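
To make the intended behaviour easier to follow, here is a toy model of the
rule: duplicating onto a slot whose file is not yet usable fails with EBUSY
instead of replacing the file. All names (struct m_file, struct m_fdtable,
model_dup2()) are invented for this sketch; it is not the finishdup() code
and it leaves out locking and the real reference counting.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define M_NSLOTS	8

/* usable == 0 stands for a half-opened file, e.g. one accept(2) is setting up */
struct m_file {
	int refs;
	int usable;
};

struct m_fdtable {
	struct m_file *slot[M_NSLOTS];
};

/* Hypothetical model of dup2(): returns 0 or an errno value. */
static int
model_dup2(struct m_fdtable *t, int from, int to)
{
	struct m_file *fp, *old;

	if (from < 0 || from >= M_NSLOTS || to < 0 || to >= M_NSLOTS)
		return (EBADF);
	fp = t->slot[from];
	if (fp == NULL || !fp->usable)
		return (EBADF);
	if (from == to)
		return (0);

	old = t->slot[to];
	if (old != NULL) {
		if (!old->usable)
			return (EBUSY);	/* target is still half-opened */
		old->refs--;		/* stands in for closing the old file */
	}
	fp->refs++;
	t->slot[to] = fp;
	return (0);
}
```

In the model the half-opened file stays in its slot, so whoever is still
setting it up can finish or clean up without the table changing underneath.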
Re: protection fault trap with OpenBSD 6.3
On 28/05/18(Mon) 22:24, Marc Peters wrote:
> Hi List,
>
> i am having issues with OpenBSD 6.3, latest patches as of today applied. We
> are using gif-tunnels between our datacenters, transport encryption and
> OpenBGPD to announce the prefixes between the datacenters. The boxes also
> have isakmpd tunnels on a carp interface to AWS and GCP. The setup is working
> fine with existing 6.1 boxes and there's no problem in pushing/receiving
> several 100MBit/s (according to observium snmpd data, which gets constantly
> collected). Switching the traffic to the 6.3 hosts, we get a freeze on one of
> the boxes after about 45 minutes of transferring traffic (all IPv4 traffic in
> our case for now):

This has been fixed in -current.
Re: bsd.mp hits witness panic under vmm (single CPU)
On 07/06/18(Thu) 19:22, Philip Guenther wrote:
> On Thu, 7 Jun 2018, Mike Larkin wrote:
> > Is this a panic inside the guest in vmm, or is this the host panicking when
> > you're doing something while a VM is running in vmm on that host?
> >
> > Can't really tell from the trace here...
>
> This was a guest panicking. visa@ thinks this is the same intr_legacy8
> panic as reported previously.

It is. This is not a new issue. We know legacy interrupts are not
mpsafe.
Re: Assertion failure when adding point-to-point routes to interfaces in rdomain with deleted loopback
On 06/06/18(Wed) 16:21, multiplexd wrote:
> >Synopsis: Assertion failure when adding point-to-point routes to
> >interfaces in rdomain with deleted loopback
> >Category: Reliability
> >Environment:
> 	System : OpenBSD 6.3
> 	Details : OpenBSD 6.3 (GENERIC) #3: Thu May 17 23:54:13 CEST 2018
>
> 	r...@syspatch-63-amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC
>
> 	Architecture: OpenBSD.amd64
> 	Machine : amd64 (see Description)
> >Description:
>
> Adding a route to a point-to-point interface such as gre(4) or tun(4) where
> the interface is in a
> non-default rdomain and the loopback device for the given rdomain has been
> destroyed will trigger a
> kernel assertion failure, causing a system crash.
>
> This issue has been observed and reproduced on both an amd64 system (virtual
> machine on a Debian 9
> host) and a macppc system (iBook G4).
>
> >How-To-Repeat:
>
> 1) Create a new loopback device in a non-default rdomain. Example:
>
> # ifconfig lo2 rdomain 2
>
> 2) The following two steps can be performed in any order.
> 2a) Create a point-to-point interface. The following example creates a new
> tun(4) interface,
> though this has also been reproduced with a gre(4) interface.
>
> # ifconfig tun0 rdomain 2
>
> 2b) Delete the loopback device associated with the rdomain.
>
> # ifconfig lo2 -rdomain destroy
>
> 3) Add a route to the point-to-point interface, e.g.
>
> # ifconfig tun0 inet 192.168.200.1 192.168.200.2
>
> The system will crash and drop to a ddb(4) prompt.
>
> An example session is shown below:
>
> bsd00# ifconfig lo2 rdomain 2
> bsd00# ifconfig tun0 rdomain 2
> bsd00# ifconfig lo2 -rdomain destroy
> bsd00# ifconfig tun0 inet 192.168.200.1 192.168.200.2
> panic: kernel diagnostic assertion "lo0ifp != NULL" failed: file
> "/usr/src/sys/net/if.c", line 1483

Thanks for the report, could you try the diff below?

Index: net/if.c
===================================================================
RCS file: /cvs/src/sys/net/if.c,v
retrieving revision 1.554
diff -u -p -r1.554 if.c
--- net/if.c	30 May 2018 22:20:41 -0000	1.554
+++ net/if.c	14 Jun 2018 12:36:20 -0000
@@ -1765,9 +1765,11 @@ if_setrdomain(struct ifnet *ifp, int rdo
 	if (rdomain != rtable_l2(rdomain))
 		return (EINVAL);
 
-	/* remove all routing entries when switching domains */
-	/* XXX this is a bit ugly */
 	if (rdomain != ifp->if_rdomain) {
+		if ((ifp->if_flags & IFF_LOOPBACK) &&
+		    (ifp->if_index == rtable_loindex(ifp->if_rdomain)))
+			return (EPERM);
+
 		s = splnet();
 		/*
 		 * We are tearing down the world.
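
The effect of the new check can be shown with a small model: the loopback
interface that backs an rdomain may not be moved out of it, while other
interfaces still can. struct m_ifnet, M_LOOPBACK and model_setrdomain() are
invented for this sketch, and the loopback index is passed in as an argument
instead of being looked up with rtable_loindex().

```c
#include <assert.h>
#include <errno.h>

#define M_LOOPBACK	0x8	/* made-up flag value for the model */

struct m_ifnet {
	unsigned int index;
	int flags;
	int rdomain;
};

/*
 * Hypothetical model, not the kernel's if_setrdomain(): lo_index is the
 * index of the loopback interface backing ifp's current rdomain.
 */
static int
model_setrdomain(struct m_ifnet *ifp, int rdomain, unsigned int lo_index)
{
	if (rdomain != ifp->rdomain) {
		if ((ifp->flags & M_LOOPBACK) && ifp->index == lo_index)
			return (EPERM);
		ifp->rdomain = rdomain;
	}
	return (0);
}
```

In terms of the report, step 2b ("ifconfig lo2 -rdomain destroy") would then
hit the EPERM case first, if the model matches what the diff does.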
Re: Assertion failure when adding point-to-point routes to interfaces in rdomain with deleted loopback
On 16/06/18(Sat) 23:31, multiplexd wrote:
> [...]
> As a supplementary question, is it intended that (non-default) rdomains
> cannot be "deleted" at runtime after they have been created?

Let's say that deletion hasn't been implemented.
Re: kernel_lock not locked
On 28/06/18(Thu) 14:53, Visa Hankala wrote:
> On Wed, Jun 27, 2018 at 08:46:04PM +0200, Landry Breuil wrote:
> > On Wed, Jun 27, 2018 at 05:37:54PM +0100, Laurence Tratt wrote:
> > > >Synopsis:	kernel_lock not locked
> > > >Category:	kernel
> > > >Environment:
> > > 	System : OpenBSD 6.3
> > > 	Details : OpenBSD 6.3-current (GENERIC.MP) #55: Mon Jun 25 23:01:52
> > > MDT 2018
> > >
> > > dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> > >
> > > 	Architecture: OpenBSD.amd64
> > > 	Machine : amd64
> > > >Description:
> > > I just hit the following kernel panic (a locking error in sched_bsd.c):
> > >
> > > https://imagebin.ca/v/46kV6Tfqe1sc
> > >
> > > I can hit this repeatedly by gdb'ing the new quodlibet 4.1.0 update that
> > > Stuart just pushed to ports. It crashes at load; exactly at the point I
> > > quit gdb the kernel panics. Here's the userland trace I get just before
> > > the kernel panic occurs:
> >
> > Fwiw, i've hit a similar panic (kernel_lock not locked) this weekend (on an
> > up
> > to date kernel) when using egdb on ... firefox, of course.
>
> There is a locking bug that gets triggered when a traced and stopped
> multithreaded process is forced to exit. When the bug hits, a thread
> calls exit1() with the kernel locked recursively:
>
> sched_exit
> exit1
> single_thread_check
> single_thread_set
> issignal <-- KERNEL_LOCK()
> userret <-- KERNEL_LOCK()
> syscall
> Xsyscall_untramp
>
> sched_exit() assumes that a single KERNEL_UNLOCK() releases the lock
> completely. However, the assumption is wrong in the above case.
> sched_exit() switches to the CPU's idle thread, which in turn calls
> mi_switch(). Then, mi_switch() tries to release the kernel lock (which
> is bound to the CPU, and which should not be locked in the first place).
> That causes a panic with WITNESS because WITNESS had associated the lock
> with the exiting thread and the lock is not found in the idle thread's
> lock list. That is why the panic's stack trace looks peculiar:
>
> panic
> witness_unlock
> ___mp_release_all
> mi_switch
> sched_idle
>
> Without WITNESS, the system would hang soon instead.
>
> The bug can be fixed by making sched_exit() release the kernel lock
> completely. That would also make exit1() more agnostic with regard to
> the state of the lock. As an alternative, issignal() could avoid the
> recursive locking.
>
> Comments? OK?

Thanks for your analysis. So this is a regression introduced by the fix
for the previous TOCTOU race.

The kernel is currently grabbing the KERNEL_LOCK() in userret() to
serialize access to `ps_sigact'. In the future we'll want to use finer
locks. So my question is which fix goes in that direction? The one you
posted or not grabbing the KERNEL_LOCK() in userret()?

If it doesn't matter, then I believe you should commit your fix, it is
ok mpi@.

> Index: kern/kern_sched.c
> ===================================================================
> RCS file: src/sys/kern/kern_sched.c,v
> retrieving revision 1.48
> diff -u -p -r1.48 kern_sched.c
> --- kern/kern_sched.c	19 Jun 2018 19:29:52 -0000	1.48
> +++ kern/kern_sched.c	28 Jun 2018 13:47:28 -0000
> @@ -218,8 +218,11 @@ sched_exit(struct proc *p)
>  
>  	LIST_INSERT_HEAD(&spc->spc_deadproc, p, p_hash);
>  
> +#ifdef MULTIPROCESSOR
>  	/* This process no longer needs to hold the kernel lock. */
> -	KERNEL_UNLOCK();
> +	KERNEL_ASSERT_LOCKED();
> +	__mp_release_all(&kernel_lock);
> +#endif
>  
>  	SCHED_LOCK(s);
>  	idle = spc->spc_idleproc;
>
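
The point about recursive locking in the quoted analysis can be illustrated
with a toy model. This is not the kernel's __mp_lock; struct rlock and the
rlock_*() names are made up, and the model only tracks the recursion depth
of a single owner. It shows that leaving a lock that was entered twice once
still keeps it held, whereas a release-all drops every level.

```c
#include <assert.h>

/* Toy recursive lock: only the owner's recursion depth is tracked. */
struct rlock {
	int depth;	/* how many times the owner has entered the lock */
};

static void
rlock_enter(struct rlock *l)
{
	l->depth++;
}

static void
rlock_leave(struct rlock *l)
{
	assert(l->depth > 0);
	l->depth--;
}

/* Drop every level at once; returns the depth that was held. */
static int
rlock_release_all(struct rlock *l)
{
	int held = l->depth;

	l->depth = 0;
	return (held);
}

static int
rlock_held(const struct rlock *l)
{
	return (l->depth > 0);
}
```

With two enters, as in the issignal()/userret() case described above, a
single rlock_leave() leaves rlock_held() true; that is the situation
sched_exit() ends up in when it assumes one unlock is enough.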
Re: panic: vinvalbuf: dirty bufs
On 06/07/18(Fri) 12:49, Alexander Bluhm wrote:
> On Mon, May 07, 2018 at 05:21:19PM +0200, Alexander Bluhm wrote:
> > panic: vinvalbuf: dirty bufs
>
> At least I know what is going on here.
>
> vinvalbuf() calls ffs_fsync() to write all dirty buffers of the
> mount point to disk.
>
> 	if ((error = VOP_FSYNC(vp, cred, MNT_WAIT, p)) != 0)
> 		return (error);
>
> ffs_fsync() does this successfully and verifies that there are no
> dirty blocks left.
>
> 	if (!LIST_EMPTY(&vp->v_dirtyblkhd)) {
>
> But then it calls ufs_update() to write the inode to disk. It waits
> until the disk operation has finished.
>
> 	return (UFS_UPDATE(VTOI(vp), ap->a_waitfor == MNT_WAIT));
>
> My test is still running a cp -r and rm -rf operating on the file
> system. While bread() or bwrite() sleeps in the unmount process,
> the rm process inserts a new dirty block into the vnode's list.

So we might need a barrier or a delayed free to fix this problem.

It would be nice to know where the 'cp' and 'rm' processes are blocking
when the 'unmount' process goes to sleep. You could put a breakpoint
before UFS_UPDATE() and use 'ps /up 0t$PID' to get this information.

Another interesting piece of information is whether at least one of the
two processes already has a reference to `i_devvp'.
Re: Kernel panic: "kernel page fault", "uvm_fault(...)", "x86_ipi_db(...)"
On 20/07/18(Fri) 03:12, Mike Larkin wrote:
> On Wed, Jul 18, 2018 at 11:34:41PM +, Romain wrote:
> > > I'm wondering if this is due to the fact that we detach usb(4) devices on
> > > suspend. Looks like this may be trying to process a timeout that corresponds
> > > to a device that is no longer attached. Maybe the urtwn(4)?

Well, the device is detaching just after re-attaching. So it must be
something different. But I agree with your assumption that it is
related to urtwn(4).

The problem seems to be a use-after-free of a timeout. The question is
which timeout? Is it in urtwn(4)? In ic/rtwn.c? In the wireless stack?
In the network stack?

Our timeout_add(9) interface is simple but doesn't help to debug such
issues.
Re: Kernel panic: "kernel page fault", "uvm_fault(...)", "x86_ipi_db(...)"
On 20/07/18(Fri) 14:32, Theo de Raadt wrote:
> Martin Pieuchot wrote:
>
> > On 20/07/18(Fri) 03:12, Mike Larkin wrote:
> > > On Wed, Jul 18, 2018 at 11:34:41PM +, Romain wrote:
> > > > > I'm wondering if this is due to the fact that we detach usb(4) devices on
> > > > > suspend. Looks like this may be trying to process a timeout that corresponds
> > > > > to a device that is no longer attached. Maybe the urtwn(4)?
> >
> > Well the device is detaching just after re-attaching. So it must be
> > something different. But I agree with your assumption that it is
> > related to urtwn(4).
> >
> > The problem seems to be a use-after-free of a timeout. The question is
> > which timeout? Is it in urtwn(4)? In ic/rtwn.c? In the wireless stack?
> > In the network stack?
> >
> > Our timeout_add(9) interface is simple but doesn't help to debug such
> > issue.
>
> Is it a timeout not removed during detach?

It might be that, or a timeout re-attached after being removed because
there's a race somewhere...

That's not the only place where we have such a problem. If somebody has
an idea or a floating diff to ease timeout debugging, now is the moment
to speak (:
Re: uaudio device works on usb2 port; fails on usb3 port
On 18/08/20(Tue) 18:53, Marcus Glocker wrote: > On Wed, 12 Aug 2020 21:39:15 +0200 > Marcus Glocker wrote: > > > jmc was so nice to send me his trouble device over to do some further > > investigations. Just some updates on what I've noticed today: > > > > - The issue isn't specific to xhci(4). I also see the same issue on > > some of my ehci(4) machines when attaching this device. > > > > - It seems like the device gets in to an 'corrupted state' after > > running a couple of control transfer against it. Initially they > > work fine, with smaller and larger transfer sizes, and at one point > > the device hangs up and doesn't recover until re-attaching it. > > While on some ehci(4) machines the uhidev(4) attach works fine, after > > running lsusb against the device, I see transfer errors coming up > > again; On xhci(4) namely XHCI_CODE_TXERR. > > > > - Attaching an USB 2.0 hub doesn't make any difference, no matter if > > attached to an xhci(4) or an ehci(4) controller. > > > > Not sure what is going wrong with this little beast ... > > OK, I give up :-) Following my summary report. > > This device seems to have issues with two control request types: > > - UR_GET_STATUS, not called for this device from the kernel in the > default code path. But e.g. 'lsusb -v' will call it. > > - UR_SET_IDLE, as called in uhidev_attach(). > > UR_GET_STATUS will stall the device for good on *all* controller > drivers. Does this also happen when the device attaches as ugen(4)? If yes that would rules out concurrency issues that might happen when using lsusb(1) while other transfers are in fly. To test you need to disable the current attaching driver in ukc. > UR_SET_IDLE works only on ehci(4) - Don't ask me why. > On all the other controller drivers the following UR_GET_REPORT request > will fail, stalling the device as well. I tried all kind of things to > get the UR_SET_IDLE request working on xhci(4), but without any luck. Does the device respond to GET_IDLE? 
It it a timing problem? How much time does the device need to be idle? Does introducing a delay before and/or after usbd_set_idle() change the behavior? Did you try passing a non-0 duration parameter to the SET_IDLE command? Taking a step back, why does a uaudio(4) needs a UR_SET_IDLE? This tells the device to only respond to IN interrupt transfers when new events occur, right? Does all devices attaching to uhidev want this behavior? > The good news is that when we skip the UR_SET_IDLE request on xhci(4), > the following UR_GET_REPORT request works, and isoc transfers also work > perfectly fine. You can use the device for audio streaming. > > Therefore the only thing I can offer is a quirk to skip the > UR_SET_IDLE request when attaching this device. On ehci(4) the > device continues to work as before with this quirk. Therefore I > didn't include any code to only apply the quirk on non-ehci > controllers. > > I know it's not a nice solution, but at least it makes this device > usable on xhci(4) while not impacting other things. Maybe it is a step towards a real solution. Should usbd_set_idle() stay in uhidev(4) or, if it doesn't make sense for all devices, should we move it in child drivers like ukbd(4), etc? > If anyone is OK with that and has no better idea how to fix it, I'm > happy to commit. 
> > Cheers, > Marcus > > > Index: uhidev.c > === > RCS file: /cvs/src/sys/dev/usb/uhidev.c,v > retrieving revision 1.80 > diff -u -p -u -p -r1.80 uhidev.c > --- uhidev.c 31 Jul 2020 10:49:33 - 1.80 > +++ uhidev.c 18 Aug 2020 13:36:13 - > @@ -151,7 +151,8 @@ uhidev_attach(struct device *parent, str > sc->sc_ifaceno = uaa->ifaceno; > id = usbd_get_interface_descriptor(sc->sc_iface); > > - usbd_set_idle(sc->sc_udev, sc->sc_ifaceno, 0, 0); > + if (!(usbd_get_quirks(uaa->device)->uq_flags & UQ_NO_SET_IDLE)) > + usbd_set_idle(sc->sc_udev, sc->sc_ifaceno, 0, 0); > > sc->sc_iep_addr = sc->sc_oep_addr = -1; > for (i = 0; i < id->bNumEndpoints; i++) { > Index: usb_quirks.c > === > RCS file: /cvs/src/sys/dev/usb/usb_quirks.c,v > retrieving revision 1.76 > diff -u -p -u -p -r1.76 usb_quirks.c > --- usb_quirks.c 5 Jan 2020 00:54:13 - 1.76 > +++ usb_quirks.c 18 Aug 2020 13:36:13 - > @@ -52,6 +52,7 @@ const struct usbd_quirk_entry { > u_int16_t bcdDevice; > struct usbd_quirks quirks; > } usb_quirks[] = { > + { USB_VENDOR_MICROCHIP, USB_PRODUCT_MICROCHIP_SOUNDKEY, ANY, { > UQ_NO_SET_IDLE }}, { USB_VENDOR_KYE, USB_PRODUCT_KYE_NICHE, > 0x100, { UQ_NO_SET_PROTO}}, { USB_VENDOR_INSIDEOUT, > USB_PRODUCT_INSIDEOUT_EDGEPORT4, 0x094, { UQ_SWAP_UNICODE}}, > Index: usb_quirks.h > === > RCS file: /cvs/src/sys/dev/usb/usb_quirks.h,v > retrieving r
Re: uaudio device works on usb2 port; fails on usb3 port
On 21/08/20(Fri) 11:46, Marcus Glocker wrote: > On Wed, 19 Aug 2020 20:31:05 +0200 > Marcus Glocker wrote: > > > On Wed, Aug 19, 2020 at 01:21:35PM +0200, Marcus Glocker wrote: > > > > > On Wed, 19 Aug 2020 12:02:23 +0200 > > > Martin Pieuchot wrote: > > > > > > > On 18/08/20(Tue) 18:53, Marcus Glocker wrote: > > > > > On Wed, 12 Aug 2020 21:39:15 +0200 > > > > > Marcus Glocker wrote: > > > > > > > > > > > jmc was so nice to send me his trouble device over to do some > > > > > > further investigations. Just some updates on what I've > > > > > > noticed today: > > > > > > > > > > > > - The issue isn't specific to xhci(4). I also see the same > > > > > > issue on some of my ehci(4) machines when attaching this > > > > > > device. > > > > > > > > > > > > - It seems like the device gets in to an 'corrupted state' > > > > > > after running a couple of control transfer against it. > > > > > > Initially they work fine, with smaller and larger transfer > > > > > > sizes, and at one point the device hangs up and doesn't > > > > > > recover until re-attaching it. While on some ehci(4) machines > > > > > > the uhidev(4) attach works fine, after running lsusb against > > > > > > the device, I see transfer errors coming up again; On > > > > > > xhci(4) namely XHCI_CODE_TXERR. > > > > > > > > > > > > - Attaching an USB 2.0 hub doesn't make any difference, no > > > > > > matter if attached to an xhci(4) or an ehci(4) controller. > > > > > > > > > > > > Not sure what is going wrong with this little beast ... > > > > > > > > > > OK, I give up :-) Following my summary report. > > > > > > > > > > This device seems to have issues with two control request types: > > > > > > > > > > - UR_GET_STATUS, not called for this device from the kernel > > > > > in the default code path. But e.g. 'lsusb -v' will call it. > > > > > > > > > > - UR_SET_IDLE, as called in uhidev_attach(). > > > > > > > > > > UR_GET_STATUS will stall the device for good on *all* controller > > > > > drivers. 
> > > > > > > > Does this also happen when the device attaches as ugen(4)? If yes > > > > that would rules out concurrency issues that might happen when > > > > using lsusb(1) while other transfers are in fly. To test you > > > > need to disable the current attaching driver in ukc. > > > > > > Yes, it does also happen when attaching the device to ugen(4). > > > But honestly, I was playing around yesterday evening a bit further > > > with this device, and I noticed that the device also stalls with > > > lsusb when I remove the get status and get report request in the > > > lsusb code. > > > > > > Therefore I need to correct my statement, saying instead that *some* > > > request in lsusb makes the device stall as well. What I just found > > > in the lsusb ChangeLog: > > > > > > Added (somewhat dummy) Set_Protocol and Set_Idle requests to > > > stream dumping setup. > > > > > > I'll try to confirm if the stall really happens there. At least > > > that would be in line with our findings in the kernel. > > > > OK, I've tracked the two lsusb requests down finally which also stall > > this device beside our set idle call in the kernel. > > > > UR_GET_DESCRIPTOR, UDESC_DEVICE_QUALIFIER: > > > > ret = usb_control_msg(fd, LIBUSB_ENDPOINT_IN | > > LIBUSB_REQUEST_TYPE_STANDARD | LIBUSB_RECIPIENT_DEVICE, > > LIBUSB_REQUEST_GET_DESCRIPTOR, > > USB_DT_DEBUG << 8, 0, > > buf, sizeof buf, CTRL_TIMEOUT); > > > > UR_GET_DESCRIPTOR, UDESC_DEBUG: > > > > ret = usb_control_msg(fd, LIBUSB_ENDPOINT_IN | > > LIBUSB_REQUEST_TYPE_STANDARD | LIBUSB_RECIPIENT_DEVICE, > > LIBUSB_REQUEST_GET_DESCRIPTOR, > > USB_DT_DEBUG << 8, 0, > > buf, sizeof buf, CTRL_TIMEOUT); > > > > When you comment those two control requests out, lsusb -v runs > > through. > > > > If I wouldn't know better, I would say that this device
Re: VPS crash to kernel panic on boot
On 25/11/20(Wed) 19:41, AIsha Tammy wrote:
> Replicable bug that has happened from sysupgrading to snapshot.
> VPS was working perfectly until this sysupgrade.
>
> VPS boots - drops to kernel panic ddb
>
> Seems to be some mutex issue?
> Had to manually copy information cuz weird web console, so my apologies
> if this isn't enough information.

What is the date of the snapshots? If you can reproduce this could you
give us the output of the "trace" command?

Thanks,
Martin
Re: kernel panic when removing interface
On 24/11/20(Tue) 09:23, Pierre Emeriaud wrote: > > Trying to use mgre(4), I found what looks like a reliable way to crash > > the kernel which might be of interest. > > > > This machine is a one-month-old-current fairly light router, with inet > > default within rdomain 1. I will upgrade to a more recent snap > > shortly. > > I just upgraded to OpenBSD 6.8-current (GENERIC) #181: Mon Nov 23 > 20:55:15 MST 2020 and the same thing happens with vlan(4): > > $ doas ifconfig vlan12 inet 192.0.2.1/24 parent vio0 vnetid 12 > $ ifconfig vlan > vlan12: flags=8843 mtu 1500 > lladdr 02:00:00:ef:3d:d7 > index 8 priority 0 llprio 3 > encap: vnetid 12 parent vio0 txprio packet rxprio outer > groups: vlan > media: Ethernet autoselect > status: active > inet 192.0.2.1 netmask 0xff00 broadcast 192.0.2.255 > > $ doas route -T1 add 192.0.2.2/32 -link -iface vlan12 I wonder if the problem isn't in the validation of these parameters. Should we accept a L2 (-link) entry on a routing table which isn't the routing domain? If so why does the entry persist in the ARP cache? Can you reproduce the problem if you don't specify T1? 
> add host 192.0.2.2/32: gateway vlan12 > > $ route -T1 -n show -inet > DestinationGatewayFlags Refs Use Mtu Prio Iface > 192.0.2.2 link#8 UHLS 00 - 8 vlan12 > > $ route -n show -inet > Internet: > DestinationGatewayFlags Refs Use Mtu Prio Iface > 192.0.2/24 192.0.2.1 UCn00 - 4 vlan12 > 192.0.2.1 02:00:00:ef:3d:d7 UHLl 00 - 1 vlan12 > 192.0.2.255192.0.2.1 UHb00 - 1 vlan12 > > $ doas ifconfig vlan12 down > $ doas ifconfig vlan12 destroy > > $ route -T1 -n show -inet > DestinationGatewayFlags Refs Use Mtu Prio Iface > 192.0.2.2 link#8 UHLS 00 - 8 (null) > > $ doas route -T1 del 192.0.2.2/32 > > login: panic: kernel diagnostic assertion "ifp != NULL" failed: file > "/usr/src/sys/net/rtsock.c", line 975 > Stopped at db_enter+0x10: popq%rbp > TIDPIDUID PRFLAGS PFLAGS CPU COMMAND > *189431 84402 00x13 00 route > db_enter() at db_enter+0x10 > panic(81dcc1d7) at panic+0x12a > __assert(81e32678,81e40e69,3cf,81d9f5fd) at > __assert+0x > 2b > rtm_output(80071480,8e77ce80,8e77cdd8,40,1) at > rtm_outp > ut+0x7ee > route_output(fd801ef36c00,fd801af0d698,0,0) at route_output+0x3c3 > route_usrreq(fd801af0d698,9,fd801ef36c00,0,0,8e720540) at > route > _usrreq+0x21a > sosend(fd801af0d698,0,8e77d0d8,0,0,0) at sosend+0x35b > dofilewritev(8e720540,3,8e77d0d8,0,8e77d1b0) at > dofilew > ritev+0x14d > sys_write(8e720540,8e77d150,8e77d1b0) at > sys_write+0x51 > > syscall(8e77d220) at syscall+0x315 > Xsyscall() at Xsyscall+0x128 > end of kernel > end trace frame: 0x7f7d35b0, count: 4 > https://www.openbsd.org/ddb.html describes the minimum info required in bug > reports. Insufficient info makes it difficult to find and fix bugs. > ddb> >
Re: VPS crash to kernel panic on boot
On 26/11/20(Thu) 09:21, AIsha Tammy wrote:
> On 11/26/20 6:51 AM, Martin Pieuchot wrote:
> > On 25/11/20(Wed) 19:41, AIsha Tammy wrote:
> >> Replicable bug that has happened from sysupgrading to snapshot.
> >> VPS was working perfectly until this sysupgrade.
> >>
> >> VPS boots - drops to kernel panic ddb
> >>
> >> Seems to be some mutex issue?
> >> Had to manually copy information cuz weird web console, so my apologies
> >> if this isn't enough information.
> > What is the date of the snapshots? If you can reproduce this could you
> > give us the output of the "trace" command?
> >
> > Thanks,
> > Martin
> >
>
> Yes, reproducible crashes on multiple reboots.

Thanks, the diff below should fix it, could you test it?

Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.151
diff -u -p -r1.151 uvm_page.c
--- uvm/uvm_page.c	24 Nov 2020 13:49:09 -	1.151
+++ uvm/uvm_page.c	26 Nov 2020 17:17:55 -
@@ -180,7 +180,7 @@ uvm_page_init(vaddr_t *kvm_startp, vaddr
 	TAILQ_INIT(&uvm.page_active);
 	TAILQ_INIT(&uvm.page_inactive_swp);
 	TAILQ_INIT(&uvm.page_inactive_obj);
-	mtx_init(&uvm.pageqlock, IPL_NONE);
+	mtx_init(&uvm.pageqlock, IPL_VM);
 	mtx_init(&uvm.fpageqlock, IPL_VM);
 	uvm_pmr_init();
Re: kernel panic when removing interface
On 26/11/20(Thu) 20:38, Pierre Emeriaud wrote: > Hello Martin > > Le jeu. 26 nov. 2020 à 14:27, Martin Pieuchot a écrit : > > > > > > > > $ doas route -T1 add 192.0.2.2/32 -link -iface vlan12 > > > > I wonder if the problem isn't in the validation of these parameters. > > > > Should we accept a L2 (-link) entry on a routing table which isn't the > > routing domain? If so why does the entry persist in the ARP cache? > > Which arp entry are you referring to? The one from the route I added? Yes. In the kernel ARP entries are represented as route entries. So when you add a "-link" route it is an ARP entry. > > Can you reproduce the problem if you don't specify T1? > > No. The routes are correctly removed when the interface is destroyed. > It only crashes when the routes are added to another (non-empty if > that matters) rdomain, but again, this was a silly mistake on my side. Still, silly mistakes should be prevented and not crash the kernel ;) > I reported it as it might be of interest to fix this for the sake of > it, but it causes almost no harm. It is, I guess a fix should go in net/rtsock.c to prevent adding "-link" entry on routing table different from ifp->if_rdomain. > PS: I've managed to crash my first router just by waiting a few > seconds - no need to remove the route - same thing as the second > router: > ddb> show panic > kernel diagnostic assertion "ifp != NULL" failed: file > "/usr/src/sys/netinet/if > _ether.c", line 718 > > ddb> trace > db_enter() at db_enter+0x10 > panic(81dc761f) at panic+0x12a > __assert(81e321c2,81db9f2b,2ce,81d9e429) at > __assert+0x > 2b > arp_rtrequest(fd800baa10a8,fd800baa10a8,fd801aa63dc0) at > arp_rtrequ > est > arptimer(8216a090) at arptimer+0x67 > softclock_thread(8000ea40) at softclock_thread+0x13f > end trace frame: 0x0, count: -6
Re: kernel panic when removing interface
On 27/11/20(Fri) 15:47, Denis Fondras wrote:
> > It is, I guess a fix should go in net/rtsock.c to prevent adding "-link"
> > entry on routing table different from ifp->if_rdomain.
> >
>
> I came up with this, which is more radical.

Which is not exactly what we want. This will prevent adding any route
on a routing table different from the rdomain.

What needs to be enforced is a check on requests coming from userland
that try to insert a "-link" route. Such a check would have the benefit
of documenting that L2 entries should only be inserted in the rdomain
table of an interface.

> Index: route.c
> ===
> RCS file: /cvs/src/sys/net/route.c,v
> retrieving revision 1.397
> diff -u -p -r1.397 route.c
> --- route.c	29 Oct 2020 21:15:27 -	1.397
> +++ route.c	27 Nov 2020 09:39:53 -
> @@ -865,6 +865,8 @@ rtrequest(int req, struct rt_addrinfo *i
>  			return (EINVAL);
>  		ifa = info->rti_ifa;
>  		ifp = ifa->ifa_ifp;
> +		if (tableid != ifp->if_rdomain)
> +			return (EINVAL);
>  		if (prio == 0)
>  			prio = ifp->if_priority + RTP_STATIC;
>
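To make the difference between the two checks concrete, here is a small
standalone sketch of the narrower validation discussed above: reject a
request only when it is a link-layer ("-link") entry aimed at a table
other than the interface's rdomain. Everything in it is an assumption
made for illustration: `struct sk_ifnet`, `SK_RTF_LLINFO` and
`sk_check_l2_table()` are hypothetical names, not the real rtsock.c
code, and where such a check would actually live is left open.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag marking a request as a link-layer (L2) entry. */
#define SK_RTF_LLINFO	0x1

/* Hypothetical, stripped-down interface: only the routing domain. */
struct sk_ifnet {
	unsigned int if_rdomain;
};

/*
 * Return EINVAL only for L2 entries that target a table different
 * from the interface's routing domain; any other request passes.
 */
static int
sk_check_l2_table(const struct sk_ifnet *ifp, unsigned int tableid,
    int flags)
{
	if ((flags & SK_RTF_LLINFO) && tableid != ifp->if_rdomain)
		return EINVAL;
	return 0;
}
```

With an interface in rdomain 0, a link-layer request for table 1 is
refused, while an ordinary route for table 1 is still accepted. That
is the distinction the more radical diff above does not make.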
Re: 6.8 GENERIC MP#1 Kernel panic on ASUS VivoBook S510U
Thanks for the report.

On 21/12/20(Mon) 17:00, Aning wrote:
> It's the second mail i try to send to mailing list. After 12 hours i still
> can't view the first one on marc.info
> It have 15 photo attachments, but all mail was less than 25 mg. Often
> protonmail responds when email wasn't received, but not this time.
> I hope this gives me excuse to upload screen photos onto mega.co.nz, sorry i
> have not established my own email service and ftp yet.
>
> Anyway here all the screen photos of ddb:
> https://mega.nz/folder/9cwCzLIL#CymzilZEOzuA9ugLPKiVeA

It seems that sleep_finish() is called with a mutex held. If you can
hit this panic again, could you type "ps /o" after getting the
"trace"?

From the output it is not clear which thread is running, and since the
trace stops (starts) at sleep_finish(), I can't figure out which code
path we're dealing with.
Re: top over SSH runaway after network drop
Hello,

On 24/12/20(Thu) 12:35, th...@liquidbinary.com wrote:
> >Synopsis:If network drops while running top over SSH, runaway process
> >Category:minor, poor handling of failure mode
> >Environment:
> System : OpenBSD 6.7
> Details : OpenBSD 6.7 (GENERIC) #5: Wed Oct 28 00:25:20 MDT 2020
>
> t...@syspatch-67-amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC
>
> Architecture: OpenBSD.amd64
> Machine : amd64
> >Description:
> If I SSH into any of various amd64 OpenBSD servers, virtual or
> physical,
> if m running a monitoring process like top, or multitail -f, on a remote
> machine
> over SSH and the network drops or client machine disconnects, the server
> process
> consumes nearly 100% of CPU and does not stop itself. I can log back in and
> kill the process, but until I do I have a CPU being consumed. This affects
> performance, possibly costing money on a virtual server. This behavior is
> years old.

Did you try to reproduce this bug on -current? Is it still there?

If it is, could you please ktrace(1) the program consuming 100% of CPU
before killing it? Then add the kdump(1) output to this bug report so
we have an idea of what it is doing and hopefully what needs to be
fixed.

Thanks for your report.
firefox pledge violation
Firefox from -current, tab crashes, kernel says: firefox[86270]: pledge "", syscall 289 Trace is: #0 shmget () at /tmp/-:3 #1 0x0b38d9347d7b in ?? () from /usr/X11R6/lib/modules/dri/swrast_dri.so #2 0x0b38d994ac4b in ?? () from /usr/X11R6/lib/modules/dri/swrast_dri.so #3 0x0b38d8c79eb0 in ?? () from /usr/X11R6/lib/modules/dri/swrast_dri.so #4 0x0b38d8c7aa2b in ?? () from /usr/X11R6/lib/modules/dri/swrast_dri.so #5 0x0b38d8ce44ed in ?? () from /usr/X11R6/lib/modules/dri/swrast_dri.so #6 0x0b38d8ce553e in ?? () from /usr/X11R6/lib/modules/dri/swrast_dri.so #7 0x0b38d8c7bfa1 in ?? () from /usr/X11R6/lib/modules/dri/swrast_dri.so #8 0x0b38d925495a in ?? () from /usr/X11R6/lib/modules/dri/swrast_dri.so #9 0x0b3808396dea in drisw_bind_context (context=0xb37ef35a600, old=, draw=, read=) at /usr/xenocara/lib/mesa/mk/libGL/../../src/glx/drisw_glx.c:394 #10 0x0b380839b30e in MakeContextCurrent (dpy=0xb38afd6, draw=14680067, read=14680067, gc_user=0xb37ef35a600) at /usr/xenocara/lib/mesa/mk/libGL/../../src/glx/glxcurrent.c:220 #11 0x0b38c7109b3a in mozilla::gl::GLContextGLX::MakeCurrentImpl() const () from /usr/local/lib/firefox/libxul.so.99.0 #12 0x0b38c7113f1a in mozilla::gl::GLContext::InitImpl() () from /usr/local/lib/firefox/libxul.so.99.0 #13 0x0b38c7113e58 in mozilla::gl::GLContext::Init() () from /usr/local/lib/firefox/libxul.so.99.0 #14 0x0b38c7109aab in mozilla::gl::GLContextGLX::Init() () from /usr/local/lib/firefox/libxul.so.99.0 ---Type to continue, or q to quit--- #15 0x0b38c71098e5 in mozilla::gl::GLContextGLX::CreateGLContext(mozilla::gl::GLContextDesc const&, _XDisplay*, unsigned long, __GLXFBConfigRec*, bool, gfxXlibSurface*) () from /usr/local/lib/firefox/libxul.so.99.0 #16 0x0b38c710a8bc in mozilla::gl::GLContextProviderGLX::CreateHeadless(mozilla::gl::GLContextCreateDesc const&, nsTSubstring*) () from /usr/local/lib/firefox/libxul.so.99.0 #17 0x0b38c80d977b in mozilla::WebGLContext::CreateAndInitGL(bool, std::__1::vector >*) () from 
/usr/local/lib/firefox/libxul.so.99.0 #18 0x0b38c80da009 in mozilla::WebGLContext::Create(mozilla::HostWebGLContext&, mozilla::webgl::InitContextDesc const&, mozilla::webgl::InitContextResult*) () from /usr/local/lib/firefox/libxul.so.99.0 #19 0x0b38c80699c1 in mozilla::ClientWebGLContext::CreateHostContext(mozilla::avec2 const&) () from /usr/local/lib/firefox/libxul.so.99.0 #20 0x0b38c806c502 in mozilla::ClientWebGLContext::SetDimensions(int, int) () from /usr/local/lib/firefox/libxul.so.99.0 #21 0x0b38c80677d7 in mozilla::dom::CanvasRenderingContextHelper::UpdateContext(JSContext*, JS::Handle, mozilla::ErrorResult&) () from /usr/local/lib/firefox/libxul.so.99.0 #22 0x0b38c8067579 in mozilla::dom::CanvasRenderingContextHelper::GetContext(JSContext*, nsTSubstring const&, JS::Handle, mozilla::ErrorResult&) () from /usr/local/lib/firefox/libxul.so.99.0 #23 0x0b38c7f48113 in mozilla::dom::HTMLCanvasElement_Binding::getContext(JS Context*, JS::Handle, void*, JSJitMethodCallArgs const&) () from /usr/local/lib/firefox/libxul.so.99.0 #24 0x0b38c80034cc in bool mozilla::dom::binding_detail::GenericMethod(JSContext*, unsigned int, JS::Value*) () from /usr/local/lib/firefox/libxul.so.99.0 #25 0x0b38ca7695e5 in js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct, js::CallReason) () from /usr/local/lib/firefox/libxul.so.99.0 #26 0x0b38ca765cbb in Interpret(JSContext*, js::RunState&) () from /usr/local/lib/firefox/libxul.so.99.0 #27 0x0b38ca75c022 in js::RunScript(JSContext*, js::RunState&) () from /usr/local/lib/firefox/libxul.so.99.0 #28 0x0b38ca7696ec in js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct, js::CallReason) () from /usr/local/lib/firefox/libxul.so.99.0 #29 0x0b38ca769e2a in js::Call(JSContext*, JS::Handle, JS::Handle, js::AnyInvokeArgs const&, JS::MutableHandle, js::CallReason) () from /usr/local/lib/firefox/libxul.so.99.0 #30 0x0b38cad6ae6d in js::jit::InvokeFunction(JSContext*, JS::Handle, bool, bool, 
unsigned int, JS::Value*, JS::MutableHandle) () from /usr/local/lib/firefox/libxul.so.99.0 #31 0x0b38cad6b20a in js::jit::InvokeFromInterpreterStub(JSContext*, js::jit::InterpreterStubExitFrameLayout*) () from /usr/local/lib/firefox/libxul.so.99.0
Re: panic: uao_fin_swhash_elt: can't allocate entry
On 22/02/21(Mon) 13:48, Stuart Henderson wrote:
> Not much information on this but it's an unusual one so I thought I'd
> post in case it's of interest to anyone. (Re-typed from a screen photo,
> it's remote and used by non-technical people, this is all I have).
>
> panic: uao_fin_swhash_elt: can't allocate entry
> Stopped at db_enter+0x10: popq %rbp
> TID PID UID PRFLAGS PFLAGS CPU COMMAND
> 38724523522 10010x100 0 sh
> *428940 98261 0 0x14000 0x200 1K pagedaemon
> db_enter+0x10
> panic+0x12a
> uao_set_swslot(fd80c1ecc980,150,1f4d1) at uao_set_swslot+0x1a1
> uvmpd_scan_inactive(82188790) at uvmpd_scan_inactive+0x537
> uvmpd_scan+0x9f
> uvm_pageout(800053d0) at uvm_pageout+0x375
> end trace frame 0x0, count: 9

If it happens again, could you include "show uvmexp" and "show all
pools"?
Re: panic: uao_fin_swhash_elt: can't allocate entry
On 23/02/21(Tue) 07:53, Jonathan Matthew wrote:
> On Mon, Feb 22, 2021 at 01:48:01PM +, Stuart Henderson wrote:
> > Not much information on this but it's an unusual one so I thought I'd
> > post in case it's of interest to anyone. (Re-typed from a screen photo,
> > it's remote and used by non-technical people, this is all I have).
> >
> > panic: uao_fin_swhash_elt: can't allocate entry
>
> uao_find_swhash_elt():
>
> 	/* allocate a new entry for the bucket and init/insert it in */
> 	elt = pool_get(&uao_swhash_elt_pool, PR_NOWAIT | PR_ZERO);
> 	/*
> 	 * XXX We cannot sleep here as the hash table might disappear
> 	 * from under our feet. And we run the risk of deadlocking
> 	 * the pagedeamon. In fact this code will only be called by
> 	 * the pagedaemon and allocation will only fail if we
> 	 * exhausted the pagedeamon reserve. In that case we're
> 	 * doomed anyway, so panic.
> 	 */
> 	if (elt == NULL)
> 		panic("%s: can't allocate entry", __func__);
>
> so it sounds like the machine was so out of memory it couldn't swap.

Another hypothesis would be a kind of deadlock. Showing "ps", "all
pools" and "uvmexp" would help get a better understanding.
libunwind & static+no-pie binaries
Test program below, provided by robert@ blows up when compiled with "-static" and "-no-pie": $ c++ -no-pie -static e.cc && ./a.out Segmentation fault (core dumped) #0 libunwind::EHHeaderParser::decodeEHHdr (addressSpace=..., ehHdrStart=4211204, ehHdrEnd=4876, ehHdrInfo=...) at /usr/src/lib/libcxxabi/../libunwind/src/EHHeaderParser.hpp:60 #1 libunwind::LocalAddressSpace::findUnwindSections(unsigned long, libunwind::UnwindInfoSections&)::{lambda(dl_phdr_info*, unsigned long, void*)#1}::operator()(dl_phdr_info*, unsigned long, void*) const (this=, pinfo=0x24a058 <_static_phdr_info>, data=) at /usr/src/lib/libcxxabi/../libunwind/src/AddressSpace.hpp:598 #2 0x002110c4 in libunwind::LocalAddressSpace::findUnwindSections (this=, targetAddr=, info=...) at /usr/src/lib/libcxxabi/../libunwind/src/AddressSpace.hpp:538 #3 libunwind::UnwindCursor::setInfoBasedOnIPRegister (this=0x7f7f8a08, isReturnAddress=) at /usr/src/lib/libcxxabi/../libunwind/src/UnwindCursor.hpp:1827 #4 0x002103ee in unw_init_local (cursor=0x7f7f8a08, context=) at /usr/src/lib/libcxxabi/../libunwind/src/libunwind.cpp:82 #5 0x0020fd8c in unwind_phase1 (uc=0x20f600 <__cxxabiv1::exception_cleanup_func(_Unwind_Reason_Code, _Unwind_Exception*)>, cursor=0x247f88 +16>, exception_object=0x2749dfe60) at /usr/src/lib/libcxxabi/../libunwind/src/UnwindLevel1.c:39 #6 _Unwind_RaiseException (exception_object=0x2749dfe60) at /usr/src/lib/libcxxabi/../libunwind/src/UnwindLevel1.c:357 #7 0x0020f5f3 in __cxa_throw (thrown_object=0x2749dfe80, tinfo=0x2475c8 , dest=) at /usr/src/lib/libcxxabi/src/cxa_exception.cpp:281 #8 0x0020c38f in division(int, int) () #9 0x0020c410 in main () That means it's currently impossible to profile C++ binaries on OpenBSD, which is what we need :o) #include using namespace std; double division(int a, int b) { if (b == 0) { throw "Division by zero condition!"; } return (a/b); } int main () { int x = 50; int y = 0; double z = 0; for (uint64_t n = 40; n > 0; n--) { try { z = division(x, y); } catch 
(const char* msg) { } } return 0; }
Signal & half stopped process
When debugging a multi-threaded process with egdb(1), exiting the
debugger generally results in this:

  PID    TID PRI NICE SIZE  RES STATE  WAIT   TIME   CPU COMMAND
15448 242044  10    0  64M 179M idle   fsleep 0:11 0.00% soffice.bin
15448 251679  10    0  64M 179M stop/2 -      0:00 0.00% soffice.bin
15448 367261   2    0  64M 179M stop/3 -      0:00 0.00% soffice.bin
15448 203267   2    0  64M 179M stop/0 -      0:00 0.00% soffice.bin
15448 128499  10    0  64M 179M stop/1 -      0:00 0.00% soffice.bin
15448 369455   2    0  64M 179M stop/1 -      0:00 0.00% soffice.bin

One or more threads are still in 'stop'. I need to manually send a
SIGCONT for the process to exit.

Any idea?
Re: Signal & half stopped process
On 19/11/19(Tue) 11:22, Martin Pieuchot wrote:
> When debugging a multi-threaded process with egdb(1), exiting the
> debugger generally result in this:
>
>   PID    TID PRI NICE SIZE  RES STATE  WAIT   TIME   CPU COMMAND
> 15448 242044  10    0  64M 179M idle   fsleep 0:11 0.00% soffice.bin
> 15448 251679  10    0  64M 179M stop/2 -      0:00 0.00% soffice.bin
> 15448 367261   2    0  64M 179M stop/3 -      0:00 0.00% soffice.bin
> 15448 203267   2    0  64M 179M stop/0 -      0:00 0.00% soffice.bin
> 15448 128499  10    0  64M 179M stop/1 -      0:00 0.00% soffice.bin
> 15448 369455   2    0  64M 179M stop/1 -      0:00 0.00% soffice.bin
>
> One or many threads are still in 'stop'. I need to manually send a SIGCONT
> for the process to exit.
>
> Any idea?

After reading the kernel ptrace(2) and signal code, it seems to me that
PT_DETACH doesn't handle multi-threaded processes that are in SSTOP
correctly.

This makes me wonder if `p_xstat' shouldn't be moved to "struct process".
Re: USB removal kernel panic
Thanks for the report.

> ddb{0}>
> memcpy(80165000,fd804da1f728,8d8,80165000,b5bd47118ed5c95a,80165000) at memcpy+0x15
> uvideo_vs_cb(fd80778f2870,801667d8,0) at uvideo_vs_cb+0x8b
> usb_transfer_complete(fd80778f2870) at usb_transfer_complete+0x20f
> xhci_event_dequeue(800af000) at xhci_event_dequeue+0x103
> xhci_softintr(800af000) at xhci_softintr+0x2d
> softintr_dispatch(1) at softintr_dispatch+0xf2
> Xsoftnet(0,819c05e0,0,18041969,80,a) at Xsoftnet+0x1f
> Xspllower(0,0,c7ef80837208d4cc,8159c000,81983ee1,708000) at Xspllower+0x19
> free(8159c000,2,708000) at free+0x160
> uvideo_detach(80165000,1) at uvideo_detach+0x71
> config_detach(80165000,1) at config_detach+0x152
> usbd_detach(80137500,80086d00) at usbd_detach+0x5a
> uhub_port_connect(80086d00,4,2a0,286) at uhub_port_connect+0x68
> uhub_explore(800a9500) at uhub_explore+0x23d
> usb_explore(800a9400) at usb_explore+0x12b
> usb_task_thread(80001f8efb30) at usb_task_thread+0x10b
> end trace frame: 0x0, count: -16
> ddb{0}>
> memcpy(80165000,fd804da1f728,8d8,80165000,b5bd47118

It seems that the pipes aren't closed when uvideo_detach() is called.
This is similar to the recent race fixed in uhidev(4). It would be
great to find a generic way of handling this situation.
uhidev_detach() calls vdevgone() for example...
Re: USB removal kernel panic
On 15/01/20(Wed) 20:26, Vadim Zhukov wrote:
> I have a diff or two for that, will send when I'll come home.

After discussing the issue with Peter Stuge, we figured out that the
free should happen *after* calling config_detach() for the child device
(video(4)).  When video(4) is detached it will call:

	vdevgone()->videoclose()->uvideo_close()

This last function will sleep until all I/O is finished or cancelled as
part of usbd_pipe_close(9).

Diff below should fix the issue.

Index: dev/video.c
===
RCS file: /cvs/src/sys/dev/video.c,v
retrieving revision 1.42
diff -u -p -r1.42 video.c
--- dev/video.c	6 Oct 2019 17:13:10 -	1.42
+++ dev/video.c	15 Jan 2020 19:11:20 -
@@ -463,9 +463,6 @@ videodetach(struct device *self, int fla
 	struct video_softc *sc = (struct video_softc *)self;
 	int maj, mn;
 
-	if (sc->sc_fbuffer != NULL)
-		free(sc->sc_fbuffer, M_DEVBUF, sc->sc_fbufferlen);
-
 	/* locate the major number */
 	for (maj = 0; maj < nchrdev; maj++)
 		if (cdevsw[maj].d_open == videoopen)
@@ -474,6 +471,8 @@ videodetach(struct device *self, int fla
 	/* Nuke the vnodes for any open instances (calls close). */
 	mn = self->dv_unit;
 	vdevgone(maj, mn, mn, VCHR);
+
+	free(sc->sc_fbuffer, M_DEVBUF, sc->sc_fbufferlen);
 
 	return (0);
 }
Index: dev/usb/uvideo.c
===
RCS file: /cvs/src/sys/dev/usb/uvideo.c,v
retrieving revision 1.205
diff -u -p -r1.205 uvideo.c
--- dev/usb/uvideo.c	14 Oct 2019 09:20:48 -	1.205
+++ dev/usb/uvideo.c	15 Jan 2020 19:09:48 -
@@ -644,10 +644,10 @@ uvideo_detach(struct device *self, int f
 	/* Wait for outstanding requests to complete */
 	usbd_delay_ms(sc->sc_udev, UVIDEO_NFRAMES_MAX);
 
-	uvideo_vs_free_frame(sc);
-
 	if (sc->sc_videodev != NULL)
 		rv = config_detach(sc->sc_videodev, flags);
+
+	uvideo_vs_free_frame(sc);
 
 	return (rv);
 }
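For readers following along, the ordering argument can be illustrated
outside the kernel.  What follows is only a userland sketch under stated
assumptions, not the driver code: `struct toy_softc' and every toy_*()
function are hypothetical names that model "a completion callback copies
into a buffer" and "closing the pipe stops further completions".

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a driver softc; not the real uvideo(4) layout. */
struct toy_softc {
	unsigned char	*frame;		/* buffer written by the completion */
	size_t		 framelen;
	int		 io_pending;	/* 1 while a transfer may still complete */
	int		 bad_writes;	/* completions that found no buffer */
};

/* Models a transfer completion: copies data into the frame buffer. */
void
toy_io_complete(struct toy_softc *sc, const void *data, size_t len)
{
	if (!sc->io_pending)
		return;			/* transfer was cancelled */
	if (sc->frame == NULL) {
		sc->bad_writes++;	/* kernel analogue: copy into freed memory */
		return;
	}
	if (len > sc->framelen)
		len = sc->framelen;
	memcpy(sc->frame, data, len);
}

/* Models closing the pipes: no completion has any effect afterwards. */
void
toy_close(struct toy_softc *sc)
{
	sc->io_pending = 0;
}

void
toy_free_frame(struct toy_softc *sc)
{
	free(sc->frame);
	sc->frame = NULL;
	sc->framelen = 0;
}

/* Old order: free first, then detach the child (which closes the pipes). */
int
toy_detach_free_first(struct toy_softc *sc)
{
	toy_free_frame(sc);
	toy_io_complete(sc, "x", 1);	/* a late completion sneaks in */
	toy_close(sc);
	return sc->bad_writes;
}

/* New order: detach the child first, free once I/O is quiesced. */
int
toy_detach_close_first(struct toy_softc *sc)
{
	toy_close(sc);
	toy_io_complete(sc, "x", 1);	/* ignored: pipe already closed */
	toy_free_frame(sc);
	return sc->bad_writes;
}

struct toy_softc *
toy_attach(size_t len)
{
	struct toy_softc *sc = calloc(1, sizeof(*sc));

	assert(sc != NULL);
	sc->frame = malloc(len);
	assert(sc->frame != NULL);
	sc->framelen = len;
	sc->io_pending = 1;
	return sc;
}
```

In this model toy_detach_free_first() reports one completion that found
the buffer already gone, while toy_detach_close_first() reports none.
The real fix relies on uvideo_close() sleeping until I/O is done, which
the sketch reduces to clearing a flag.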
make(1) regression
Diff below enables a ptrace(2) regress coming from NetBSD. With usr.bin/make built since -D2020-01-14, that includes -current, it complains during the last test: make: Child (52049) not in table? FAILED That results in a failing test, however the syscall correctly reports EBUSY. Should I commit this first to help you look at the issue? Index: Makefile === RCS file: /cvs/src/regress/lib/libc/sys/Makefile,v retrieving revision 1.2 diff -u -p -r1.2 Makefile --- Makefile13 Jan 2020 17:06:56 - 1.2 +++ Makefile14 Jan 2020 16:01:50 - @@ -30,8 +30,8 @@ PROGS += t_access t_bind t_chroot t_cloc PROGS += t_getgroups t_getitimer t_getlogin t_getpid t_getrusage PROGS += t_getsid t_getsockname t_gettimeofday t_kill t_link t_listen PROGS += t_mkdir t_mknod t_msgctl t_msgget t_msgsnd t_msync t_pipe -PROGS += t_poll t_revoke t_select t_sendrecv t_setuid t_socketpair -PROGS += t_sigaction t_truncate t_umask t_write +PROGS += t_poll t_ptrace t_revoke t_select t_sendrecv t_setuid +PROGS += t_socketpair t_sigaction t_truncate t_umask t_write # failing tests .if 0 @@ -40,7 +40,6 @@ PROGS += t_mlock PROGS += t_mmap PROGS += t_msgrcv PROGS += t_pipe2 -PROGS += t_ptrace PROGS += t_stat PROGS += t_syscall PROGS += t_unlink @@ -57,8 +56,9 @@ setup-t_truncate: ${SUDO} touch truncate_test.root_owned ${SUDO} chown root:wheel truncate_test.root_owned -run-t_chroot: cleanup-t_chroot -cleanup-t_chroot: +run-t_chroot: cleanup-dir +run-t_ptrace: cleanup-dir +cleanup-dir: ${SUDO} rm -rf dir CLEANFILES = access dummy mmap truncate_test.root_owned @@ -100,3 +100,5 @@ run-${PROG}-$n: .endif .include + +clean: cleanup-dir Index: README === RCS file: /cvs/src/regress/lib/libc/sys/README,v retrieving revision 1.2 diff -u -p -r1.2 README --- README 22 Nov 2019 15:59:53 - 1.2 +++ README 28 Nov 2019 17:13:08 - @@ -18,6 +18,7 @@ t_getrusage - no expected fail, PR kern/ t_mknod- remove tests for unsupported file types t_msgget - remove msgget_limit test t_poll - remove pollts_* tests +t_ptrace - change EPERM -> 
EINVAL for PT_ATTACH of a parent t_revoke - remove basic tests, revoke only on ttys supported t_select - remove sigset_t struct as it is int on OpenBSD @@ -26,7 +27,6 @@ t_mlock - wrong errno, succeeds where n t_mmap - ENOTBLK on test NetBSD is skipping, remove mmap_va0 test t_msgrcv - msgrcv(id, &r, 3 - 1, 0x41, 004000) != -1 t_pipe2- closefrom(4) == -1, remove F_GETNOSIGPIPE and nosigpipe test -t_ptrace - ptrace(0, 0, ((void *)0), 0) != -1 t_stat - invalid GID with doas t_syscall - SIGSEGV t_unlink - wrong errno according to POSIX Index: macros.h === RCS file: /cvs/src/regress/lib/libc/sys/macros.h,v retrieving revision 1.1.1.1 diff -u -p -r1.1.1.1 macros.h --- macros.h19 Nov 2019 19:57:03 - 1.1.1.1 +++ macros.h29 Jan 2020 12:45:56 - @@ -9,6 +9,7 @@ #include #include +#include #define __RCSID(str) #define __COPYRIGHT(str) @@ -26,17 +27,26 @@ int sysctlbyname(char *, void *, size_t int sysctlbyname(char* s, void *oldp, size_t *oldlenp, void *newp, size_t newlen) { - int ktc; - if (strcmp(s, "kern.timecounter.hardware") == 0) - ktc = KERN_TIMECOUNTER_HARDWARE; - else if (strcmp(s, "kern.timecounter.choice") == 0) - ktc = KERN_TIMECOUNTER_CHOICE; +int mib[3], miblen; -int mib[3]; mib[0] = CTL_KERN; - mib[1] = KERN_TIMECOUNTER; - mib[2] = ktc; -return sysctl(mib, 3, oldp, oldlenp, newp, newlen); + if (strcmp(s, "kern.timecounter.hardware") == 0) { + mib[1] = KERN_TIMECOUNTER; + mib[2] = KERN_TIMECOUNTER_HARDWARE; + miblen = 3; + } else if (strcmp(s, "kern.timecounter.choice") == 0) { + mib[1] = KERN_TIMECOUNTER; + mib[2] = KERN_TIMECOUNTER_CHOICE; + miblen = 3; + } else if (strcmp(s, "kern.securelevel") == 0) { + mib[1] = KERN_SECURELVL; + miblen = 2; + } else { + fprintf(stderr, "%s(): mib '%s' not supported\n", __func__, s); + return -42; + } + +return sysctl(mib, miblen, oldp, oldlenp, newp, newlen); } /* t_mlock.c */ Index: t_ptrace.c === RCS file: /cvs/src/regress/lib/libc/sys/t_ptrace.c,v retrieving revision 1.1.1.1 diff -u -p -r1.1.1.1 t_ptrace.c --- 
t_ptrace.c 19 Nov 2019 19:57:04 - 1.1.1.1 +++ t_ptrace.c 29 Jan 2020 12:54:05 - @@ -171
Re: make(1) regression
On 29/01/20(Wed) 15:00, Marc Espie wrote:
> On Wed, Jan 29, 2020 at 02:04:06PM +0100, Martin Pieuchot wrote:
> > Diff below enables a ptrace(2) regress coming from NetBSD.
> >
> > With usr.bin/make built since -D2020-01-14, that includes -current, it
> > complains during the last test:
> >
> > make: Child (52049) not in table?
> > FAILED
> >
> > That results in a failing test, however the syscall correctly reports
> > EBUSY.
> >
> > Should I commit this first to help you look at the issue?
>
> At first I thought forgetting to handle WIFSTOPPED might explain things.
>
> But looking more closely, I think the changes in make just made a system
> bug more apparent.

Indeed I can reproduce it.  Thanks for hunting that down!
i915/drm vs WITNESS
Some warnings reported by WITNESS: witness: lock order reversal: 1st 0x81332b38 &rq->lock (&rq->lock) 2nd 0x806a0050 rcs0 (&timeline->lock) lock order "&timeline->lock"(mutex) -> "&rq->lock"(mutex) first seen at: #0 witness_checkorder+0x449 #1 mtx_enter+0x34 #2 __i915_request_submit+0x5b #3 __execlists_submission_tasklet+0x1b9 #4 execlists_submit_request+0x1d1 #5 submit_notify+0x37 #6 __i915_sw_fence_complete+0x40 #7 i915_request_add+0x2d3 #8 i915_gem_init+0x2b9 #9 i915_driver_load+0x81b #10 inteldrm_attachhook+0x2c #11 config_process_deferred_mountroot+0x6b #12 main+0x755 #13 longmode_hi+0x9c lock order "&rq->lock"(mutex) -> "&timeline->lock"(mutex) first seen at: #0 witness_checkorder+0x449 #1 mtx_enter+0x34 #2 execlists_submit_request+0x2a #3 submit_notify+0x37 #4 __i915_sw_fence_complete+0x40 #5 dma_i915_sw_fence_wake+0x1d #6 notify_ring+0x1a8 #7 gen8_gt_irq_handler+0xba #8 gen8_irq_handler+0x114 #9 intr_handler+0x6e #10 Xintr_ioapic_edge16_untramp+0x19f #11 acpicpu_idle+0x1d2 #12 sched_idle+0x225 #13 proc_trampoline+0x1c witness: lock order reversal: 1st 0x81332678 &wqh->lock (&wqh->lock) 2nd 0x806a0050 rcs0 (&timeline->lock) lock order "&wqh->lock"(mutex) -> "&timeline->lock"(mutex) first seen at: #0 witness_checkorder+0x449 #1 mtx_enter+0x34 #2 execlists_submit_request+0x2a #3 submit_notify+0x37 #4 __i915_sw_fence_complete+0x40 #5 i915_sw_fence_wake+0x39 #6 __i915_sw_fence_complete+0x131 #7 dma_i915_sw_fence_wake+0x1d #8 notify_ring+0x1a8 #9 gen8_gt_irq_handler+0xba #10 gen8_irq_handler+0x114 #11 intr_handler+0x6e #12 Xintr_ioapic_edge16_untramp+0x19f #13 acpicpu_idle+0x1d2 #14 sched_idle+0x225 #15 proc_trampoline+0x1c witness: acquiring duplicate lock of same type: "&wqh->lock" 1st &wqh->lock 2nd &wqh->lock Starting stack trace... 
witness_checkorder(81333980,9,0) at witness_checkorder+0x6ba mtx_enter(81333970) at mtx_enter+0x34 __i915_sw_fence_complete(81333970,800022280270) at __i915_sw_fence_complete+0x58 i915_sw_fence_wake(813339c8,1,0,800022280270) at i915_sw_fence_wake+0x39 __i915_sw_fence_complete(81332668,0) at __i915_sw_fence_complete+0x131 dma_i915_sw_fence_wake(813322c8,81355b20) at dma_i915_sw_fence_wake+0x1d notify_ring(80a75000) at notify_ring+0x1a8 gen8_gt_irq_handler(80154000,2,8000222803b0) at gen8_gt_irq_handler+0xba gen8_irq_handler(0,80154078) at gen8_irq_handler+0x114 intr_handler(800022280450,8013fd00) at intr_handler+0x6e Xintr_ioapic_edge16_untramp() at Xintr_ioapic_edge16_untramp+0x19f acpicpu_idle() at acpicpu_idle+0x1d2 sched_idle(81e0) at sched_idle+0x225 end trace frame: 0x0, count: 244 End of stack trace.
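As background for reading the report above: a "lock order reversal" means
two locks were once taken in the order A then B, and later in the order
B then A.  The following is a toy sketch of that bookkeeping only.  It
is not how witness(4) is implemented, all toy_* names are hypothetical,
and it only catches direct pairs, not longer chains.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAXORDERS 32

/* One recorded "first was held while second was acquired" observation. */
struct toy_order {
	const char *first;
	const char *second;
};

struct toy_witness {
	struct toy_order orders[TOY_MAXORDERS];
	int n;
};

/*
 * Record that `second' was acquired while `first' was held.
 * Returns 1 if the opposite order was recorded earlier (a reversal),
 * 0 otherwise.
 */
int
toy_checkorder(struct toy_witness *w, const char *first, const char *second)
{
	int i, known = 0;

	for (i = 0; i < w->n; i++) {
		if (strcmp(w->orders[i].first, second) == 0 &&
		    strcmp(w->orders[i].second, first) == 0)
			return 1;
		if (strcmp(w->orders[i].first, first) == 0 &&
		    strcmp(w->orders[i].second, second) == 0)
			known = 1;
	}
	/* Remember a new order the first time it is seen. */
	if (!known && w->n < TOY_MAXORDERS) {
		w->orders[w->n].first = first;
		w->orders[w->n].second = second;
		w->n++;
	}
	return 0;
}
```

Feeding it "&timeline->lock" then "&rq->lock", followed by the same pair
swapped, makes the second call return 1, which mirrors the first warning
in the report.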
Re: NSD sendto issue
On 17/02/20(Mon) 14:55, Joerg Jung wrote: > > > On 26. Sep 2019, at 15:02, Stuart Henderson wrote: > > On 2019/09/26 13:45, Stuart Henderson wrote: > >> On 2019/09/26 11:16, Joerg Jung wrote: > >>> Hi, > >>> > >>> I run a few busy (~800 req/s) NSD servers which I upgraded > >>> to 6.5, all stock/default OpenBSD, e.g. I’ve not tweaked any > >>> sysctl values and nsd.conf matches the default as well, just > >>> added a few hundred zones. > >>> > >>> Now, when I increase servers from default 1 to 2 in nsd.conf: > >>> server-count: 2 > >>> it starts spamming my log with: > >>> nsd[62723]: sendto 1.2.3.4 failed: Resource temporarily unavailable > >>> > >>> checking the source, server.c seems not to handle EAGAIN > >>> after sendto() and does not recover or retry, it just increases > >>> txerr statistic count - so answer seems really lost :( > >>> > >>> I tried higher debug level, as well as increasing socket buffers to: > >>> net.inet.udp.recvspace= 65536 > >>> net.inet.udp.sendspace=65636 > >>> but both didn’t help and netstat -s -p udp does show > >>> 0 dropped due to full socket buffers > >>> anyways. So, I don’t believe this is a socket buffer issue. > >>> > >>> The same server-count: 2 setting worked fine with 6.3. > >>> > >>> Any hints, insights, or pointers? > >>> Does anyone else experience the same? > >>> > >>> Thanks, > >>> Regards, > >>> Joerg > >> > >> Maybe it's worth trying to track down further whether this is due to an > >> NSD change or something else in the OS - cvs up -r OPENBSD_6_3 .. (be sure > >> to use "make -f Makefile.bsd-wrapper [..]" when building). > >> > > > > Or, following a comment from claudio@, try a kernel built with this: > > FYI, I tried that diff and a few other things but neither did help. Did you ktrace(1) the problem? How is sendto(2) called, in particular is there any MSG_DONTWAIT or FNONBLOCK set on the file descriptor? Does that mean the kernel returns EWOULDBLOCK even if the userland said it is fine to block? 
> > > Index: syscalls.master > > === > > RCS file: /cvs/src/sys/kern/syscalls.master,v > > retrieving revision 1.189 > > diff -u -p -r1.189 syscalls.master > > --- syscalls.master 11 Jan 2019 18:46:30 - 1.189 > > +++ syscalls.master 26 Sep 2019 13:01:46 - > > @@ -261,7 +261,7 @@ > > 130 OBSOL oftruncate > > 131 STD { int sys_flock(int fd, int how); } > > 132 STD { int sys_mkfifo(const char *path, mode_t mode); } > > -133STD NOLOCK { ssize_t sys_sendto(int s, const void *buf, \ > > +133STD { ssize_t sys_sendto(int s, const void *buf, \ > > size_t len, int flags, const struct sockaddr *to, \ > > socklen_t tolen); } > > 134 STD { int sys_shutdown(int s, int how); } > > > > > > Run "make syscalls" in sys/kern before building. >
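The question about MSG_DONTWAIT/FNONBLOCK can also be checked from
userland.  Below is a minimal sketch using only POSIX calls: one helper
reports whether a descriptor has O_NONBLOCK set, the other retries
sendto(2) a bounded number of times on EAGAIN.  Both fd_is_nonblocking()
and sendto_retry() are hypothetical helpers written for this note; they
are not NSD code and say nothing about how NSD should be fixed.

```c
#include <sys/types.h>
#include <sys/socket.h>

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>

/* Returns 1 if the descriptor has O_NONBLOCK set, 0 if not, -1 on error. */
int
fd_is_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags == -1)
		return -1;
	return (flags & O_NONBLOCK) ? 1 : 0;
}

/*
 * Retry sendto(2) up to max_tries times when it fails with
 * EAGAIN/EWOULDBLOCK, waiting up to 100ms for the socket to become
 * writable between attempts.  Any other error is returned at once.
 */
ssize_t
sendto_retry(int s, const void *buf, size_t len, int flags,
    const struct sockaddr *to, socklen_t tolen, int max_tries)
{
	struct pollfd pfd;
	ssize_t n;
	int i;

	for (i = 0; i < max_tries; i++) {
		n = sendto(s, buf, len, flags, to, tolen);
		if (n != -1 || (errno != EAGAIN && errno != EWOULDBLOCK))
			return n;
		pfd.fd = s;
		pfd.events = POLLOUT;
		pfd.revents = 0;
		if (poll(&pfd, 1, 100) == -1 && errno != EINTR)
			return -1;
	}
	errno = EAGAIN;
	return -1;
}
```

A ktrace(1) of the daemon remains the authoritative answer, since it
shows the flags actually passed to sendto(2).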
Re: upd(4): force boolean indicator to be 0 or 1
On 27/02/20(Thu) 16:58, boudew...@indes.com wrote:
> >Synopsis:	boolean indicators in sensorsd.conf(5) are too cumbersome
> >Category:	system
> >Environment:
> 	System      : OpenBSD 6.6
> 	Details     : OpenBSD 6.6 (GENERIC.MP) #372: Sat Oct 12 10:56:27 MDT 2019
> 			 dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
>
> 	Architecture: OpenBSD.amd64
> 	Machine     : amd64
> >Description:
> 	Some upd(4) devices use -1 for "On" and some use 1.  sysctl(8) and
> 	sensorsd(8) hide this detail from the user, which makes it difficult
> 	to define low and high values in sensorsd.conf(5).

Which device reports "-1" for which usage?  Is this from any
specification or is it a workaround for your device?

Diff looks fine, although we could do simpler, see below.

Index: upd.c
===
RCS file: /cvs/src/sys/dev/usb/upd.c,v
retrieving revision 1.26
diff -u -p -r1.26 upd.c
--- upd.c	8 Apr 2017 02:57:25 -	1.26
+++ upd.c	27 Feb 2020 16:25:24 -
@@ -425,7 +425,10 @@ upd_sensor_update(struct upd_softc *sc,
 	}
 
 	hdata = hid_get_data(buf, len, &sensor->hitem.loc);
-	sensor->ksensor.value = hdata * adjust;
+	if (sensor->ksensor.type == SENSOR_INDICATOR)
+		sensor->ksensor.value = hdata ? 1 : 0;
+	else
+		sensor->ksensor.value = hdata * adjust;
 	sensor->ksensor.status = SENSOR_S_OK;
 	sensor->ksensor.flags &= ~SENSOR_FINVALID;
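The effect of the diff above can be tried in isolation.  This is a small
standalone sketch, assuming nothing beyond the logic visible in the
diff: `enum toy_sensor_type' and toy_sensor_value() are hypothetical
stand-ins for the kernel's sensor types and for the assignment in
upd_sensor_update().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sensor kinds; the kernel has its own sensor type enum. */
enum toy_sensor_type { TOY_SENSOR_INDICATOR, TOY_SENSOR_OTHER };

/*
 * Boolean indicators collapse to 0 or 1 whatever the device reported;
 * every other sensor keeps the raw value scaled by `adjust'.
 */
int64_t
toy_sensor_value(enum toy_sensor_type type, int32_t hdata, int64_t adjust)
{
	if (type == TOY_SENSOR_INDICATOR)
		return hdata ? 1 : 0;
	return (int64_t)hdata * adjust;
}
```

With this mapping a device reporting -1 and a device reporting 1 for
"On" both end up as 1, so a single low/high pair in sensorsd.conf(5)
would cover both.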
Re: upd(4): force boolean indicator to be 0 or 1
On 28/02/20(Fri) 10:02, Boudewijn Dijkstra wrote: > Op Thu, 27 Feb 2020 17:30:34 +0100 schreef Martin Pieuchot > : > > On 27/02/20(Thu) 16:58, boudew...@indes.com wrote: > > > >Synopsis:boolean indicators in sensorsd.conf(5) are too > > > >cumbersome > > > >Category:system > > > >Environment: > > > System : OpenBSD 6.6 > > > Details : OpenBSD 6.6 (GENERIC.MP) #372: Sat Oct 12 10:56:27 > > > MDT 2019 > > > dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP > > > > > > Architecture: OpenBSD.amd64 > > > Machine : amd64 > > > >Description: > > > Some upd(4) devices use -1 for "On" and some use 1. sysctl(8) and > > > sensorsd(8) hide this detail from the user, which makes it difficult > > > to define low and high values in sensorsd.conf(5). > > > > Which device reports "-1" for which usage? Is this from any > > specification or is it a workaround for your device? > > In the misc@ thread I linked it was reported that different devices use > different values. My device happens to report -1 for "On". Given how > sensorsd.conf currently works, it would be most convenient if 0 and 1 were > the only possible values. You're rephrasing your diff in words. My question is: can there be any drawback to this approach? Did you check the spec? Why is your UPS returning -1 and not 1 in this case? Is this the right place to fix the bug? > > Diff looks fine, although we could do simpler, see below. > > > > Index: upd.c > > === > > RCS file: /cvs/src/sys/dev/usb/upd.c,v > > retrieving revision 1.26 > > diff -u -p -r1.26 upd.c > > --- upd.c 8 Apr 2017 02:57:25 - 1.26 > > +++ upd.c 27 Feb 2020 16:25:24 - > > @@ -425,7 +425,10 @@ upd_sensor_update(struct upd_softc *sc, > > } > > hdata = hid_get_data(buf, len, &sensor->hitem.loc); > > - sensor->ksensor.value = hdata * adjust; > > + if (sensor->ksensor.type == SENSOR_INDICATOR) > > + sensor->ksensor.value = hdata ? 
1 : 0; > > + else > > + sensor->ksensor.value = hdata * adjust; > > sensor->ksensor.status = SENSOR_S_OK; > > sensor->ksensor.flags &= ~SENSOR_FINVALID; > > Your diff is indeed simpler, but I thought it would be cleaner to not assign > 'adjust' when it's not needed. That's an improvement indeed, but it isn't related to the bug you're trying to fix ;)
Re: upd(4): force boolean indicator to be 0 or 1
On 28/02/20(Fri) 12:34, Boudewijn Dijkstra wrote: > Op Fri, 28 Feb 2020 11:14:43 +0100 schreef Martin Pieuchot > : > > On 28/02/20(Fri) 10:02, Boudewijn Dijkstra wrote: > > > Op Thu, 27 Feb 2020 17:30:34 +0100 schreef Martin Pieuchot > > > : > > > > On 27/02/20(Thu) 16:58, boudew...@indes.com wrote: > > > > > >Synopsis:boolean indicators in sensorsd.conf(5) are too > > > > > >cumbersome > > > > > >Category:system > > > > > >Environment: > > > > > System : OpenBSD 6.6 > > > > > Details : OpenBSD 6.6 (GENERIC.MP) #372: Sat Oct 12 10:56:27 > > > > > MDT 2019 > > > dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP > > > > > > > > > > Architecture: OpenBSD.amd64 > > > > > Machine : amd64 > > > > > >Description: > > > > > Some upd(4) devices use -1 for "On" and some use 1. sysctl(8) > > > > > and > > > > > sensorsd(8) hide this detail from the user, which makes it difficult > > > > > to define low and high values in sensorsd.conf(5). > > > > > > > > Which device reports "-1" for which usage? Is this from any > > > > specification or is it a workaround for your device? > > > > > > In the misc@ thread I linked it was reported that different devices use > > > different values. My device happens to report -1 for "On". Given how > > > sensorsd.conf currently works, it would be most convenient if 0 and > > > 1 were the only possible values. > > > > You're rephrasing your diff in words. My question is: can there be any > > drawback to this approach? > > They're boolean indicators, I don't think there can be any. sensorsd(8) is > the only program in base that can use upd(4) sensors. sysctl(8) already > treats them as booleans. Obviously some people will have to change their > sensorsd.conf(5) if this goes in. > > > Did you check the spec? > > I checked "Universal Serial Bus Usage Tables for HID Power Devices" > https://www.usb.org/sites/default/files/documents/pdcv10.pdf > For every boolean indicator it specifies two allowed values: 0 and 1. 
> > > Why is your UPS returning -1 and not 1 in this case? > > No idea. It's working fine. Some further testing revealed that On==-1 (and > Off==0) for all the indicators that I can easily toggle (Charging, > Discharging, ACPresent). > > > Is this the right place to fix the bug? > > I think so. It's a violation of HID Power, not generic HID, so it should be > fixed in a place where HID Power data is (first) interpreted. Thanks for checking. I committed the simpler diff. If you have any other improvement for upd(4) or any other part of the system you're welcome to submit them to tech@. Thanks again, Martin
Re: [macppc] GENERIC.MP panics under high load
On 27/03/20(Fri) 22:43, Charlene Wendling wrote:
> Hi,
>
> >Environment:
> 	System      : OpenBSD 6.6
> 	Details     : OpenBSD 6.6-current (GENERIC.MP) #676: Fri Feb 14 02:26:37 MST 2020
> 			 dera...@macppc.openbsd.org:/usr/src/sys/arch/macppc/compile/GENERIC.MP
>
> 	Architecture: OpenBSD.macppc
> 	Machine     : macppc
> >Description:
>
> Note that it's still reproducible with more recent snapshots.
>
> Running GENERIC.MP causes kernel panics if it's under high
> load. Running GENERIC causes no such issues on the two dual
> core machines belonging to the macppc ports building cluster.
>
> It's happening since early December 2019, but is occurring even
> more since the last few weeks, at a rate becoming harmful, hence my
> report.
>
> >How-To-Repeat:
>
> Start a bulk with dpb(1) with GENERIC.MP, it should panic anytime
> before 4 days. If you're lucky it will crash straight while listing
> ports.

Thanks for the report.  If you have the patience to keep gathering such
crashes, please do send the same report every time.

It is interesting to see that CPU0 is in uvm_swap_io() here.  It would
be nice to know if there's a common pattern between what seems to be a
memory corruption on CPU1 and what CPU0 is doing at that moment.

This might be an MD or MI bug, so the more information you get us the
better :o)

> >Fix:
>
> None.
> > -- > > ddb{1}> machine ddbcpu 0 > Stopped at db_enter+0x10: lwz r0,36(r1) > db_enter() at db_enter+0xc > openpic_ipi_ddb() at openpic_ipi_ddb+0xc > openpic_ext_intr() at openpic_ext_intr+0x254 > extint_call() at extint_call > --- interrupt --- > at 0xe000dffc > ttyinput(e0005a00,e0008100) at ttyinput+0x8c > zstty_rxsoft(6428,e0019000) at zstty_rxsoft+0x150 > zstty_softint(5ab65d38) at zstty_softint+0xb0 > zsc_intr_soft(ecd8) at zsc_intr_soft+0x7c > zssoft(ecd8) at zssoft+0x64 > softintr_dispatch(ec00) at softintr_dispatch+0x80 > dosoftint(1) at dosoftint+0xa4 > openpic_splx(100) at openpic_splx+0xa4 > splx(65727000) at splx+0x1c > end trace frame: 0xe629c780, count: 0 > > ddb{0}> trace > db_enter() at db_enter+0xc > openpic_ipi_ddb() at openpic_ipi_ddb+0xc > openpic_ext_intr() at openpic_ext_intr+0x254 > extint_call() at extint_call > --- interrupt --- > at 0xe000dffc > ttyinput(e0005a00,e0008100) at ttyinput+0x8c > zstty_rxsoft(6428,e0019000) at zstty_rxsoft+0x150 > zstty_softint(5ab65d38) at zstty_softint+0xb0 > zsc_intr_soft(ecd8) at zsc_intr_soft+0x7c > zssoft(ecd8) at zssoft+0x64 > softintr_dispatch(ec00) at softintr_dispatch+0x80 > dosoftint(1) at dosoftint+0xa4 > openpic_splx(100) at openpic_splx+0xa4 > splx(65727000) at splx+0x1c > tsleep(6428,92,e629c7d0,0) at tsleep+0x98 > biowait(1) at biowait+0x5c > uvm_swap_io(,0,0,2000) at uvm_swap_io+0x5f4 > uvm_swap_get(3e60590,3e60590,e629c8e0) at uvm_swap_get+0x58 > uvmfault_anonget(400,5,e629c930) at uvmfault_anonget+0x1ac > uvm_fault(6ab1e668,40f8050,e629c970,20009034) at uvm_fault+0x554 > trap(6f3b63c8) at trap+0x68c > trapagain() at trapagain+0x4 > --- trap (type 0x300) --- > at 0xe629cbf0 > ureadc(e0005a00,0) at ureadc+0x128 > ttread(6ab49338,300,e629cc90) at ttread+0x368 > zsread(f4f958,40004048,1a2454c0) at zsread+0x58 > spec_read(fe2f60) at spec_read+0x354 > ufsspec_read(2001) at ufsspec_read+0x20 > VOP_READ(925e6c,f4f680,e629cdd0,0) at VOP_READ+0x50 > vn_read(1,1,e629ce20) at vn_read+0xc4 > 
dofilereadv(6ab49338,e629ce48,e629cec0,6ab49374,2e) at dofilereadv+0xd0 > sys_read(d891b0a8,6ab49374,e629cea4) at sys_read+0x64 > trap(6ab49338) at trap+0x9f0 > trapagain() at trapagain+0x4 > --- syscall (number 3) --- > End of kernel: 0xfffcef70 > end trace frame: 0xfffcef70, count: -34 > > ddb{0}> machine ddbcpu 1 > Stopped at db_enter+0x10: lwz r0,36(r1) > db_enter() at db_enter+0xc > panic(0) at panic+0xe0 > rw_assert_rdlock(e61f9e88) at rw_assert_rdlock+0x60 > rw_exit_read(9737f8) at rw_exit_read+0x1c > if_input_process(792280,e61f9f28) at if_input_process+0x68 > ifiq_process() at ifiq_process+0x78 > taskq_thread(e0007040) at taskq_thread+0x58 > fork_trampoline() at fork_trampoline+0x14 > end trace frame: 0x0, count: 7 > > ddb{1}> trace > db_enter() at db_enter+0xc > panic(0) at panic+0xe0 > rw_assert_rdlock(e61f9e88) at rw_assert_rdlock+0x60 > rw_exit_read(9737f8) at rw_exit_read+0x1c > if_input_process(792280,e61f9f28) at if_input_process+0x68 > ifiq_process() at ifiq_process+0x78 > taskq_thread(e0007040) at taskq_thread+0x58 > fork_trampoline() at fork_t
Re: 6.6-current stutters after heavy disk loads
On 28/03/20(Sat) 09:33, Martin wrote:
> After about a week of tests on freshly installed system i can conclude that
> two things affect on stutters 6.6 amd64 with all the patches included. To
> exclude hardware related problems I've changed AMD SOC PC to a new different
> one with the exactly the same configuration.

What do you mean by "stutters"?  Could you run "top -SH -s .3" and
describe what you see when that happens?

Are those "stutters", or whatever you mean by that, also present with
GENERIC (non MP)?
Re: arpresolve: XX: route contains no arp information
On 28/03/20(Sat) 15:30, Stuart Henderson wrote: > After updating my laptop from > > OpenBSD 6.6-current (GENERIC.MP) #653: Thu Feb 20 21:40:37 MST 2020 > dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP > > to > > OpenBSD 6.6-current (GENERIC.MP) #84: Fri Mar 27 23:50:29 MDT 2020 > > I've started seeing a lot of these: > > /bsd: arpresolve: 10.15.5.1: route contains no arp information > last message repeated 436 times > > Local subnet is working, traffic going via default route is not (ping > reports "sendmsg: Invalid argument". > > Network config is dhcp running on iwm0/em0 as separate interfaces (no trunk, > just using default priorities to prefer wired) and the wlan is on the same > subnet as ethernet. This previously worked just fine. ifconfig/route > tables/dmesg > are below. > > While I'm bisecting, does anyone have an idea what might have introduced it? If that happens on the em(4) and not the iwm(4) that would indicate that one of the em(4) changes might be the cause of the regression. 
> > $ ifconfig | sed 's/IMEI [0-9]* /IMEI xxx /' > lo0: flags=8049 mtu 32768 > index 4 priority 0 llprio 3 > groups: lo > inet6 ::1 prefixlen 128 > inet6 fe80::1%lo0 prefixlen 64 scopeid 0x4 > inet 127.0.0.1 netmask 0xff00 > iwm0: > flags=a08843 mtu > 1500 > lladdr e4:a4:71:4f:84:36 > index 1 priority 4 llprio 3 > groups: wlan egress > media: IEEE802.11 autoselect (HT-MCS8 mode 11n) > status: active > ieee80211: join Y2 chan 6 bssid 04:4f:aa:0c:3a:e8 65% wpakey wpaprotos > wpa2 wpaakms psk wpaciphers ccmp wpagroupcipher ccmp > inet 10.15.5.125 netmask 0xff00 broadcast 10.15.5.255 > inet6 fe80::e6a4:71ff:fe4f:8436%iwm0 prefixlen 64 scopeid 0x1 > inet6 2a02:8011:7003:3:650f:596b:8366:9863 prefixlen 64 autoconf pltime > 604462 vltime 2591662 > inet6 2a02:8011:7003:3:cd3b:7b95:ddd3:b52d prefixlen 64 autoconf > autoconfprivacy pltime 85922 vltime 604432 > em0: flags=a08843 > mtu 1500 > lladdr c8:5b:76:cf:a8:ca > index 2 priority 0 llprio 3 > groups: egress > media: Ethernet autoselect (1000baseT full-duplex,rxpause,txpause) > status: active > inet 10.15.5.82 netmask 0xff00 broadcast 10.15.5.255 > inet6 fe80::ca5b:76ff:fecf:a8ca%em0 prefixlen 64 scopeid 0x2 > inet6 2a02:8011:7003:3:984f:bf36:3107:f0 prefixlen 64 autoconf pltime > 604462 vltime 2591662 > inet6 2a02:8011:7003:3:77ba:944a:ab32:46c2 prefixlen 64 autoconf > autoconfprivacy pltime 85608 vltime 604427 > enc0: flags=0<> > index 3 priority 0 llprio 3 > groups: enc > status: active > umb0: flags=8810 mtu 1500 > index 5 priority 6 llprio 3 > roaming disabled registration not registered > state open cell-class none > SIM not initialized PIN valid (3 attempts left) > device EM7455 IMEI XXX firmware SWI9X30C_02.24.05.06 > APN pp.vodafone.co.uk > status: down > pflog0: flags=141 mtu 33136 > index 6 priority 0 llprio 3 > groups: pflog > > > $ netstat -rnfinet > Routing tables > > Internet: > DestinationGatewayFlags Refs Use Mtu Prio Iface > default10.15.5.1 UGS2 2233 - 8 em0 > default10.15.5.1 UGS00 -12 iwm0 > 224/4 
127.0.0.1 URS0 476 32768 8 lo0 > 10.15.5/24 10.15.5.82 UCn3 128 - 4 em0 > 10.15.5/24 10.15.5.125UCn10 - 8 iwm0 > 10.15.5.1 00:00:5e:00:01:05 UHLch 18 - 7 iwm0 > 10.15.5.2 f8:b1:56:ac:32:76 UHLc 0 64 - 3 em0 > 10.15.5.5 link#2 UHLc 0 43 - 3 em0 > 10.15.5.9 dc:a6:32:03:7a:01 UHLc 0 69 - 3 em0 > 10.15.5.82 c8:5b:76:cf:a8:ca UHLl 09 - 1 em0 > 10.15.5.125e4:a4:71:4f:84:36 UHLl 07 - 1 iwm0 > 10.15.5.25510.15.5.82 UHPb 0 36 - 1 em0 > 10.15.5.25510.15.5.125UHPb 00 - 1 iwm0 > 127/8 127.0.0.1 UGRS 00 32768 8 lo0 > 127.0.0.1 127.0.0.1 UHhl 2 159 32768 1 lo0 > > > OpenBSD 6.6-current (GENERIC.MP) #84: Fri Mar 27 23:50:29 MDT 2020 > dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP > real mem = 8438898688 (8047MB) > avail mem = 8170549248 (7792MB) > mpath0 at root > scsibus0 at mpath0: 256 targets > mainbus0 at root > bios0 at mainbus0: SMBIOS rev. 2.8 @ 0xd705d000 (63 entries) > bios0: vendor LENOVO version "R02ET70W (1.43 )" date 01/28/2019 > bios0: LENOVO 20F6006YUK > acpi0 at bios0: ACPI 5.0 > acpi0: sleep states S0 S3 S4 S5 > acpi0: tables DSDT FACP UEFI SSDT SSDT ECDT HPET APIC MCFG SSD
Re: arpresolve: XX: route contains no arp information
On 29/03/20(Sun) 17:17, Stuart Henderson wrote:
> [...]
> I guess I'll just move it to a wifi network on a different vlan for now.

Well, I wouldn't be surprised if the issue is exposed by the use of two
cloning routes.  One way to move forward would be to add the name of the
interface to the error message.
Re: arpresolve: XX: route contains no arp information
On 30/03/20(Mon) 22:11, Stuart Henderson wrote:
> On 2020/03/30 10:29, Martin Pieuchot wrote:
> > On 29/03/20(Sun) 17:17, Stuart Henderson wrote:
> > > [...]
> > > I guess I'll just move it to a wifi network on a different vlan for now.
> >
> > Well I wouldn't be surprise if the issue is exposed by the use of two
> > cloning routes.  One way to move forward would be to add the name of the
> > interface in the error message.
> >
>
> The arpresolve "route contains no arp information" is on em0.
> The only entry relating to the gateway (10.15.5.1) showing in
> arp -an after boot is
>
> 10.15.5.1 (incomplete) iwm0 Expired
>
> (no entry for the gateway on em0).
>
> Oddities:
>
> - if I connect to a WPA-PSK network then iwm0 comes up quickly in
> the "starting network" stage of /etc/rc and I have the problem. But
> if I switch to a WPA-Enterprise network which doesn't connect until
> wpa_supplicant starts ("starting package daemons" stage) then I don't
> see the problem.
>
> - but it's not just timing related though: I can add "!ping -c1 1.1.1.1"
> or "!sleep 60" to hostname.em0 and still see the problem.

Are you saying that an ARP entry exists, cloned from the route attached
to iwm0, but that this entry is incomplete?  And that when trying to
reach the address pointed to by this ARP entry via em0, no other cloned
entry is created and you get the "no arp info" message?

That would imply the ARP cache isn't correctly flushed when em0 receives
an address on the same subnet as iwm0.
Re: 6.6-current stutters after heavy disk loads
On 31/03/20(Tue) 15:08, Martin wrote:
> 1. top -SH -s .3 points me that stutters arrive once process changing its
> state from 'idle' to 'active' with related disk activity.

What about %spin and %intr?

> 2. Any machine with 6.6 GENERIC.MP affected.
> 2.1. 4-core AMD GX-420CA SOC - stutters more visible;
> 2.2. 2-core Intel i7-2640m - very rare stutters when process changing its
> state from 'idle' to 'active'.
> 3. GENERIC (no MP) - stutters are minimal, after 48 hours I can see them
> very, very rare and on AMD SOC only.

Valuable information.
Re: 6.6-current stutters after heavy disk loads
On 02/04/20(Thu) 12:58, Martin wrote:
> ‐‐‐ Original Message ‐‐‐
> On Tuesday, March 31, 2020 3:27 PM, Martin Pieuchot wrote:
>
> > On 31/03/20(Tue) 15:08, Martin wrote:
> >
> > > 1. top -SH -s .3 points me that stutters arrive once process changing
> > > its state from 'idle' to 'active' with related disk activity.
> >
> > What about %spin and %intr?
>
> 1. AMD GX-420CA SOC 4-core 4-thread
>
> CPU0 %spin from 2.0% to 17.0% %intr 30.0%-96.0%
> CPU1-3 %spin 0.0% (always) %intr 15.0%-99.0%
>
> 2. i7-2640m 2-core 4-thread
>
> CPU0 %spin from 0.0% to 3.0% %intr 0.0% (always)
> CPU2 %spin from 0.0% to 2.0% (rare) %intr 0.0% (always)

Interesting, so whatever that is, it seems related to or amplified by a
lot of time spent dealing with interrupts.

You can use "systat -s .3" and/or "vmstat -i" to figure out which
interrupt has a higher rate when you observe the symptoms.

If nobody has an idea of what that could be, another useful piece of
information would be a flamegraph produced while you observe the
stutters.  For that you need to enable dt(4) in conf/GENERIC, build &
install a new kernel, build & install btrace(8) and set kern.allowdt=1
in /etc/sysctl.conf.  After rebooting into the new kernel, run the
following:

# btrace -e 'profile:hz:15 { printf("%s1\n", kstack); }' > kstack.txt

and hit Ctrl+C to stop the profiling.

Then you can build the Flamegraph with the tools described below or
provide us with the captured stack traces:

	https://github.com/brendangregg/FlameGraph
Re: 6.6-current stutters after heavy disk loads
On 02/04/20(Thu) 13:59, Martin wrote:
> ‐‐‐ Original Message ‐‐‐
> On Thursday, April 2, 2020 1:21 PM, Martin Pieuchot wrote:
> > On 02/04/20(Thu) 12:58, Martin wrote:
> >
> > > ‐‐‐ Original Message ‐‐‐
> > > On Tuesday, March 31, 2020 3:27 PM, Martin Pieuchot m...@openbsd.org
> > > wrote:
> > >
> > > > On 31/03/20(Tue) 15:08, Martin wrote:
> > > >
> > > > > 1. top -SH -s .3 points me that stutters arrive once process
> > > > > changing its state from 'idle' to 'active' with related disk activity.
> > > >
> > > > What about %spin and %intr?
> > >
> > > 1. AMD GX-420CA SOC 4-core 4-thread
> > >
> > > CPU0 %spin from 2.0% to 17.0% %intr 30.0%-96.0%
> > > CPU1-3 %spin 0.0% (always) %intr 15.0%-99.0%
> > >
> > > 2. i7-2640m 2-core 4-thread
> > >
> > > CPU0 %spin from 0.0% to 3.0% %intr 0.0% (always)
> > > CPU2 %spin from 0.0% to 2.0% (rare) %intr 0.0% (always)
> >
> > Interesting so whatever that is it seems related or amplified by a lot
> > of time spent dealing with interrupt.
> >
> > You can use "systat -s .3" and/or "vmstat -i" to figure out which
> > interrupt has a higher rate when you observe the symptoms.
>
> 1. AMD SOC
> systat -s .3 seems interrupts too (stutters) when system wide stutter appears.
> Interrupts
> 500-1200 total
> 96-98 clock
> 155-350,sometimes up to 1100 ipi

A lot of IPIs!  We're making progress.  This rings a bell, I'd suggest
you look at my slides/talk from EuroBSDCon 2017 called "Your scheduler
is not the problem".  This might not be a similar problem, but it gives
a lot of insight into how to debug further.

Which application are you running to trigger those?  What is the
"background process" that you're talking about?  Did you ktrace(1) it?
What is it doing when you see the stutters?

The picture now seems to be clearer: something is causing a high number
of IPIs.  That creates latency, and all other tasks are somehow delayed,
resulting in some stuttering.

The question now becomes: why are so many IPIs being generated, and is
it possible to lower the insanely high rate?

Please make sure to do the ktracing first, that should give us the
userland view of the situation.  Then you could additionally do the
Flamegraph gathering, which should give us the kernel view of the
situation.

> > If nobody has a idea of what that could be, another useful information
> > would be to produce a flamegraph when you observe the stutters. For that
> > you need to enable dt(4) in conf/GENERIC build & install a new kernel,
> > build & install btrace(8) and set kern.allowdt=1 in /etc/sysctl.conf.
> > After rebooting in the new kernel run the following:
> >
> > btrace -e 'profile:hz:15 { printf("%s1\n", kstack); }' > kstack.txt
> >
> > and it Ctrl+C to stop the profiling.
> >
> > Then you can build the Flamegraph with the tools described below or
> > provide us the captured stack traces:
> > https://github.com/brendangregg/FlameGraph
Re: Fwd: Re: bird crashes kernel
On 02/04/20(Thu) 16:22, Bastien Durel wrote:
> Hello,
>
> Here is the initial report I made on misc@ about a kernel panic
> triggered by route removal by bird (bird-2.0.6 from ports)

This should be fixed in -current by a commit krw@ did back in November,
could you test a snapshot and see?

Cheers,
Martin
Re: Fwd: Re: bird crashes kernel
On 02/04/20(Thu) 18:30, Bastien Durel wrote:
> On Thursday 02 April 2020 at 17:15 +0200, Martin Pieuchot wrote:
> > On 02/04/20(Thu) 16:22, Bastien Durel wrote:
> > > Hello,
> > >
> > > Here is the initial report I made on misc@ about a kernel panic
> > > triggered by route removal by bird (bird-2.0.6 from ports)
> >
> > This should be fixed in -current by a commit krw@ did back in
> > November,
> > could you test a snapshot and see?
> >
> I prefer not to run my main router on a snapshot, but a VM crashes too
> when stopping bird with 6.6-stable, and indeed does not when running
> the last snapshot.

Thanks for confirming.

> But after stopping bird, the network is down (I cannot even ping the
> gateway, although dhcp works) -- and now the VM does no boot anymore :/

If you don't mind, please send a different mail with all the details
since that is not the same issue.
Re: 6.6-current stutters after heavy disk loads
On 02/04/20(Thu) 18:40, Martin wrote:
> Before starting the video 2017 bsdcon, disabled all the packages software on
> both AMD and i7 and run mpv player and test both machines.

What do you mean? Which software is running? What do you see in
"top -SH -s .3"?

> Shutters on both platforms happened when APM change low CPU frequency to
> high. Maybe it's an apmd issue?

No, it is not, it is just a symptom. Please let's stick to the original
question: which piece of software are you running when you see the
stutters? Is it mpv(1)? When running mpv(1) do you see high IPIs?
If so, did you ktrace(1) it?
Re: 6.6-current stutters after heavy disk loads
On 03/04/20(Fri) 09:40, Martin wrote:
> Hello, Martin.
> [...]
> When I run mpv and try to watch 720p video. In case of stutters after some
> time of watching audio flow desyncronized with video flow and mpv show video
> FPS/2 rate afterwards.
>
> Each time of stutter mpv increase 'Dropped' like
>
> A-V: 0.000 Dropped: 58++ Cahce: 1378s+154MB

Ok, so the piece of software is mpv(1).

> I did ktrace for mpv process. I run and see by 'kdump -H ktrace.out' that it
> has
> one process ID and / mostly one-three thread used.
> But sometimes (assuming in stutter times) it jumping against treads with
> different numbers.

Could you upload the output of kdump -H somewhere such that I could look
at it, or compress it and send it?

> Yes, IPI increased to 900-1000 when stutter appears.
>
> I'm going to disable step-by step each 'out of the box' software to determine
> the reason. Am I right doing this way?

I believe it isn't necessary. From what you are saying it seems that
mpv(1) alone is the piece of software exposing the issue.

There are multiple possible reasons for IPIs, but if the high rate you're
seeing is exposed by mpv(1), it would suggest they are related to
scheduling.

By looking at the output of ktrace(1) we should have a better
understanding of what is happening in userland. I'd suggest you also do
the btrace(8) profiling so we can see which code path in the kernel is
responsible for the IPIs.

These should allow us to work with facts and not guesses.

Thanks for your efforts,
Martin
Re: [macppc] GENERIC.MP panics under high load
On 06/04/20(Mon) 16:54, Charlene Wendling wrote:
> On Wed, 1 Apr 2020 20:27:54 +0200
> Charlene Wendling wrote:
>
> I've got another one, still with:
>
> OpenBSD 6.6-current (GENERIC.MP) #692: Sat Mar 21 10:19:57 MDT 2020
> dera...@macppc.openbsd.org:/usr/src/sys/arch/macppc/compile/GENERIC.MP

Trying a WITNESS kernel might give us more information. That would
require somebody implementing stacktrace_save() for powerpc.
Re: arpresolve: XX: route contains no arp information
Thanks for your report.

On 07/04/20(Tue) 16:04, Laurent Salle wrote:
> On 06/04/2020 14.36, Laurent Salle wrote:
> > If you wish, I may do some more test the next time the problem occurs.
>
> I've done more tests.
>
> This time, I've noticed the following message on the console "arpresolve:
> unresolved and rt_expire == 0"

It's the same bug as reported by sthen@. Two interfaces in the same
subnet have two identical cloning routes:

> 192.168.1/24 192.168.1.4UCn1 887 - 4 em0
> 192.168.1/24 192.168.1.18 UCn1 47 - 8 iwm0

ARP entries are "cloned" from one of these two. It should be only one
at a time, obviously the one with higher priority.

Then I believe dhclient(8) inserts default routes for both interfaces:

> default192.168.1.254 UGS526091 - 8 em0
> default192.168.1.254 UGS0 1671 -12 iwm0

One of these routes, the one with higher priority, is picked when sending
packets to "8.8.8.8". Now the kernel needs to find the ARP entry
corresponding to "192.168.1.254":

> 192.168.1.254 f4:ca:e5:55:0d:2d UHLc 0 1108 - 3 em0
> 192.168.1.254 link#1 UHLch 1 1468 - 7 iwm0

First question is why one entry is cached ('h') when the other isn't.
Both should be.

Second question is why the entry on iwm0 is returned when the query is
done on em0.

Answering those questions should be enough to fix the bug :o)
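
The "higher priority wins" rule mentioned above can be sketched in a few lines of C. This is only a toy model, not kernel code: struct toy_route and toy_route_pick() are made-up names, and it assumes that a lower numeric priority value means a more preferred route, which is how the Prio column of the quoted table (apparently 8 for em0 and 12 for iwm0) would make em0 the winner.

```c
#include <stddef.h>

/* Toy stand-in for a routing entry; not the kernel's struct rtentry. */
struct toy_route {
	const char	*ifname;	/* outgoing interface, e.g. "em0" */
	unsigned int	 prio;		/* numeric priority, lower is preferred */
};

/*
 * Return the candidate with the lowest numeric priority, or NULL if
 * there are no candidates.  On a tie the earliest entry is kept.
 */
const struct toy_route *
toy_route_pick(const struct toy_route *routes, size_t n)
{
	const struct toy_route *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (best == NULL || routes[i].prio < best->prio)
			best = &routes[i];
	}
	return best;
}
```

Fed the two default routes from the report, the sketch returns the em0 entry. The bug under discussion is that the ARP lookup underneath ended up on iwm0 anyway.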
Re: Crash while using ospfd over vxlan
On 09/04/20(Thu) 16:10, Massimiliano Stucchi wrote:
> >Synopsis: Crash while using ospfd over vxlan
> >Category: bug
> >Environment:
> System : OpenBSD 6.6
> Details : OpenBSD 6.6 (GENERIC.MP) #5: Sun Feb 16 01:56:11 MST 2020
>
> r...@syspatch-66-amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
>
> Architecture: OpenBSD.amd64
> Machine : amd64
> >Description:
> Setting up an OSPF session over VXLAN leads to a kernel crash
> >How-To-Repeat:
>
> I have setup an ospf session over a vxlan interface. When this is up,
> it takes about 2-3 minutes for the crash to consistently happen.
>
> No other action is necessary.
>
> At this address:
>
> https://max.stucchi.ch/bugreport/
>
> you can find screenshots from the ddb prompt, including a full trace.
>
> If needed, I can also provide access to the console.

It's a recursion. I don't know anything about vxlan(4) or how the
encapsulation works but the following happens at least 10 times:

	...
	vxlan_lookup()
	udp_input()
	ip_deliver()
	ip_ours()
	ip_input_if()
	ipv4_input()
	ether_input()
	if_vinput()
	vxlan_lookup()
	...

Maybe you can share your setup (vxlan config, ospf config, etc) so
somebody can try to reproduce and fix it.
Re: [pc engines apu1d4] kernel crash periodically
Hello Pascal,

On 09/04/20(Thu) 16:10, Pascal Cabaud wrote:
> > Synopsis: Once a day, my APU1D4 used to crash
> > Category: kernel
> > Environment:
> System : OpenBSD 6.6
> Details : OpenBSD 6.6 (GENERIC.MP) #372: Sat Oct 12 10:56:27 MDT
> 2019
>
> dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
>
> Architecture: OpenBSD.amd64
> Machine : amd64
> > Description:
> Periodically, my AP1D4 (pcengines.ch) used to crash. Here, the last
> crash :

Thanks for the report. This looks to me like a memory corruption in
the kernel. This is hard to understand because the panic(9) only shows
the symptom, not the actual bug.

I see in the ps output below that you're running quite a few different
pieces of software (relayd, collectd, etc). Could you try disabling one
for some time and see if the crash disappears? This would be an
indication that that particular piece of software exposes the bug.

If the crash doesn't disappear with that particular piece of software
disabled, you can enable it again and disable the next one :o)

Thanks!
> Stopped at pool_gc_pages+0x67: movq0x10(%rax),%r11 > ddb{0}> show panic > kernel page fault > uvm_fault(0x81f508f8, 0x9cff81d46840, 0, 1) -> e > pool_gc_pages(0) at pool_gc_pages+0x67 > end trace frame: 0x800022043c20, count: 0 > ddb{0}> trace > pool_gc_pages(0) at pool_gc_pages+0x67 > taskq_thread(81f1f948) at taskq_thread+0x4d > end trace frame: 0x0, count: -2 > ddb{0}> mach ddbcpu 1 > Stopped at x86_ipi_db+0x12:leave > ddb{1}> show panic > kernel page fault > uvm_fault(0x81f508f8, 0x9cff81d46840, 0, 1) -> e > x86_ipi_db(800022000ff0) at x86_ipi_db+0x12 > end trace frame: 0x80002204f3e0, count: 0 > ddb{1}> trace > x86_ipi_db(800022000ff0) at x86_ipi_db+0x12 > x86_ipi_handler() at x86_ipi_handler+0x80 > Xresume_lapic_ipi(0,0,1388,0,8007b4e0,8000220016f8) at > Xresume_lapi > c_ipi+0x23 > acpicpu_idle() at acpicpu_idle+0x14d > sched_idle(800022000ff0) at sched_idle+0x225 > end trace frame: 0x0, count: -5 > ddb{1}> ps >PID TID PPIDUID S FLAGS WAIT COMMAND > 95276 13988 1 0 30x80 kqreadrelayd > 26963 27946 1 89 30x100092 kqreadrelayd > 55989 95900 1 89 30x100092 kqreadrelayd > 9916 26805 1 89 30x100092 kqreadrelayd > 95143 161715 1 89 30x100092 kqreadrelayd > 40331 479376 1 89 30x100092 kqreadrelayd > 30572 497829 1 89 30x100092 kqreadrelayd > 66569 297122 1 89 30x100092 kqreadrelayd > 581746892 1 89 30x100092 kqreadrelayd > 9074 397819 1 89 30x100092 kqreadrelayd > 3400 372150 1 89 30x100092 kqreadrelayd > 20384 188200 1 89 30x100092 kqreadrelayd > 35091 435040 1 89 30x100092 kqreadrelayd > 61458 99434 70149 1000 30x100083 ttyin more > 70149 217271 94483 1000 30x100083 wait man > 94483 387711 6290 1000 30x10008b pause ksh > 70627 315246 76360 0 30x100083 piperdgrep > 34778 359396 76360 0 30x100083 kqreadtail > 76360 498368 91100 0 30x10008b pause ksh > 91100 335621 6290 1000 30x10008b pause ksh > 36467 185961 75892 0 30x100083 ttyin ksh > 75892 445733 6290 1000 30x10008b pause ksh > 44772 43154 53744 0 30x100083 ttyin ksh > 53744 87181 6290 1000 30x10008b pause ksh 
> 6290 420923 1 1000 30x100080 kqreadtmux > 32341 184709 6 1000 30x100083 kqreadtmux > 6 387915 87108 1000 30x10008b pause ksh > 87108 302555 62336 1000 30x90 selectsshd > 62336 287615 11348 0 30x92 poll sshd > 7752 285805 99541 83 30x100092 poll ntpd > 99541 162749 99451 83 30x100092 poll ntpd > 99451 435657 1 0 30x100080 poll ntpd > 93391 41680 60584 53 30x90 kqreadunbound > 60584 394085 1 53 30x90 kqreadunbound > 44674 492072 13228 1000 30x1000b3 poll ping > 759728907 1750 30x81 nanosleep perl > 85218 71844 0 0 3 0x14200 bored sosplice > 13228 249507 1 1000 30x10008b pause ksh > 6956 109690 1 0 30x100098 poll cron > 11885 194741 1 0 30x80 nanosleep collectd > 11885 75930 1 0 3 0x480 fsleepcollectd > 11885 472585 1 0 3 0x40
Re: arpresolve: XX: route contains no arp information
On 09/04/20(Thu) 20:22, Laurent Salle wrote: > On 08/04/2020 06.52, Martin Pieuchot wrote: > > > It's the same bug as reported by sthen@. Two interfaces in the same subnet > > have two identical cloning routes: > > I've been able to reproduce systematically the problem with an OpenBSD > virtual machine running the latest snapshot and two vio interface with > different priority connected to the same lan with dhcp. Thanks for the report! Diff below seems to fix the issue here, could you try it? Index: netinet/if_ether.c === RCS file: /cvs/src/sys/netinet/if_ether.c,v retrieving revision 1.242 diff -u -p -r1.242 if_ether.c --- netinet/if_ether.c 7 Nov 2019 11:23:23 - 1.242 +++ netinet/if_ether.c 10 Apr 2020 08:45:42 - @@ -559,6 +559,23 @@ in_arpinput(struct ifnet *ifp, struct mb KERNEL_LOCK(); error = arpcache(ifp, ea, rt); + if (error == 0 && ISSET(rt->rt_flags, RTF_CACHED)) { + /* +* RTF_CACHED entry are not deleted as long as +* their parent gateway route is alive, so make +* sure to update its sibling which might be on +* a different interface to not leave them as +* unresolved. +*/ + while ((rt = rtable_iterate(rt)) != NULL) { + struct ifnet *ifp0; + + ifp0 = if_get(rt->rt_ifidx); + if (ifp0 != NULL) + error = arpcache(ifp0, ea, rt); + if_put(ifp0); + } + } KERNEL_UNLOCK(); if (error) goto out;
Re: 6.6-current stutters after heavy disk loads
On 10/04/20(Fri) 09:42, Martin wrote:
> Have you found anything regarding the issue?

No, I haven't.

> Now I have time to add dt(4) in conf/GENERIC build & install a new kernel,
> build & install btrace(8) and set kern.allowdt=1 in /etc/sysctl.conf.
>
> Looks like dt(4) is a part of -current, but I can't move to -current right
> now. I'm going to do it once 6.7 is released.

Thanks,
Martin
Re: arpresolve: XX: route contains no arp information
On 10/04/20(Fri) 11:18, Claudio Jeker wrote:
> On Fri, Apr 10, 2020 at 10:47:53AM +0200, Martin Pieuchot wrote:
> > On 09/04/20(Thu) 20:22, Laurent Salle wrote:
> > > On 08/04/2020 06.52, Martin Pieuchot wrote:
> > >
> > > > It's the same bug as reported by sthen@. Two interfaces in the same
> > > > subnet
> > > > have two identical cloning routes:
> > >
> > > I've been able to reproduce systematically the problem with an OpenBSD
> > > virtual machine running the latest snapshot and two vio interface with
> > > different priority connected to the same lan with dhcp.
> >
> > Thanks for the report! Diff below seems to fix the issue here, could
> > you try it?
>
> I'm not convinced that this is the right solution. In your diff you insert
> the MAC received on one interface into the arp node of another interface.
> This feels wrong, arp entries should never cross over interfaces.
> For example if for some reasons the two interfaces have the same gateway
> IP but use different MACs for that IP then this breaks.

Makes sense.

Well, it looks like when the default route on if0 tries to use the L2
route underneath it, the ARP layer resolves the entry on if1 instead of
on if0.

The route on if0 is being used because it has higher priority, however
the L2 entry on if1 has been inserted first. I haven't debugged
further.
Re: arpresolve: XX: route contains no arp information
On 10/04/20(Fri) 13:19, Claudio Jeker wrote: > On Fri, Apr 10, 2020 at 12:14:17PM +0200, Martin Pieuchot wrote: > > On 10/04/20(Fri) 11:18, Claudio Jeker wrote: > > > On Fri, Apr 10, 2020 at 10:47:53AM +0200, Martin Pieuchot wrote: > > > > On 09/04/20(Thu) 20:22, Laurent Salle wrote: > > > > > On 08/04/2020 06.52, Martin Pieuchot wrote: > > > > > > > > > > > It's the same bug as reported by sthen@. Two interfaces in the > > > > > > same subnet > > > > > > have two identical cloning routes: > > > > > > > > > > I've been able to reproduce systematically the problem with an OpenBSD > > > > > virtual machine running the latest snapshot and two vio interface with > > > > > different priority connected to the same lan with dhcp. > > > > > > > > Thanks for the report! Diff below seems to fix the issue here, could > > > > you try it? > > > > > > I'm not convinced that this is the right solution. In your diff you insert > > > the MAC received on one interface into the arp node of another interface. > > > This feels wrong, arp entries should never cross over interfaces. > > > For example if for some reasons the two interfaces have the same gateway > > > IP but use different MACs for that IP then this breaks. > > > > Makes sense. > > > > Well it looks like when the default route on if0 tries to use the L2 > > route underneath it, the ARP layer resolve the entry on if1 instead of > > on if0. > > > > The route on if0 is being used because it has higher priority, however > > the L2 entry on if1 has been inserted first. I haven't debugged > > further. > > Yes, this comes from the fact that rtalloc() will find the gw route of the > wrong interface and not clone a new entry from the other interface and so > the rt_gwroute cache is all messed up. Do you know which particular rtalloc(9) we're talking about?
Re: [pc engines apu1d4] kernel crash periodically
On 10/04/20(Fri) 12:18, Pascal Cabaud wrote:
> Hello Martin,
>
> Thanks for searching bits in this bug report. Are the APU really popular
> to run OpenBSD or is there a problem with them: AFAICS, i've found many
> reports in archives...

The problem is unlikely to be related to the hardware.

> On 2020-04-10 09:58, Martin Pieuchot said:
> > Thanks for the report. This looks to me like a memory corruption in
> > the kernel. This is hard to understand because the panic(9) only show
> > the symptom, not the actual bug.
>
> Ok, I'm recording console with GNU Screen. Let's wait...
>
> To play with daemons, it'll be more difficult, i've to find backup
> hardware first.

I understand, however I wouldn't be surprised if the issue is exposed by
one of the daemons doing a lot of stuff and dealing with network, relayd
or collectd maybe.
Re: splassert w/ add/del vlan on bridge
On 11/04/20(Sat) 23:09, David Gwynne wrote:
> On Sat, Apr 11, 2020 at 03:21:49AM +, Visa Hankala wrote:
> > On Fri, Apr 10, 2020 at 01:30:47PM -0600, Theo de Raadt wrote:
> > > Why did it take almost a year to find this?
> > >
> > > Or is this bug due to ioctl(2) becoming UNLOCKED on 2020/02/22?
> >
> > This is not related to ioctl(2) becoming UNLOCKED. Lower-layer ioctl
> > code, soo_ioctl() included, lock the kernel when needed. However, most
> > .if_ioctl backends need NET_LOCK() in addition to KERNEL_LOCK(). In
> > most cases, that is satisfied by ifioctl() which acquires the lock
> > before invoking .if_ioctl(). bridge_ioctl() nullifies this by
> > releasing NET_LOCK().
>
> yes.
>
> i came up with the following diff before i read the thread here. it's
> largely identical to what you (visa) already came up with, but it adds
> some extra checks to ifpromisc based on the doco in around struct ifnet
> members in src/sys/net/if_var.h. i audited the rest of the ifpromisc
> calls and found another one in if_aggr that i was able to trigger.

The documentation says `if_pcount' is protected by the KERNEL_LOCK() but
in fact it is only read & modified in ifpromisc(). So I'd suggest
fixing the documentation and not adding another assert there.

> i think the only other call to ifpromisc outside src/sys/net is in carp,
> and i managed to convinced myself that all those calls hold NET_LOCK
> already.
> > Index: if.c > === > RCS file: /cvs/src/sys/net/if.c,v > retrieving revision 1.601 > diff -u -p -r1.601 if.c > --- if.c 10 Mar 2020 09:11:55 - 1.601 > +++ if.c 11 Apr 2020 13:08:46 - > @@ -3031,7 +3031,9 @@ ifpromisc(struct ifnet *ifp, int pswitch > unsigned short oif_flags; > int oif_pcount, error; > > + NET_ASSERT_LOCKED(); /* modifying if_flags */ > oif_flags = ifp->if_flags; > + KERNEL_ASSERT_LOCKED(); /* modifying if_pcount */ > oif_pcount = ifp->if_pcount; > if (pswitch) { > if (ifp->if_pcount++ != 0) > Index: if_aggr.c > === > RCS file: /cvs/src/sys/net/if_aggr.c,v > retrieving revision 1.28 > diff -u -p -r1.28 if_aggr.c > --- if_aggr.c 11 Mar 2020 07:01:42 - 1.28 > +++ if_aggr.c 11 Apr 2020 13:08:46 - > @@ -589,8 +589,10 @@ aggr_clone_destroy(struct ifnet *ifp) > if_detach(ifp); > > /* last ref, no need to lock. aggr_p_dtor locks anyway */ > + NET_LOCK(); > while ((p = TAILQ_FIRST(&sc->sc_ports)) != NULL) > aggr_p_dtor(sc, p, "destroy"); > + NET_UNLOCK(); > > free(sc, M_DEVBUF, sizeof(*sc)); > > Index: if_bridge.c > === > RCS file: /cvs/src/sys/net/if_bridge.c,v > retrieving revision 1.338 > diff -u -p -r1.338 if_bridge.c > --- if_bridge.c 6 Nov 2019 03:51:26 - 1.338 > +++ if_bridge.c 11 Apr 2020 13:08:46 - > @@ -313,7 +313,9 @@ bridge_ioctl(struct ifnet *ifp, u_long c > break; > } > > + NET_LOCK(); > error = ifpromisc(ifs, 1); > + NET_UNLOCK(); > if (error != 0) { > free(bif, M_DEVBUF, sizeof(*bif)); > break; > @@ -558,7 +560,9 @@ bridge_ifremove(struct bridge_iflist *bi > } > > bif->ifp->if_bridgeidx = 0; > + NET_LOCK(); > error = ifpromisc(bif->ifp, 0); > + NET_UNLOCK(); > > bridge_rtdelete(sc, bif->ifp, 0); > bridge_flushrule(bif); > Index: if_tpmr.c > === > RCS file: /cvs/src/sys/net/if_tpmr.c,v > retrieving revision 1.9 > diff -u -p -r1.9 if_tpmr.c > --- if_tpmr.c 11 Apr 2020 11:01:03 - 1.9 > +++ if_tpmr.c 11 Apr 2020 13:08:46 - > @@ -201,12 +201,14 @@ tpmr_clone_destroy(struct ifnet *ifp) > > if_detach(ifp); > > + NET_LOCK(); > for (i = 0; i < 
nitems(sc->sc_ports); i++) { > struct tpmr_port *p = SMR_PTR_GET_LOCKED(&sc->sc_ports[i]); > if (p == NULL) > continue; > tpmr_p_dtor(sc, p, "destroy"); > } > + NET_UNLOCK(); > > free(sc, M_DEVBUF, sizeof(*sc)); > >
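
Why the locking around ifpromisc() matters can be seen from a simplified userland model of its counting logic. This is a sketch under assumptions, not the kernel function: struct toy_ifnet, TOY_IFF_PROMISC and toy_ifpromisc() are hypothetical names, and only the increment half (`if_pcount++ != 0`) is visible in the quoted diff; the decrement half is assumed to mirror it.

```c
#define TOY_IFF_PROMISC	0x100	/* illustrative flag bit, not the real value */

/* Toy stand-in for the two struct ifnet members discussed above. */
struct toy_ifnet {
	unsigned int	flags;
	int		pcount;	/* number of promiscuous-mode requesters */
};

/*
 * Reference-counted promiscuous switch.  Returns 1 when the flag
 * itself changed (first requester in, last requester out) and 0 when
 * only the counter moved.
 */
int
toy_ifpromisc(struct toy_ifnet *ifp, int pswitch)
{
	if (pswitch) {
		if (ifp->pcount++ != 0)
			return 0;	/* already promiscuous */
		ifp->flags |= TOY_IFF_PROMISC;
	} else {
		if (--ifp->pcount > 0)
			return 0;	/* someone else still needs it */
		ifp->flags &= ~TOY_IFF_PROMISC;
	}
	return 1;
}
```

Both the counter and the flag word are plain read-modify-write updates, so two callers running this concurrently without a common lock could lose an update. That is the property the asserts and the added NET_LOCK()/NET_UNLOCK() pairs in the diff are about.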
Re: i915/drm vs WITNESS
On 26/02/20(Wed) 17:39, Mark Kettenis wrote: > > Date: Wed, 12 Feb 2020 15:24:46 +0100 > > From: Martin Pieuchot > > Haven't forgotten about these. The following are still present on 6.7-beta built from today's sources: witness: lock order reversal: 1st 0x8136a880 &rq->lock (&rq->lock) 2nd 0x806a9050 rcs0 (&timeline->lock) lock order "&timeline->lock"(mutex) -> "&rq->lock"(mutex) first seen at: #0 witness_checkorder+0x449 #1 mtx_enter+0x34 #2 __i915_request_submit+0x5b #3 __execlists_submission_tasklet+0x1b9 #4 execlists_submit_request+0x1d1 #5 submit_notify+0x37 #6 __i915_sw_fence_complete+0x40 #7 i915_request_add+0x2d3 #8 i915_gem_init+0x2b9 #9 i915_driver_load+0x815 #10 inteldrm_attachhook+0x2c #11 config_process_deferred_mountroot+0x6b #12 main+0x75a lock order "&rq->lock"(mutex) -> "&timeline->lock"(mutex) first seen at: #0 witness_checkorder+0x449 #1 mtx_enter+0x34 #2 execlists_submit_request+0x2a #3 submit_notify+0x37 #4 __i915_sw_fence_complete+0x40 #5 dma_i915_sw_fence_wake+0x1d #6 notify_ring+0x1a8 #7 gen8_gt_irq_handler+0xba #8 gen8_irq_handler+0x114 #9 intr_handler+0x6e #10 Xintr_ioapic_edge16_untramp+0x19f #11 acpicpu_idle+0x1d2 #12 sched_idle+0x225 #13 proc_trampoline+0x1c witness: lock order reversal: 1st 0x8136b150 &wqh->lock (&wqh->lock) 2nd 0x806a9050 rcs0 (&timeline->lock) lock order "&wqh->lock"(mutex) -> "&timeline->lock"(mutex) first seen at: #0 witness_checkorder+0x449 #1 mtx_enter+0x34 #2 execlists_submit_request+0x2a #3 submit_notify+0x37 #4 __i915_sw_fence_complete+0x40 #5 i915_sw_fence_wake+0x39 #6 __i915_sw_fence_complete+0x131 #7 dma_i915_sw_fence_wake+0x1d #8 notify_ring+0x1a8 #9 gen8_gt_irq_handler+0xba #10 gen8_irq_handler+0x114 #11 intr_handler+0x6e #12 Xintr_ioapic_edge16_untramp+0x19f witness: acquiring duplicate lock of same type: "&wqh->lock" 1st &wqh->lock 2nd &wqh->lock Starting stack trace... 
witness_checkorder(8136bc30,9,0) at witness_checkorder+0x6ba mtx_enter(8136bc20) at mtx_enter+0x34 __i915_sw_fence_complete(8136bc20,800033a6fc70) at __i915_sw_fence_complete+0x58 i915_sw_fence_wake(8136bc78,1,0,800033a6fc70) at i915_sw_fence_wake+0x39 __i915_sw_fence_complete(8136b140,0) at __i915_sw_fence_complete+0x131 dma_i915_sw_fence_wake(8136a008,8137f420) at dma_i915_sw_fence_wake+0x1d notify_ring(80a84000) at notify_ring+0x1a8 gen8_gt_irq_handler(80154000,2,800033a6fdb0) at gen8_gt_irq_handler+0xba gen8_irq_handler(0,80154078) at gen8_irq_handler+0x114 intr_handler(800033a6fe50,80144d80) at intr_handler+0x6e Xintr_ioapic_edge16_untramp() at Xintr_ioapic_edge16_untramp+0x19f end of kernel end trace frame: 0x7f7def90, count: 246 End of stack trace. witness: lock order reversal: 1st 0x8136bb88 &rq->lock (&rq->lock) 2nd 0x80a84050 bcs0 (&timeline->lock) lock order "&timeline->lock"(mutex) -> "&rq->lock"(mutex) first seen at: #0 witness_checkorder+0x449 #1 mtx_enter+0x34 #2 __i915_request_submit+0x5b #3 __execlists_submission_tasklet+0x1b9 #4 execlists_submit_request+0x1d1 #5 submit_notify+0x37 #6 __i915_sw_fence_complete+0x40 #7 i915_request_add+0x2d3 #8 i915_gem_init+0x2cb #9 i915_driver_load+0x815 #10 inteldrm_attachhook+0x2c #11 config_process_deferred_mountroot+0x6b #12 main+0x75a lock order "&rq->lock"(mutex) -> "&timeline->lock"(mutex) first seen at: #0 witness_checkorder+0x449 #1 mtx_enter+0x34 #2 execlists_submit_request+0x2a #3 submit_notify+0x37 #4 __i915_sw_fence_complete+0x40 #5 dma_i915_sw_fence_wake+0x1d #6 notify_ring+0x1a8 #7 gen8_gt_irq_handler+0x55 #8 gen8_irq_handler+0x114 #9 intr_handler+0x6e #10 Xintr_ioapic_edge16_untramp+0x19f #11 Xspllower+0x19 #12 mtx_enter_try+0x98 #13 mtx_enter+0x4a #14 i915_vma_move_to_active+0x427 #15 i915_gem_do_execbuffer+0xb09 #16 i915_gem_execbuffer2_ioctl+0x144 #17 drmioctl+0xdc #18 VOP_IOCTL+0x55
Re: Intermittent crashes on 6.5-stable with PC Engines APU2D4
On 14/10/19(Mon) 16:17, Alexander Bluhm wrote: > On Fri, Oct 11, 2019 at 01:19:02PM +, L??vai, D??niel wrote: > > uvm_fault(0xfd8124d90960, 0x7f884cecdcf8, 0, 2) -> e ^^ Do I understand correctly that the faulting page is 0x7f884cecd000? PTE_BASE corresponds to 0x7f80, the VA in the fault above should be 0x84cecdcf8000, in bluhm@'s report 0x27ea48908000. Both reports involve multi-threaded programs. Alexander what is the CPU of the machine where you can reproduce the bug? Are we trying to understand how a page storing PTEs can generate a fault? Is it what the traces say or am I completely on a wrong track? > > kernel: page fault trap, code=0 > > Stopped at pmap_page_remove+0x210: xchgq %rax,0(%rcx,%rdx,1) > > > ddb{3}> trace > > pmap_page_remove(fd800975d480) at pmap_page_remove+0x210 > > uvm_anfree(fd8125d62b10) at uvm_anfree+0x36 > > amap_wipeout(fd8123d95170) at amap_wipeout+0xe5 > > uvm_unmap_detach(800022420fe8,0) at uvm_unmap_detach+0x90 > > sys_munmap(800022233cb8,800022421060,8000224210d0) at > > sys_munmap+0x11d > > syscall(800022421140) at syscall+0x305 > > Xsyscall(6,49,109a8d931e10,49,109a58e72150,1099d9b9f000) at Xsyscall+0x128 > > end of kernel > > end trace frame: 0x109a82dffa50, count: -7 > > I see this bug for a while now. > > https://marc.info/?l=openbsd-bugs&m=156399483018833&w=2 > > I can trigger it by running /usr/src/regress/lib/libpthread/malloc_duel > for some hours. Moritz Buhl has tried to bisect the problem and > it appears to exists since January 2019. But it is hard to be sure > as reproducing takes a while. It is also unclear whether the change > in behavior is caused by compiler, kernel, libc, libpthread or > malloc_duel. We could not trigger it with OpenBSD 6.4. > > bluhm >
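
The address arithmetic behind the questions above can be checked with a small sketch. Everything here is an assumption made for illustration: 4 KiB pages, 8-byte PTEs, a recursive page-table layout in which the PTE for a virtual address va lives at PTE_BASE + (va >> 12) * 8, and a PTE_BASE whose low 48 bits are 0x7f8000000000. The addresses are used as they appear in this archive, which seems to have dropped leading digits. The toy_* names are made up.

```c
#include <stdint.h>

#define TOY_PAGE_SHIFT	12	/* assumed 4 KiB pages */
#define TOY_PAGE_MASK	((UINT64_C(1) << TOY_PAGE_SHIFT) - 1)
#define TOY_PTE_SIZE	8	/* assumed size of one page table entry */

/* Round an address down to the start of its page. */
uint64_t
toy_page_base(uint64_t addr)
{
	return addr & ~TOY_PAGE_MASK;
}

/*
 * Under the recursive-mapping assumption, turn the address of a PTE
 * back into the page-aligned virtual address that the PTE maps.
 */
uint64_t
toy_pte_to_va(uint64_t pte_addr, uint64_t pte_base)
{
	return ((pte_addr - pte_base) / TOY_PTE_SIZE) << TOY_PAGE_SHIFT;
}
```

With these assumptions toy_page_base(0x7f884cecdcf8) gives 0x7f884cecd000, the page named in the first question, and toy_pte_to_va(0x7f884cecdcf8, 0x7f8000000000) gives 0x1099d9b9f000, which is the same value as the last argument in the quoted Xsyscall frame.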
Re: panic: kernel diagnostic assertion "!ISSET(rt->rt_flags, RTF_LOCAL)" failed
Thanks for the report.

On 18/04/20(Sat) 18:17, Julian Brost wrote:
> I encountered a reproducible kernel panic during an accidental IPv6
> misconfiguration. In order to reproduce, the OpenBSD machine must be in
> the same subnet as a router that has fe80::1/64 configured and sends
> IPv6 route advertisements, for example with radvd using this config:
>
> interface eth0 {
>     AdvSendAdvert on;
>     MinRtrAdvInterval 10;
>     MaxRtrAdvInterval 30;
>     prefix 2001:db8::/64 {
>         AdvOnLink on;
>         AdvAutonomous on;
>         AdvRouterAddr on;
>     };
> };
>
> With this setup, I was able to to reliably trigger the assertion using
> the following steps:
>
> - Install Openbsd using 6.6/amd64 install66.iso
> - IPv4: none
> - IPv6: autoconf
> - Reboot into system, log in
> - echo inet6 alias fe80::1 64 >> /etc/hostname.vio0
> # The file now contains the following:
> # inet6 autoconf
> # inet6 alias fe80::1 64
> - Reboot and log in again
> - ping6 2001::
> # The exact address doesn't seem to matter, it also doesn't have to
> # respond or anything. Sometimes this step isn't even necessary as the
> # panic occurs by itself after the login prompt.
> - Wait a bit (less than a minute in my case) and observe the panic
> [...]
> vio0: DAD detected duplicate IPv6 address fe80:1::1: NS in/out=0/1, NA in=1
> vio0: DAD complete for fe80:1::1 - duplicate found
> vio0: manual intervention required

Interesting :)

> login: panic: kernel diagnostic assertion "!ISSET(rt->rt_flags,
> RTF_LOCAL)" failed: file "/usr/src/sys/netinet6/nd6.c", line 727

That means some part of the ND code is incorrectly setting an `expire'
value on an entry that is local, and therefore should never expire.

Could you try to reproduce the issue with the diff below? It should
also panic but point us to the place where the bug is.
Index: netinet6/nd6.c === RCS file: /cvs/src/sys/netinet6/nd6.c,v retrieving revision 1.229 diff -u -p -r1.229 nd6.c --- netinet6/nd6.c 29 Nov 2019 16:41:01 - 1.229 +++ netinet6/nd6.c 20 Apr 2020 10:07:15 - @@ -306,6 +306,7 @@ nd6_llinfo_settimer(struct llinfo_nd6 *l time_t expire = time_uptime + secs; NET_ASSERT_LOCKED(); + KASSERT(!ISSET(ln->ln_rt->rt_flags, RTF_LOCAL)); ln->ln_rt->rt_expire = expire; if (!timeout_pending(&nd6_timer_to) || expire < nd6_timer_next) {
Re: panic: kernel diagnostic assertion "!ISSET(rt->rt_flags, RTF_LOCAL)" failed
On 20/04/20(Mon) 14:27, Julian Brost wrote:
> On 2020-04-20 12:14, Martin Pieuchot wrote:
> >> login: panic: kernel diagnostic assertion "!ISSET(rt->rt_flags,
> >> RTF_LOCAL)" failed: file "/usr/src/sys/netinet6/nd6.c", line 727
> >
> > That means some part of the ND code is incorrectly setting an `expire'
> > value to an entry that is local, and therefor should never expire.
> >
> > Could you try to reproduce the issue with the diff below? It should
> > also panic but points us to the place where the bug is.
> >
> > [...]
> With the diff applied, this is the panic message:
>
> starting network
> vio0: DAD detected duplicate IPv6 address fe80:1::1: NS in/out=0/1, NA in=1
> vio0: DAD complete for fe80:1::1 - duplicate found
> vio0: manual intervention required
> reordering libraries:ndp info overwritten for fe80:1::1 by
> 76:fa:d3:57:ec:56 on vio0
> panic: kernel diagnostic assertion "!ISSET(ln->ln_rt->rt_flags,
> RTF_LOCAL)" failed: file "/usr/src/sys/netinet6/nd6.c", line 309
> Stopped at db_enter+0x10: popq%rbp
>
> TID PID UID PRFLAGS PFLAGS CPU COMMAND
> *457148 43436 0 0x14000 0x2000 softnet
> db_enter() at db_enter+0x10
> panic() at panic+0x128
> __assert(81c8d6ea,81c94c17,135,81c9fb69) at
> __assert+0x
> 2b
>
> nd6_llinfo_settimer(fd803ec6ff00,15180) at nd6_llinfo_settimer+0xdf
> nd6_cache_lladdr(800972a8,800014a496a0,fd803714f874,8,86,0)
> at n
> d6_cache_lladdr+0x2be
>
> nd6_rtr_cache(fd8036ee3000,28,38,86) at nd6_rtr_cache+0x31e
> icmp6_input(800014a499d8,800014a499e4,3a,18) at icmp6_input+0x33d
> ip_deliver(800014a499d8,800014a499e4,3a,18) at ip_deliver+0x1b3
> ip6_input_if(800014a499d8,800014a499e4,29,0,800972a8) at

Thanks, diff below fixes nd6_rtr_cache(). It was already skipping
static entries; the same should be done for local entries.

I left the panic in there in case there's another place where the bug
can be triggered.
Index: netinet6/nd6.c === RCS file: /cvs/src/sys/netinet6/nd6.c,v retrieving revision 1.229 diff -u -p -r1.229 nd6.c --- netinet6/nd6.c 29 Nov 2019 16:41:01 - 1.229 +++ netinet6/nd6.c 21 Apr 2020 08:36:01 - @@ -306,6 +306,7 @@ nd6_llinfo_settimer(struct llinfo_nd6 *l time_t expire = time_uptime + secs; NET_ASSERT_LOCKED(); + KASSERT(!ISSET(ln->ln_rt->rt_flags, RTF_LOCAL)); ln->ln_rt->rt_expire = expire; if (!timeout_pending(&nd6_timer_to) || expire < nd6_timer_next) { @@ -,17 +1112,11 @@ nd6_cache_lladdr(struct ifnet *ifp, stru rt = nd6_lookup(from, 0, ifp, ifp->if_rdomain); if (rt == NULL) { -#if 0 - /* nothing must be done if there's no lladdr */ - if (!lladdr || !lladdrlen) - return NULL; -#endif - rt = nd6_lookup(from, 1, ifp, ifp->if_rdomain); is_newentry = 1; } else { - /* do nothing if static ndp is set */ - if (rt->rt_flags & RTF_STATIC) { + /* do not overwrite local or static entry */ + if (ISSET(rt->rt_flags, RTF_STATIC|RTF_LOCAL)) { rtfree(rt); return; }
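
The changed condition in the diff above relies on how ISSET() behaves with a combined mask: it is true as soon as any one of the listed bits is set. A minimal sketch, assuming the classic BSD definition of the macro and using made-up flag values (the TOY_RTF_* names and toy_may_overwrite() are hypothetical, the real flags are defined by the kernel headers):

```c
/* Classic BSD-style test: non-zero if any bit of f is set in t. */
#ifndef ISSET
#define ISSET(t, f)	((t) & (f))
#endif

/* Illustrative flag bits only, not the kernel's values. */
#define TOY_RTF_STATIC	0x01
#define TOY_RTF_LOCAL	0x02
#define TOY_RTF_UP	0x04

/*
 * Mirror of the check in the diff: a neighbor entry may be
 * overwritten only if it is neither static nor local.
 */
int
toy_may_overwrite(unsigned int rt_flags)
{
	return !ISSET(rt_flags, TOY_RTF_STATIC | TOY_RTF_LOCAL);
}
```

An entry carrying only TOY_RTF_UP may be overwritten, while one carrying TOY_RTF_LOCAL (with or without other bits) may not, which is the behaviour the fix adds for local entries.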
Re: panic: kernel diagnostic assertion "!ISSET(rt->rt_flags, RTF_LOCAL)" failed
On 20/04/20(Mon) 15:44, Anton Lindqvist wrote: > > Index: netinet6/nd6.c > > === > > RCS file: /cvs/src/sys/netinet6/nd6.c,v > > retrieving revision 1.229 > > diff -u -p -r1.229 nd6.c > > --- netinet6/nd6.c 29 Nov 2019 16:41:01 - 1.229 > > +++ netinet6/nd6.c 20 Apr 2020 10:07:15 - > > @@ -306,6 +306,7 @@ nd6_llinfo_settimer(struct llinfo_nd6 *l > > time_t expire = time_uptime + secs; > > > > NET_ASSERT_LOCKED(); > > + KASSERT(!ISSET(ln->ln_rt->rt_flags, RTF_LOCAL)); > > > > ln->ln_rt->rt_expire = expire; > > if (!timeout_pending(&nd6_timer_to) || expire < nd6_timer_next) { > > > > Also found by syzkaller. > > https://syzkaller.appspot.com/bug?extid=0eb994ff432ae75e3369 Maybe, maybe not. Since the KASSERT() is in a timer we cannot be sure the entry has been inserted in the global list by the same code path. So it's hard to say if this is the same bug.
pty leak or corruption w/ openpty + dup2?
The program below is a reduced version of a syzkaller report [0]. After
running it, one is left without a usable console. A second execution will
make openpty(3) pick a different "/dev/tty*" node:

 50361 crash CALL ioctl(3,PTMGET,0x7f7eda80)
 50361 crash NAMI "/dev/ptypd"
 50361 crash NAMI "/dev/ttypd"
 50361 crash NAMI "/dev/ttypd"
 50361 crash RET ioctl 0

After some more tries:

 65559 crash CALL ioctl(3,PTMGET,0x7f7c36a0)
 65559 crash NAMI "/dev/ptypm"
 65559 crash NAMI "/dev/ttypm"
 65559 crash NAMI "/dev/ttypm"
 65559 crash RET ioctl 0

[0] https://syzkaller.appspot.com/bug?id=a74718ca902617e6aa7327aa008b25844eccf2d3

- crash.c -
#include <unistd.h>
#include <util.h>

int
main(void)
{
	char garbage[100];
	int master, slave;

	if (openpty(&master, &slave, NULL, NULL, NULL) == -1)
		return -1;

	if (dup2(master, master + 100) != -1)
		close(master);

	write(slave, garbage, 99);

	return 0;
}
Re: Xorg hangs on recent snapshots
Hello Mark,

Thanks for the report.

On 01/05/20(Fri) 16:51, Mark Patruck wrote:
> Problem:
>
> With amdgpu(4) enabled, everything runs fine and smooth for minutes,
> sometimes hours (especially if you don't start lots of programs), but
> all of a sudden X freezes. That means, you can move your mouse, ssh in,
> also top and other programs are still running, but you have to kill -9
> X, to get back to business. This only applies for Polaris 11-see Results
> below.

Such a 'freeze' is a symptom. If you can ssh into the machine when that
happens, a useful piece of information would be the output of:

	# ps -Sx -Owchan

Similarly, the output of "ps -S" would show where current threads are
blocking.

Another interesting piece of information would be the output of 'dmesg'
at that given moment. The kernel might have printed some valuable
information when something went wrong. Maybe /var/log/Xorg.0.log would
also contain valuable information.

These pieces of information might help us pinpoint the underlying
problem.

> [...]
> I know about this thread on freedesktop.org [1], but again...
> before buying sth new, i'd like to know about your findings.
>
> [1] https://bugs.freedesktop.org/show_bug.cgi?id=105733#c75

Do you know if it's the issue you're experiencing?
Re: pty leak or corruption w/ openpty + dup2?
On 01/05/20(Fri) 12:13, Anton Lindqvist wrote:
> The order in which the pty master/slave is closed seems to be the
> trigger here. While not duping the master, it's closed before the slave.
> In the opposite scenario, the slave is closed before the master. While
> closing the slave, it ends up here expressed as a simplified backtrace:
>
> tsleep()
> ttysleep()
> ttywait()
> ttywflush()
> ttylclose()
> ptsclose()
> fdfree()
> exit1()
>
> In order words, it ends up doing a tsleep(INFSLP) causing the thread to
> hang. Note that this is not the case when the master is closed before
> the slave since `tp->t_oproc == NULL' causing ttywait() to bail early.

Why is the sleeper never awakened? Does that mean a ttwakeup() is
missing?

> NetBSD does a sleep with a timeout in ttywflush(). I've applied the same
> approach in the diff below which does fix the hang.

This seems like a racy workaround for a bug that we do not fully
understand. If this is a proper solution I'd be happy to understand why.

If we go with such a fix we should be using a value in "nsecs" instead
of ticks, and INFSLP should be used instead of 0. We should refrain from
introducing new usages of `hz' ;)
Re: pty leak or corruption w/ openpty + dup2?
On 02/05/20(Sat) 10:40, Anton Lindqvist wrote:
> On Fri, May 01, 2020 at 05:17:36PM +0200, Martin Pieuchot wrote:
> > On 01/05/20(Fri) 12:13, Anton Lindqvist wrote:
> > > The order in which the pty master/slave is closed seems to be the
> > > trigger here. While not duping the master, it's closed before the slave.
> > > In the opposite scenario, the slave is closed before the master. While
> > > closing the slave, it ends up here expressed as a simplified backtrace:
> > >
> > > tsleep()
> > > ttysleep()
> > > ttywait()
> > > ttywflush()
> > > ttylclose()
> > > ptsclose()
> > > fdfree()
> > > exit1()
> > >
> > > In order words, it ends up doing a tsleep(INFSLP) causing the thread to
> > > hang. Note that this is not the case when the master is closed before
> > > the slave since `tp->t_oproc == NULL' causing ttywait() to bail early.
> >
> > Why is the sleeper never awaken? Does that mean a ttwakeup() is missing?
>
> In this case, the process is single threaded, about to exit and the only
> consumer of the pty. I don't see how it could be any other process
> responsibility to perform the wakeup.

Do we see that the issue is caused by the order in which descriptors are
closed in fdfree()? The current deadlock occurs because the duped master
has a higher fd number than the slave, which means it is still open when
the slave is closed.

But why would that be a problem? By default *close() functions,
including ttylclose(), are blocking. So any exiting process might end up
hanging in fdfree(). The diff below illustrates that: forcing all
*close() during exit1() to be non-blocking also fixes the issue.

Does it make sense to close fds as non-blocking when exiting? What
should a dying thread wait for? What can be the cons of such an
approach?

Now regarding your fix, why does it make sense to wait 5sec instead of
indefinitely? Did you look at r1.263 of NetBSD's kern/tty.c?
If we go with this change could you please change the 'timo' suffix and variables to 'nsec' and use uint64_t instead of int? Index: kern/vfs_vnops.c === RCS file: /cvs/src/sys/kern/vfs_vnops.c,v retrieving revision 1.114 diff -u -p -r1.114 vfs_vnops.c --- kern/vfs_vnops.c8 Apr 2020 08:07:51 - 1.114 +++ kern/vfs_vnops.c2 May 2020 09:18:28 - @@ -601,6 +601,7 @@ vn_closefile(struct file *fp, struct pro { struct vnode *vp = fp->f_data; struct flock lf; + unsigned int flag; int error; KERNEL_LOCK(); @@ -611,7 +612,10 @@ vn_closefile(struct file *fp, struct pro lf.l_type = F_UNLCK; (void) VOP_ADVLOCK(vp, (caddr_t)fp, F_UNLCK, &lf, F_FLOCK); } - error = vn_close(vp, fp->f_flag, fp->f_cred, p); + flag = fp->f_flag; + if (p != NULL && p->p_flag & P_WEXIT) + flag |= O_NONBLOCK; + error = vn_close(vp, flag, fp->f_cred, p); KERNEL_UNLOCK(); return (error); }
Re: pty leak or corruption w/ openpty + dup2?
On 02/05/20(Sat) 16:02, Mark Kettenis wrote: > > Date: Sat, 2 May 2020 11:33:17 +0200 > > From: Martin Pieuchot > > [...] > > Do we see that the issue is caused by the order in which descriptors are > > closed in fdfree()? The current deadlock occurs because the duped master > > has a higher fd number than the slave which means it is still open when the > > slave is closed. > > I'm sure we could construct an example where the file descriptors are > in a different oder. So changing the order is not going to help. Obviously :) > > But why would that be a problem? By default *close() functions, > > including ttylclose() are blocking. So any exiting process might end up > > hanging in fdfree(). Diff below illustrates that by forcing all *close() > > during exit1() to be non-blocking, it also fix the issue. > > I very much fear that is going to have unintended side-effects with > output not being flushed properly. And the process could still > deadlock itself by using close(2) directly isn't it? Indeed. > > Does it make sense to close fds as non-blocking when existing? What > > should a dying thread wait for? What can be the cons of such approach? > > > > Now regarding your fix, why does it make sense to wait 5sec instead of > > indefinitely? Did you look at r1.263 of NetBSD's kern/tty.c? If we go > > with this change could you please change the 'timo' suffix and variables > > to 'nsec' and use uint64_t instead of int? > > r1.263 was reverted in r1.264. Then r1.265 is the commit quoted by anton@. > There is also r2.267 which adds an additional fix to r1.265. > > In ttywait(), NetBSD only calls ttyflush() if there is a timeout. That > makes sense, because we have ttywflush() to combine the wait and flush > so ttywait() shouldn't flush when there is no error. Updated diff below reflecting those changes. I'm still questioning the 5sec timeout, but it is without doubt an improvement over the current behavior. 
The previously mentioned test as well as a modified version closing the slave before exit(2) now hang for 5 seconds instead of deadlocking indefinitely. I believe we want that for release, ok? Index: kern/tty.c === RCS file: /cvs/src/sys/kern/tty.c,v retrieving revision 1.154 diff -u -p -r1.154 tty.c --- kern/tty.c 7 Apr 2020 13:27:51 - 1.154 +++ kern/tty.c 6 May 2020 07:44:53 - @@ -80,6 +80,8 @@ void filt_ttyrdetach(struct knote *kn); intfilt_ttywrite(struct knote *kn, long hint); void filt_ttywdetach(struct knote *kn); void ttystats_init(struct itty **, size_t *); +intttywait_nsec(struct tty *tp, uint64_t nsecs); +intttysleep_nsec(struct tty *, void *, int, char *, uint64_t); /* Symbolic sleep message strings. */ char ttclos[] = "ttycls"; @@ -1202,10 +1204,10 @@ ttnread(struct tty *tp) } /* - * Wait for output to drain. + * Wait for output to drain, or if this times out, flush it. */ int -ttywait(struct tty *tp) +ttywait_nsec(struct tty *tp, uint64_t nsecs) { int error, s; @@ -1219,7 +1221,10 @@ ttywait(struct tty *tp) (ISSET(tp->t_state, TS_CARR_ON) || ISSET(tp->t_cflag, CLOCAL)) && tp->t_oproc) { SET(tp->t_state, TS_ASLEEP); - error = ttysleep(tp, &tp->t_outq, TTOPRI | PCATCH, ttyout); + error = ttysleep_nsec(tp, &tp->t_outq, TTOPRI | PCATCH, + ttyout, nsecs); + if (error == EWOULDBLOCK) + ttyflush(tp, FWRITE); if (error) break; } else @@ -1229,6 +1234,12 @@ ttywait(struct tty *tp) return (error); } +int +ttywait(struct tty *tp) +{ + return (ttywait_nsec(tp, INFSLP)); +} + /* * Flush if successfully wait. 
*/ @@ -1237,7 +1248,8 @@ ttywflush(struct tty *tp) { int error; - if ((error = ttywait(tp)) == 0) + error = ttywait_nsec(tp, SEC_TO_NSEC(5)); + if (error == 0 || error == EWOULDBLOCK) ttyflush(tp, FREAD); return (error); } @@ -2281,11 +2293,18 @@ tputchar(int c, struct tty *tp) int ttysleep(struct tty *tp, void *chan, int pri, char *wmesg) { + + return (ttysleep_nsec(tp, chan, pri, wmesg, INFSLP)); +} + +int +ttysleep_nsec(struct tty *tp, void *chan, int pri, char *wmesg, uint64_t nsecs) +{ int error; short gen; gen = tp->t_gen; - if ((error = tsleep_nsec(chan, pri, wmesg, INFSLP)) != 0) + if ((error = tsleep_nsec(chan, pri, wmesg, nsecs)) != 0) return (error); return (tp->t_gen == gen ? 0 : ERESTART); }
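To make the intended behaviour of the diff above easier to follow, here is a toy userland model of "wait for output to drain, or if this times out, flush it". It is only a sketch: `struct outq`, outq_flush() and outq_wait() are invented names, time is counted in abstract ticks instead of nanoseconds, and no real sleeping is done.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a tty output queue. */
struct outq {
	size_t	pending;	/* bytes still queued */
	size_t	drain_rate;	/* bytes consumed per tick, 0 if nobody reads */
};

/* Drop whatever is still queued. */
static void
outq_flush(struct outq *q)
{
	q->pending = 0;
}

/*
 * Wait up to max_ticks for the queue to drain.  On timeout, flush the
 * queue and report EWOULDBLOCK instead of waiting forever.
 */
static int
outq_wait(struct outq *q, unsigned int max_ticks)
{
	unsigned int t;

	for (t = 0; q->pending > 0; t++) {
		if (t == max_ticks) {
			outq_flush(q);
			return EWOULDBLOCK;
		}
		/* One tick passes: the reader consumes some output. */
		if (q->drain_rate >= q->pending)
			q->pending = 0;
		else
			q->pending -= q->drain_rate;
	}
	return 0;
}
```

A queue with drain_rate 0 corresponds to the test program, where nobody ever reads from the duped master: an unbounded wait never returns, while the bounded one gives up, discards the output and lets the close proceed.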
wsemul_vt100 & wsmux's ioctl rwlock taken in interrupt context
The following backtrace, found by robert@'s syzkaller, exposes a context /
locking issue related to wsmux's ioctl rwlock:

panic: acquiring blockable sleep lock with spinlock or critical section held (rwlock) wsmuxlk
trace:
	panic+0x15c
	witness_checkorder+0x10e0
	rw_enter_read+0x66
	wsmux_do_displayioctl+0x7e
	wsdisplay_emulbell+0x68
	wsemul_vt100_output_c0c1+0x2f5
	wsemul_vt100_output+0x34e
	wsdisplaystart+0x396
	ttrstrt+0x4b
	timeout_run+0xc4
	softclock+0x175
	softintr_dispatch+0x107
	Xsoftclock+0x1f

Grabbing `sc_lock' should obviously not be possible from softclock
context. I'm not sure what's the best way to fix this issue.

timeout_set_proc(9) will make the warning disappear but is it the right
thing to do? Are there other interrupt-context paths that can enter this
code?

The lock has been introduced to prevent access to `sc_cld' in case a
thread was sleeping in the middle of an operation. Are we sure those
sleeping points cannot be reached by entry points from interrupt
context? Did we consider fixes other than a lock?
Re: OpenBSD 6.7 crashes on APU2C4 with LTE modem Huawei E3372s-153 HiLink
On 25/05/20(Mon) 12:56, Gerhard Roth wrote: > On 5/22/20 9:05 PM, Mark Kettenis wrote: > > > From: Łukasz Lejtkowski > > > Date: Fri, 22 May 2020 20:51:57 +0200 > > > > > > Probably power supply 12 V is broken. Showing 16,87 V(Fluke 179) - > > > too high. Should be 12,25-12,50 V. I replaced to the new one. > > > > That might be why the device stops responding. The fact that cleaning > > up from a failed USB transaction leads to this panic is a bug though. > > > > And somebody just posted a very similar panic with ure(4). Something > > in the network stack is holding a mutex when it shouldn't. > > I think that holding the mutex is ok. The bug is calling the stop > routine in case of errors. > > This is what common foo_start() does: > > m_head = ifq_deq_begin(&ifp->if_snd); > if (foo_encap(sc, m_head, 0)) { > ifq_deq_rollback(&ifp->if_snd, m_head); > ... > return; > } > ifq_deq_commit(&ifp->if_snd, m_head); > > Here, ifq_deq_begin() grabs a mutex and it is held while > calling foo_encap(). > > For USB network interfaces foo_encap() mostly does this: > > err = usbd_transfer(sc->sc_xfer); > if (err != USBD_IN_PROGRESS) { > foo_stop(sc); > return EIO; > } > > And foo_stop() calls usbd_abort_pipe() -> xhci_command_submit(), > which might sleep. > > How to fix? We could do the foo_encap() after the ifq_deq_commit(), > possibly dropping the current mbuf if encap fails (who cares > for the packets after foo_stop() anyway). That's the approach taken by drivers using ifq_dequeue(9) instead of ifq_deq_begin/commit(). > Or change all the drivers to follow the path that if_aue.c takes: > > err = usbd_transfer(c->aue_xfer); > if (err != USBD_IN_PROGRESS) { > ... > /* Stop the interface from process context. */ > usb_add_task(sc->aue_udev, &sc->aue_stop_task); > return (EIO); > } That's just trading the current problem for another one with higher complexity. > Any ideas, what's better? Or alternative proposals? 
Using ifq_dequeue(9) would have the advantage of unifying the code base.
It does, however, introduce a behavior change. A simpler fix would be to
call foo_stop() in the error path after ifq_deq_rollback().
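The two orderings discussed in this thread can be compared with a toy model. This is a userland sketch, not driver code: `struct sendq`, model_stop() and both start routines are invented for the example, and the "mutex" is just a flag so that a stop routine running with the lock held can be detected.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a send queue whose lock must not be held across a sleep. */
struct sendq {
	int	locked;		/* 1 while the queue mutex is held */
	int	len;		/* packets waiting */
	int	stop_calls;	/* how often the stop routine ran */
	int	slept_locked;	/* set if stop ran with the lock held */
};

/* Stand-in for foo_stop(): it may sleep, so it must run unlocked. */
static void
model_stop(struct sendq *q)
{
	if (q->locked)
		q->slept_locked = 1;
	q->stop_calls++;
}

/* Begin/rollback style: the error path runs with the lock still held. */
static void
start_begin_commit(struct sendq *q, int encap_fails)
{
	q->locked = 1;			/* like ifq_deq_begin() */
	if (encap_fails) {
		model_stop(q);		/* called from encap, lock held */
		q->locked = 0;		/* like ifq_deq_rollback() */
		return;
	}
	q->len--;
	q->locked = 0;			/* like ifq_deq_commit() */
}

/* Dequeue-first style: take the packet, release the lock, then encap. */
static void
start_dequeue(struct sendq *q, int encap_fails)
{
	q->locked = 1;
	q->len--;			/* packet now owned by the driver */
	q->locked = 0;			/* like returning from ifq_dequeue() */
	if (encap_fails)
		model_stop(q);		/* nothing is held; the packet is dropped */
}
```

The model also shows the behavior change mentioned above: in the dequeue-first variant a packet whose encapsulation fails is gone from the queue, whereas the begin/rollback variant leaves it queued.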
Re: X hangs
On 29/05/20(Fri) 15:57, Visa Hankala wrote: > On Fri, May 29, 2020 at 04:27:46PM +0200, Alexandre Ratchov wrote: > > On Thu, May 28, 2020 at 01:41:43PM +0100, Stuart Henderson wrote: > > > uaudio0 at uhub7 port 2 configuration 1 interface 1 "GN Netcom GN 9350" > > > rev 2.00/1.00 addr 7 > > > uaudio0: class v1, full-speed, sync, channels: 1 play, 1 rec, 4 ctls > > > audio1 at uaudio0 > > > uhidev0 at uhub7 port 2 configuration 1 interface 3 "GN Netcom GN 9350" > > > rev 2.00/1.00 addr 7 > > > uhidev0: iclass 3/0 > > > uhid0 at uhidev0: input=2, output=2, feature=0 > > > uaudio0: can't reset interface > > > uaudio0: can't reset interface > > > audio1 detached > > > uaudio0 detached > > > uhid0 detached > > > uhidev0 detached > > > RA\xaf\xdeRA\xaf\xdeRA\xaf\xdeRA\xaf\xdeRA\xaf\xdeRA\xaf\xdeRA\xaf\xde: > > > can't set interface > > > kernel: protection fault trap, code=0 > > > Stopped at uaudio_stream_close+0x8a: movzbl 0x8(%r12),%esi > > > ddb{3}> [-- sthen@localhost attached -- Thu May 28 11:58:19 2020] > > > > > > ddb{3}> > > > ddb{3}> tr > > > uaudio_stream_close(81dfb000,1) at uaudio_stream_close+0x8a > > > uaudio_stream_open(81dfb000,1,801e8000,801eaa80,2a8,816f7630) > > > at uaudio_stream_open+0x761 > > > uaudio_trigger_output(81dfb000,801e8000,801eaa80,2a8,816f7630,81e95c00) > > > at uaudio_trigger_output+0x47 > > > audio_start_do(81e95c00) at audio_start_do+0xb5 > > > audioioctl(2a01,20004126,800035a74470,7,800034fe6750) at > > > audioioctl+0x71 > > > VOP_IOCTL(fd867a72e9e0,20004126,800035a74470,7,fd84fea6f9c0,800034fe6750) > > > at VOP_IOCTL+0x55 > > > vn_ioctl(fd867d490f10,20004126,800035a74470,800034fe6750) at > > > vn_ioctl+0x75 > > > sys_ioctl(800034fe6750,800035a74580,800035a745e0) at > > > sys_ioctl+0x2df > > > syscall(800035a74650) at syscall+0x389 > > > Xsyscall() at Xsyscall+0x128 > > > end of kernel > > > > According to dmesg, audio1 was detached, so we shouldn't enter > > audio_start_do(). 
> > > > At this point the DVF_ACTIVE flag is clear; audioioctl() calls > > device_lookup() which is supposed to return NULL in this case, so > > ioctl() is supposed to return ENXIO, not attempt to start playback. > > Lets assume that audio_start_do() started when the device was still > attached to the system. In that case device_lookup() returned a pointer > to a good softc. This is supported by the fact that audio_start_do() did > not crash earlier. > > Did usbd_set_interface() block for a moment, letting the detachment > happen? The trace suggests that usbd_set_interface() failed, and when > audio_start_do() resumed, sc pointed to freed memory. The audio(4) drivers has an unaccounted reference to uaudio(4)'s softc. So when the USB thread responsible for detaching device kicks in to clean up the software state of an uaudio(4), it first spins on the KERNEL_LOCK(). If any of the threads playing/recording audio sleeps while holding an unaccounted reference to uaudio(4)'s softc, the above issue can happen. A way to fix this is to use usbd_ref_incr(9) and its counterpart usbd_ref_wait(9) in uaudio_detach(). I'm not sure if it's possible for audio(4) to increment the reference only once. Is there a place where such increment/decrement can be put? Otherwise every operation should do the dance.
Re: ipmi problem introduced with sys/conf.h 1.150 enodev->selfalse
On 28/06/20(Sun) 22:17, Stuart Henderson wrote: > Thanks to Jens A. Griepentrog for reporting and bisecting, we discovered > that sys/conf.h r1.150 broke /dev/ipmi. I found a machine to test on and > reverting the commit fixes things, but given the commit message I guess > the diff below (which also fixes it) might be better? Thanks for the finding. Your diff is indeed better and is ok mpi@. Could you please commit the version below that adds a matching kqfilter filter for `seltrue' as well? That will allow us to keep the behavior when switching poll(2) to use kqueue filters. Index: sys/conf.h === RCS file: /cvs/src/sys/sys/conf.h,v retrieving revision 1.152 diff -u -p -r1.152 conf.h --- sys/conf.h 26 May 2020 07:53:00 - 1.152 +++ sys/conf.h 29 Jun 2020 07:22:40 - @@ -473,8 +473,8 @@ extern struct cdevsw cdevsw[]; #define cdev_ipmi_init(c,n) { \ dev_init(c,n,open), dev_init(c,n,close), (dev_type_read((*))) enodev, \ (dev_type_write((*))) enodev, dev_init(c,n,ioctl), \ - (dev_type_stop((*))) enodev, 0, selfalse, \ - (dev_type_mmap((*))) enodev, 0 } + (dev_type_stop((*))) enodev, 0, seltrue, (dev_type_mmap((*))) enodev, \ + 0, 0, seltrue_kqfilter } /* open, close, ioctl, mmap */ #define cdev_kcov_init(c,n) { \
Re: Supermicro X10SDV-TP8F with USB3 won't boot
On 06/05/16(Fri) 01:13, Hrvoje Popovski wrote: > Hi, > > I've got > http://www.supermicro.com/products/motherboard/Xeon/D/X10SDV-TP8F.cfm > for my openbsd lab. Default BIOS settings for usb is USB3 and with that > settings i can't install openbsd on it, or boot installed openbsd. > I have installed openbsd with disabled USB3 ie. USB2, complie kernel > with USB_DEBUG, EHCI_DEBUG, XHCI_DEBUG, UHCI_DEBUG, enable USB3 in BIOS > and boot... this is screenshot.. > http://kosjenka.srce.hr/~hrvoje/openbsd/usb.jpg Could you please tell me if the diff below solves your problem? Index: xhci_pci.c === RCS file: /cvs/src/sys/dev/pci/xhci_pci.c,v retrieving revision 1.7 diff -u -p -r1.7 xhci_pci.c --- xhci_pci.c 2 Nov 2015 14:53:10 - 1.7 +++ xhci_pci.c 31 May 2016 16:36:14 - @@ -258,8 +258,9 @@ xhci_pci_takecontroller(struct xhci_pci_ eec = -1; /* Synchronise with the BIOS if it owns the controller. */ - for (xecp = XHCI_HCC_XECP(cparams) << 2; xecp != 0; - xecp = XHCI_XECP_NEXT(eec) << 2) { + for (xecp = XHCI_HCC_XECP(cparams) << 2; + xecp != 0 && XHCI_XECP_NEXT(eec); + xecp += XHCI_XECP_NEXT(eec) << 2) { eec = XREAD4(&psc->sc, xecp); if (XHCI_XECP_ID(eec) != XHCI_ID_USB_LEGACY) continue;
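The essence of the diff above is that the "next" field of an extended capability is added to the current offset instead of replacing it. A minimal sketch of such a walk over a mock register array follows; CAP_ID(), CAP_NEXT() and cap_find() are names invented for this example, and the field layout (id in bits 0-7, next in bits 8-15, both counted in 32-bit words) is an assumption stated here rather than taken from the driver headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field layout of an extended capability header (hypothetical macro names). */
#define CAP_ID(x)	((x) & 0xff)
#define CAP_NEXT(x)	(((x) >> 8) & 0xff)

/*
 * Walk a capability list stored in regs[] (one entry per 32-bit register)
 * and return the byte offset of the capability with the given id, or 0.
 * `first' and each NEXT field count 32-bit words; NEXT is relative to the
 * current capability, so it is added to the offset rather than assigned.
 */
static uint32_t
cap_find(const uint32_t *regs, size_t nregs, uint32_t first, uint32_t id)
{
	uint32_t off = first << 2;
	uint32_t hdr;

	while (off != 0 && (off >> 2) < nregs) {
		hdr = regs[off >> 2];
		if (CAP_ID(hdr) == id)
			return off;
		if (CAP_NEXT(hdr) == 0)
			break;		/* end of list */
		off += CAP_NEXT(hdr) << 2;
	}
	return 0;
}
```

If the offset were assigned instead of incremented, a list whose second capability sits 3 words after the first would be looked up at word 3 instead, and the legacy capability would never be found.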
Re: Supermicro X10SDV-TP8F with USB3 won't boot
On 31/05/16(Tue) 21:11, Evgeniy Sudyr wrote: > Hrvoje, > > looks my last comment was wrong. I apologise for detracting from this > important discussion. > > We all need support / fix for xhci(4) driver instead of disabling USB > 3.0 support. > > I have same issue on my desktop with Asus z170-k mainboard which is > based on Intel z170 chipset http://ark.intel.com/products/90591 > > Also my friend have Intel C236 chipset > http://ark.intel.com/products/90594 and he also have same issue on > board with both acpi and xhci > http://www.supermicro.com/products/motherboard/Xeon/C236_C232/X11SSH-TF.cfm > > I will be glad to help test patches on hardware above. I just committed the fix, please wait for the next snapshot or build from sources. Martin
Re: suspend resumes immediately on Toshiba Portege R30-A-1CD
On 06/06/16(Mon) 23:20, Giovanni Bechis wrote:
> On Sun, Jun 05, 2016 at 09:39:23PM +0200, giova...@paclan.it wrote:
> > >Synopsis: if I suspend my Toshiba laptop it resumes immediately
> > >Category: kernel/acpi
> > >Environment:
> > System : OpenBSD 6.0
> > Details : OpenBSD 6.0-beta (GENERIC.MP) #2150: Mon May 30 20:21:47
> > MDT 2016
> >
> > dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> >
> > Architecture: OpenBSD.amd64
> > Machine : amd64
> > >Description:
> > If I suspend my Toshiba Portege R30-A-1CD laptop with lid close or with
> > zzz(8) it resumes immediately.
> > Hybernation works as expected.
> > >How-To-Repeat:
> > Suspend a Toshiba R30-A-1CD laptop.
> > >Fix:
> > Unknown.
> >
> by disabling xhci(4) via config(8) I can suspend and resume, a strange
> thing is that to resume I have to press the power button, any other key does
> not do the job.

What does your dmesg look like after trying to suspend?
Re: Touchscreen device is not calibrated after wake from suspend
On 04/06/16(Sat) 14:49, Edd Barrett wrote: > Hi, > > My x240t has a touch screen and a stylus. It works well upon first boot > (asides from the pointer co-ordinates are not yet translated when the > screen is rotated). However, after suspend and wake, placing the pen on > or near the screen will cause the pointer to jump to the bottom right > hand corner of the screen. > > I can fix this with: > > $ xinput --disable /dev/wsmouse3 > $ xinput --enable /dev/wsmouse3 > > Looking in my mail archives, I see that I spoke to mpi and matthieu > about this a while back (adding to CC). We did not find a suitable fix > and the problem persists. > > Matthieu's working theory was (and still is Matthieu?) as follows: > > ---8<--- > * Machine is resuming > * X comes back and at the same time USB devices reattaches > * the above is racy, so sometimes X comes back before its previous input > devices are back > * When that happens, X cannot reopen the input device, so it disables > it (but not cleanly - thats another issue I want to look at) > * When the USB device is reattached later, it gets back to the mux > * xf86-input-ws only gets events though the mux and thus can't apply > proper calibration > --->8--- The problem starts when you suspend. Your device is detached. So next time X will try to read from the corresponding /dev/wsmouse1 node it will fail. Now in practice X reads after resuming. Maybe Ulf has an idea of how to move the calibration logic to the kernel such that as soon as a new device attaches, it gets calibrated to match the corresponding screen.
Re: panic in upd
On 01/06/16(Wed) 16:39, Martijn van Duren wrote:
> [...]
> upd0 detached
> uhidev0 detached
> kernel: protection fault trap, code=0
> Stopped at upd_sensor_invalidate+0xe: movq 0xc8(%rsi),%rbx
> ddb{0}> trace
> upd_sensor_invalidate() at upd_sensor_invalidate+0xe
> upd_update_report_cb() at upd_update_report_cb+0x5b
> uhidev_get_report_async_cb() at uhidev_get_report_async_cb+0x39
> usb_transfer_complete() at usb_transfer_complete+0x26c
> xhci_event_command() at xhci_event_command+0x1c8
> xhci_event_dequeue() at xhci_event_dequeue+0x8a
> xhci_softintr() at xhci_softintr+0x21
> softintr_dispatch() at softintr_dispatch+0x8b
> end of kernel
> end trace frame: 0x72defae4a00, count: -8

This looks like a race between the asynchronous callback and the device
being detached. The problem is that the driver has already freed its
memory when the transfer completes. By checking if the device is dying
before calling the callback we should prevent such a crash.

Could you at least confirm that the diff below does not introduce any
regression?
Index: uhidev.c === RCS file: /cvs/src/sys/dev/usb/uhidev.c,v retrieving revision 1.73 diff -u -p -r1.73 uhidev.c --- uhidev.c9 Jan 2016 04:14:42 - 1.73 +++ uhidev.c7 Jun 2016 15:21:15 - @@ -96,8 +96,7 @@ void uhidev_attach(struct device *, stru int uhidev_detach(struct device *, int); int uhidev_activate(struct device *, int); -void uhidev_get_report_async_cb(struct usbd_xfer *xfer, void *priv, -usbd_status status); +void uhidev_get_report_async_cb(struct usbd_xfer *, void *, usbd_status); struct cfdriver uhidev_cd = { NULL, "uhidev", DV_DULL @@ -754,17 +753,19 @@ uhidev_get_report_async_cb(struct usbd_x char *buf; int len = -1; - if (err == USBD_NORMAL_COMPLETION || err == USBD_SHORT_XFER) { - len = xfer->actlen; - buf = KERNADDR(&xfer->dmabuf, 0); - if (info->id > 0) { - len--; - memcpy(info->data, buf + 1, len); - } else { - memcpy(info->data, buf, len); + if (!usbd_is_dying(xfer->pipe->device)) { + if (err == USBD_NORMAL_COMPLETION || err == USBD_SHORT_XFER) { + len = xfer->actlen; + buf = KERNADDR(&xfer->dmabuf, 0); + if (info->id > 0) { + len--; + memcpy(info->data, buf + 1, len); + } else { + memcpy(info->data, buf, len); + } } + info->callback(info->priv, info->id, info->data, len); } - info->callback(info->priv, info->id, info->data, len); free(info, M_TEMP, sizeof(*info)); usbd_free_xfer(xfer); }
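The shape of the fix above, reduced to a userland sketch: the completion handler always releases its own bookkeeping, but only calls back into the driver if the device is not being torn down. All names here (`struct dev_model`, `struct async_info`, async_alloc(), async_done(), count_cb()) are hypothetical and exist only for the illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical device state: `dying' is set once detach has started. */
struct dev_model {
	int	dying;
};

/* Bookkeeping for one asynchronous request. */
struct async_info {
	struct dev_model *dev;
	void	(*callback)(void *priv, int len);
	void	*priv;
};

/* Example callback: counts how often it was invoked. */
static void
count_cb(void *priv, int len)
{
	(void)len;
	(*(int *)priv)++;
}

static struct async_info *
async_alloc(struct dev_model *dev, void (*cb)(void *, int), void *priv)
{
	struct async_info *info = malloc(sizeof(*info));

	if (info != NULL) {
		info->dev = dev;
		info->callback = cb;
		info->priv = priv;
	}
	return info;
}

/*
 * Completion handler: the request is always freed, but the callback is
 * skipped when the device is dying, because its private data may already
 * be gone.
 */
static void
async_done(struct async_info *info, int len)
{
	if (info == NULL)
		return;
	if (!info->dev->dying)
		info->callback(info->priv, len);
	free(info);
}
```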
Re: uvm_fault in ip6_output_ipsec_lookup() / ip6_output()
On 14/06/16(Tue) 15:18, Florian Obser wrote:
> Hi,
> I'm seeing this panic on my v6 gateway running in a vm (don't ask):
> It has a v6 tunnel via HE on gif0.
>
> I hope I copied all relevant information, if not, my appologies, I'm
> in a hurry currently, please just ask for more.
>
> I will probably investigate more when I'm home :)
>
> panic: trap type 6, code=0, pc=812fe70f
> Starting stack trace...
> panic() at panic+0x10b
> trap() at trap+0x7b8
> --- trap (number 6) ---
> ip6_output_ipsec_lookup() at ip6_output_ipsec_lookup+0x6f
> ip6_output() at ip6_output+0x21c
> esp_output_cb() at esp_output_cb+0x135
> taskq_thread() at taskq_thread+0x6c
> end trace frame: 0x0, count: 251
> End of stack trace.
> syncing disks... done

This seems to be an invalid `tdbi' dereference in
ip6_output_ipsec_lookup():

2890:		tdbi = (struct tdb_ident *)(mtag + 1);
2891: HERE ->	if (tdbi->spi == tdb->tdb_spi &&
2892:		    tdbi->proto == tdb->tdb_sproto &&
...

Markus, Mike any idea how this could happen?

> on:
>
> OpenBSD 6.0-beta (GENERIC.MP) #2165: Thu Jun 2 08:37:59 MDT 2016
> dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
>
>
> it has ddb.panic=0 but I can change that when I'm home.
> > [florian@openbsd:~]$ doas cat /etc/ipsec.conf > ike esp from 2001:470:7afd::1 \ > to 2a02:d40:3:1:4c7:b9ff:fede:705f \ > psk XXX > > ike esp from 2001:470:7afd:1::1 \ > to 2a02:d40:3:1:4c7:b9ff:fede:705f \ > psk XXX > > ike esp from 2001:470:1f14:47e::2 \ > to 2a02:d40:3:1:4c7:b9ff:fede:705f \ > psk XXX > > I can trigger the panic when the flows are up and I do this on the > remote system: > > [florian@tlakh:~]$ ping6 -I 2a02:d40:3:1:4c7:b9ff:fede:705f 2001:470:7afd::1 > > > [florian@openbsd:~]$ doas ipsecctl -sa > FLOWS: > flow esp in from 2a02:d40:3:1:4c7:b9ff:fede:705f to 2001:470:1f14:47e::2 peer > 2a02:d40:3:1:4c7:b9ff:fede:705f srcid 2001:470:1f14:47e::2/128 dstid > 2a02:d40:3:1:4c7:b9ff:fede:705f/128 type use > flow esp out from 2001:470:1f14:47e::2 to 2a02:d40:3:1:4c7:b9ff:fede:705f > peer 2a02:d40:3:1:4c7:b9ff:fede:705f srcid 2001:470:1f14:47e::2/128 dstid > 2a02:d40:3:1:4c7:b9ff:fede:705f/128 type require > flow esp in from 2a02:d40:3:1:4c7:b9ff:fede:705f to 2001:470:7afd::1 peer > 2a02:d40:3:1:4c7:b9ff:fede:705f srcid 2001:470:1f14:47e::2/128 dstid > 2a02:d40:3:1:4c7:b9ff:fede:705f/128 type use > flow esp out from 2001:470:7afd::1 to 2a02:d40:3:1:4c7:b9ff:fede:705f peer > 2a02:d40:3:1:4c7:b9ff:fede:705f srcid 2001:470:1f14:47e::2/128 dstid > 2a02:d40:3:1:4c7:b9ff:fede:705f/128 type require > flow esp in from 2a02:d40:3:1:4c7:b9ff:fede:705f to 2001:470:7afd:1::1 peer > 2a02:d40:3:1:4c7:b9ff:fede:705f srcid 2001:470:1f14:47e::2/128 dstid > 2a02:d40:3:1:4c7:b9ff:fede:705f/128 type use > flow esp out from 2001:470:7afd:1::1 to 2a02:d40:3:1:4c7:b9ff:fede:705f peer > 2a02:d40:3:1:4c7:b9ff:fede:705f srcid 2001:470:1f14:47e::2/128 dstid > 2a02:d40:3:1:4c7:b9ff:fede:705f/128 type require > > SAD: > esp tunnel from 2001:470:1f14:47e::2 to 2a02:d40:3:1:4c7:b9ff:fede:705f spi > 0x07b097ae auth hmac-sha2-256 enc aes > esp tunnel from 2001:470:1f14:47e::2 to 2a02:d40:3:1:4c7:b9ff:fede:705f spi > 0x471d9a35 auth hmac-sha2-256 enc aes > esp tunnel from 
2001:470:1f14:47e::2 to 2a02:d40:3:1:4c7:b9ff:fede:705f spi > 0x4d6962f0 auth hmac-sha2-256 enc aes > esp tunnel from 2a02:d40:3:1:4c7:b9ff:fede:705f to 2001:470:1f14:47e::2 spi > 0x546e354d auth hmac-sha2-256 enc aes > esp tunnel from 2a02:d40:3:1:4c7:b9ff:fede:705f to 2001:470:1f14:47e::2 spi > 0x9d83602b auth hmac-sha2-256 enc aes > esp tunnel from 2a02:d40:3:1:4c7:b9ff:fede:705f to 2001:470:1f14:47e::2 spi > 0xe2d99e91 auth hmac-sha2-256 enc aes > > > [florian@openbsd:~]$ netstat -rn > Routing tables > > Internet: > DestinationGatewayFlags Refs Use Mtu Prio Iface > default192.168.2.254 UGS 17 4831 - 8 vio0 > 224/4 127.0.0.1 URS00 32768 8 lo0 > 10.11.12/2410.11.12.1 UC 10 - 4 vio1 > 10.11.12.1 52:54:00:15:bb:62 UHLl 01 - 1 vio1 > 10.11.12.3252:54:00:dc:6f:cd UHLc 0 144 - 4 vio1 > 10.11.12.255 10.11.12.1 UHb00 - 1 vio1 > 127/8 127.0.0.1 UGRS 00 32768 8 lo0 > 127.0.0.1 127.0.0.1 UHl 12 1129 32768 1 lo0 > 192.168.2/24 192.168.2.253 UC 22 - 4 vio0 > 192.168.2.180:ee:73:67:d1:9c UHLc 18 - 4 vio0 > 192.168.2.253 52:54:00:1a:59:59 UHLl 1 9560 - 1 vio0 > 192.168.2.254 4c:09:d4:ca:0c:b2 UHLc 15 - 4 vio0 > 192.168.2.255 192.168.2.253 UHb0 14 - 1 vio0 > > Internet6: > Destination
Re: uvm_fault in ip6_output_ipsec_lookup() / ip6_output()
On 14/06/16(Tue) 20:10, Florian Obser wrote: > On Tue, Jun 14, 2016 at 06:26:00PM +0200, Martin Pieuchot wrote: > > On 14/06/16(Tue) 15:18, Florian Obser wrote: > > > Hi, > > > I'm seeing this panic on my v6 gateway running in a vm (don't ask): > > > It has a v6 tunnel via HE on gif0. > > > > > > I hope I copied all relevant information, if not, my appologies, I'm > > > in a hurry currently, please just ask for more. > > > > > > I will probably investigate more when I'm home :) > > > > > > panic: trap type 6, code=0, pc=812fe70f > > > Starting stack trace... > > > panic() at panic+0x10b > > > trap() at trap+0x7b8 > > > --- trap (number 6) --- > > > ip6_output_ipsec_lookup() at ip6_output_ipsec_lookup+0x6f > > > ip6_output() at ip6_output+0x21c > > > esp_output_cb() at esp_output_cb+0x135 > > > taskq_thread() at taskq_thread+0x6c > > > end trace frame: 0x0, count: 251 > > > End of stack trace. > > > syncing disks... done > > > > This seems to be an invalid `'tdbi'' dereference in > > ip6_output_ipsec_lookup(): > > > > 2890: tdbi = (struct tdb_ident *)(mtag + 1); > > 2891: HERE -> if (tdbi->spi == tdb->tdb_spi && > > 2892: tdbi->proto == tdb->tdb_sproto && > > ... > > > > Markus, Mike any idea how this could happen? > > I tracked it down to ref 1.89 of ip6_forward.c / ref 1.205 ip6_output.c: > "factor out ipsec into ip6_output_ipsec_{lookup,send}(); ok mpi@, naddy@" > > The problem is that we are not exiting the "loop detection" for loop > when tdb is set to NULL. We enter again and dereference tdb -> boom. > > The following diff makes ip6_output_ipsec_lookup() similar to > ip_output_ipsec_lookup(). > It's easier to see what the diff is doing by applying and doing diff -b. > OK? ok mpi@ > p.s. I also note that the v4 and v6 version are really similiar, we > can probably merge them. Wonder if it's worth it or if it's best to > keep v4 and v6 seperate... For the moment we're trying to reduce the size of "#ifdef IPSEC" chunks inside the IP paths. 
But reducing differences between v4 and v6 by reusing code is a good thing. > diff --git ip6_output.c ip6_output.c > index 64eea86..3adaa7d 100644 > --- ip6_output.c > +++ ip6_output.c > @@ -2882,21 +2882,21 @@ ip6_output_ipsec_lookup(struct mbuf *m, int *error, > struct inpcb *inp) > tdb = ipsp_spd_lookup(m, AF_INET6, sizeof(struct ip6_hdr), > error, IPSP_DIRECTION_OUT, NULL, inp, 0); > > - if (tdb != NULL) { > - /* Loop detection */ > - for (mtag = m_tag_first(m); mtag != NULL; > - mtag = m_tag_next(m, mtag)) { > - if (mtag->m_tag_id != PACKET_TAG_IPSEC_OUT_DONE) > - continue; > - tdbi = (struct tdb_ident *)(mtag + 1); > - if (tdbi->spi == tdb->tdb_spi && > - tdbi->proto == tdb->tdb_sproto && > - tdbi->rdomain == tdb->tdb_rdomain && > - !bcmp(&tdbi->dst, &tdb->tdb_dst, > - sizeof(union sockaddr_union))) > - tdb = NULL; > + if (tdb == NULL) > + return NULL; > + /* Loop detection */ > + for (mtag = m_tag_first(m); mtag != NULL; mtag = m_tag_next(m, mtag)) { > + if (mtag->m_tag_id != PACKET_TAG_IPSEC_OUT_DONE) > + continue; > + tdbi = (struct tdb_ident *)(mtag + 1); > + if (tdbi->spi == tdb->tdb_spi && > + tdbi->proto == tdb->tdb_sproto && > + tdbi->rdomain == tdb->tdb_rdomain && > + !memcmp(&tdbi->dst, &tdb->tdb_dst, > + sizeof(union sockaddr_union))) { > + /* no IPsec needed */ > + return NULL; > } > - /* We need to do IPsec */ > } > return tdb; > } > > > -- > I'm not entirely sure you are real. >
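The control-flow change in Florian's diff can be shown with a stripped-down model. In the old shape, `tdb = NULL` was assigned inside the loop and the loop kept running, so a second matching tag dereferenced the NULL pointer; the fixed shape returns as soon as the first match is found. The types and names below (`struct tag`, `struct sa`, TAG_OUT_DONE, lookup_sa()) are invented for this sketch and only compare one field.

```c
#include <assert.h>
#include <stddef.h>

#define TAG_OUT_DONE	1	/* hypothetical "already processed" tag id */

/* Hypothetical packet tag, chained like mbuf tags. */
struct tag {
	int		id;
	unsigned int	spi;
	struct tag	*next;
};

/* Hypothetical security association. */
struct sa {
	unsigned int	spi;
};

/*
 * Return the SA to use for this packet, or NULL if none is needed.
 * Loop detection: if a tag says the packet already went through this
 * SA, stop immediately instead of continuing to iterate.
 */
static struct sa *
lookup_sa(struct sa *sa, struct tag *tags)
{
	struct tag *t;

	if (sa == NULL)
		return NULL;
	for (t = tags; t != NULL; t = t->next) {
		if (t->id != TAG_OUT_DONE)
			continue;
		if (t->spi == sa->spi)
			return NULL;	/* already processed */
	}
	return sa;
}
```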
Re: snapshot bsd.rd delay after umass at uhub (Lenovo x220 F5521gw WWAN)
On 19/06/16(Sun) 11:26, Marcus MERIGHI wrote:
> When booting bsd.rd, after the line
>
> umass0 at uhub4 port 4 configuration 3 interface 0 "Lenovo F5521gw" rev
> 2.00/0.00 addr 3
>
> there is a long (minutes) delay.
>
> To me it seems bsd.rd these days finds a umass device the so-called WWAN
> interface (GPRS/UMTS/LTE+GPS) provides.

Let me guess, this device doesn't attach as umass(4) with bsd.

I bet the umass(4) driver is generating timeouts. If you don't use the
device you can disable it in your BIOS.

>
> Possibly related:
> http://marc.info/?l=openbsd-tech&m=146619500807823
> "It revamps the way we look up interface descriptors quite a bit. I
> removed the unused code for matching devices based on vendor and product
> ids." (kettenis@)
>
> dmesg of bsd.rd and unmodified self-compiled kernel below. lsusb -v
> output at the very end.
>
> OpenBSD 6.0-beta (RAMDISK_CD) #1982: Sat Jun 18 11:42:13 MDT 2016
> r...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/RAMDISK_CD
> RTC BIOS diagnostic error 80
> real mem = 8451125248 (8059MB)
> avail mem = 8193282048 (7813MB)
> mainbus0 at root
> bios0 at mainbus0: SMBIOS rev. 2.6 @ 0xdae9c000 (64 entries)
> bios0: vendor LENOVO version "8DET72WW (1.42 )" date 02/18/2016
> bios0: LENOVO 4291QQ1
> acpi0 at bios0: rev 2
> acpi0: tables DSDT FACP SLIC SSDT SSDT SSDT HPET APIC MCFG ECDT ASF! TCPA
> SSDT SSDT DMAR UEFI UEFI UEFI
> acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
> cpu0 at mainbus0: apid 0 (boot processor)
> cpu0: Intel(R) Core(TM) i7-2620M CPU @ 2.70GHz, 797.54 MHz
> cpu0:
> FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,NXE,LONG,LAHF,PERF,ITSC,SENSOR,ARAT
> cpu0: 256KB 64b/line 8-way L2 cache
> cpu0: apic clock running at 99MHz
> cpu0: mwait min=64, max=64, C-substates=0.2.1.1.2, IBE
> cpu at mainbus0: not configured
> cpu at mainbus0: not configured
> cpu at mainbus0: not configured
> ioapic0 at mainbus0: apid 2 pa 0xfec0, version 20, 24 pins
> acpiec0 at acpi0
> acpiprt0 at acpi0: bus 0 (PCI0)
> acpiprt1 at acpi0: bus -1 (PEG_)
> acpiprt2 at acpi0: bus 2 (EXP1)
> acpiprt3 at acpi0: bus 3 (EXP2)
> acpiprt4 at acpi0: bus 5 (EXP4)
> acpiprt5 at acpi0: bus 13 (EXP5)
> acpiprt6 at acpi0: bus 14 (EXP7)
> acpicpu at acpi0 not configured
> acpipwrres at acpi0 not configured
> acpitz at acpi0 not configured
> "PNP0C0D" at acpi0 not configured
> "PNP0C0E" at acpi0 not configured
> "PNP0303" at acpi0 not configured
> "LEN0020" at acpi0 not configured
> "PNP0C0A" at acpi0 not configured
> "ACPI0003" at acpi0 not configured
> "LEN0068" at acpi0 not configured
> "PNP0C14" at acpi0 not configured
> "PNP0C14" at acpi0 not configured
> pci0 at mainbus0 bus 0
> pchb0 at pci0 dev 0 function 0 "Intel Core 2G Host" rev 0x09
> vga1 at pci0 dev 2 function 0 "Intel HD Graphics 3000" rev 0x09
> wsdisplay1 at vga1 mux 1: console (80x25, vt100 emulation)
> "Intel 6 Series MEI" rev 0x04 at pci0 dev 22 function 0 not configured
> em0 at pci0 dev 25 function 0 "Intel 82579LM" rev 0x04: msi, address
> f0:de:f1:8f:84:ac
> ehci0 at pci0 dev 26 function 0 "Intel 6 Series USB" rev 0x04: apic 2 int 16
> usb0 at ehci0: USB revision 2.0
> uhub0 at usb0 "Intel EHCI root hub" rev 2.00/1.00 addr 1
> "Intel 6 Series HD Audio" rev 0x04 at pci0 dev 27 function 0 not configured
> ppb0 at pci0 dev 28 function 0 "Intel 6 Series PCIE" rev 0xb4: msi
> pci1 at ppb0 bus 2
> ppb1 at pci0 dev 28 function 1 "Intel 6 Series PCIE" rev 0xb4: msi
> pci2 at ppb1 bus 3
> iwn0 at pci2 dev 0 function 0 "Intel Centrino Ultimate-N 6300" rev 0x35: msi,
> MIMO 3T3R, MoW, address 00:24:d7:f0:ea:90
> ppb2 at pci0 dev 28 function 3 "Intel 6 Series PCIE" rev 0xb4: msi
> pci3 at ppb2 bus 5
> ppb3 at pci0 dev 28 function 4 "Intel 6 Series PCIE" rev 0xb4: msi
> pci4 at ppb3 bus 13
> sdhc0 at pci4 dev 0 function 0 "Ricoh 5U822 SD/MMC" rev 0x07: apic 2 int 16
> sdhc0: SDHC 3.0, 50 MHz base clock
> sdmmc0 at sdhc0: 4-bit, sd high-speed, mmc high-speed, dma
> ppb4 at pci0 dev 28 function 6 "Intel 6 Series PCIE" rev 0xb4: msi
> pci5 at ppb4 bus 14
> xhci0 at pci5 dev 0 function 0 "NEC xHCI" rev 0x04: msi
> usb1 at xhci0: USB revision 3.0
> uhub1 at usb1 "NEC xHCI root hub" rev 3.00/1.00 addr 1
> ehci1 at pci0 dev 29 function 0 "Intel 6 Series USB" rev 0x04: apic 2 int 23
> usb2 at ehci1: USB revision 2.0
> uhub2 at usb2 "Intel EHCI root hub" rev 2.00/1.00 addr 1
> "Intel QM67 LPC" rev 0x04 at pci0 dev 31 function 0 not configured
> ahci0 at pci0 dev 31 function 2 "Intel 6 Series AHCI" rev 0x04: msi, AHCI 1.3
> ahci0: port 0: 6.0Gb/s
> scsibus0 at ahci0: 32 targets
> sd0 at scsibus0 targ 0 lun 0: SCSI3 0/direct
> fixed naa.5001b44e1d7ef244
> sd0: 228936MB, 512 bytes/sector, 468862128 sectors, thin
> "Intel 6 Series SMBus" rev 0x04 at pci0 dev 31 function 3 not configured
> isa0 at mainbus0
> pckbc0 at isa0 port
Re: snapshot bsd.rd delay after umass at uhub (Lenovo x220 F5521gw WWAN)
On 19/06/16(Sun) 14:23, Marcus MERIGHI wrote:
> m...@openbsd.org (Martin Pieuchot), 2016.06.19 (Sun) 13:28 (CEST):
> > On 19/06/16(Sun) 11:26, Marcus MERIGHI wrote:
> > > When booting bsd.rd, after the line
> > >
> > > umass0 at uhub4 port 4 configuration 3 interface 0 "Lenovo F5521gw" rev
> > > 2.00/0.00 addr 3
> > >
> > > there is a long (minutes) delay.
> > >
> > > To me it seems bsd.rd these days finds a umass device the so-called WWAN
> > > interface (GPRS/UMTS/LTE+GPS) provides.
> >
> > Let me guess, this device doesn't attach as umass(4) with bsd.
>
> True.
> But as opposed to Jun 2 snapshot I now get a ugen0 device (apart from
> three ucom(4)s).
>
>
> > I bet the umass(4) driver is generating timeouts. If you don't use the
> > device you can disable it in your BIOS.
>
> True.
> Disabling is bad for finding out whether umb(4) would support the
> device...

Well, if you want to debug it, build your own bsd.rd with UMASS_DEBUG
and/or SCSI_DEBUG to see which command is timing out.