from:"Martin Pieuchot"

Re: amd64: stuck in netlock

2018-02-23 Thread Martin Pieuchot

On 29/01/18(Mon) 21:25, Artturi Alm wrote:
> On Mon, Jan 29, 2018 at 08:03:38PM +0100, Martin Pieuchot wrote:
> > On 29/01/18(Mon) 20:38, Artturi Alm wrote:
> > > On Mon, Jan 29, 2018 at 10:42:20AM +0100, Martin Pieuchot wrote:
> > > > Hello Artturi,
> > > > 
> > > > On 28/01/18(Sun) 09:08, Artturi Alm wrote:
> > > > > >Synopsis:stuck in netlock
> > > > > >Category:amd64
> > > > > >Environment:
> > > > >   System  : OpenBSD 6.2
> > > > >   Details : OpenBSD 6.2-current (GENERIC.MP) #333: Sun Jan  7 
> > > > > 09:13:00 MST 2018
> > > > >
> > > > > dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> > > > > 
> > > > >   Architecture: OpenBSD.amd64
> > > > >   Machine : amd64
> > > > > >Description:
> > > > >   processes getting stuck w/STATE=netlock, kill has no effect.
> > > > > >How-To-Repeat:
> > > > >   using the desktop normally, until trying to restart chrome ends
> > > > >   up failing.
> > > > 
> > > > What do you mean with "using the desktop normally"?  Which applications
> > > > are you using?  Which browser plugins?  Can you find out the minimum
> > > > setup to reproduce this deadlock?
> > > > 
> > > > >   I've had this happen to me atleast twice in the last few of 
> > > > > weeks.
> > > > 
> > > > Do you know how to reproduce it easily?
> > > > 
> > > 
> > > this time i had less than 10tabs open, so i guess it can be narrowed
> > > down even further.
> > > 
> > > > >   At first time i noticed how trying to launch chrome did lock up
> > > > >   all the other processes in netlock, and "pkill chrome" did allow
> > > > >   the system to recover, i was unable to figure out what was wrong
> > > > >   and rebooting did make everything work again, while ie.
> > > > >   removing ~/.cache & ~/.config did not.
> > > > 
> > > > So the deadlock is related to your chrome usage?
> > > > 
> > > 
> > > now it does feel like so. i'll upgrade tonight.
> > > 
> > > > >   long before running the "ps cl" below, i had already killed all
> > > > >   the xterm-windows those processes were in. cwm(1) was unable to
> > > > >   kill some of those, but xkill did not.
> > > > 
> > > > Well, killing processes waiting for the 'netlock' won't help.  What has
> > > > to be found is which process is holding it.  For that we need the full ps
> > > > output, including kernel and userland threads.
> > > > > 
> > > > >   after exiting X w/ctrl+alt+backspace(iirc?) i didn't get back to
> > > > >   $-prompt, and ^T did show xauth stuck in netlock..
> > > > >   i guess it's obvious where it was heading; so i got pics of
> > > > >   "# reboot -nq" failing because stuck in the fckng netlock -_-
> > > > > 
> > > > >   i do have ddb.{panic,console,log}=1, but
> > > > >   "# sysctl ddb.trigger=1" ==
> > > > >   "sysctl: ddb.trigger: Operation not supported by device"
> > > > 
> > > > Not having DDB access will limit the debugging experience.  Are you sure
> > > > you tried to enter it on your console?
> > > > 
> > > 
> > > so this requires ttyC0, right?
> > > this time it was ifconfig in [netlock], that prevented using ttyC0.
> > > i got there from X by running "virsh shutdown
> > > i guess it emulates what pressing actual power button would(acpi?).
> > > 
> > > > >   ?? so i had no option but "virsh reset "...
> > > > 
> > > > Did you try top(1)?  What were the kernel processes doing?
> > > 
> > > see below, if "top -bCHS -d 1 999" should do.
> > > anything else i could do? anyway, thanks in advance:)
> > 
> > This is where the problem comes from:
> > 
> > > 33315   443734  -60  141M  102M idle  viowait   0:00  0.00% 
> > > chrome: 
> > 
> > I don't understand how chrome can end up sleeping in vio_ioctl

Re: uvideo0: could not open VS pipe: INVAL

2018-02-26 Thread Martin Pieuchot

On 26/02/18(Mon) 15:20, C. wrote:
> Category:
>   Webcam / Video
> 
> Environment:
> System  : OpenBSD 6.2
> Details : OpenBSD 6.2 (GENERIC.MP) #5: Fri Feb  2 23:02:19 CET 
> 2018
>  
> r...@syspatch-62-amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> Architecture: OpenBSD.amd64
> Machine : amd64
> 
> Description:
>   The integrated webcam (Lenovo Thinkpad T470, AzureWave Integrated 
> Webcam) does not work. 
>   It's neither working in Firefox, nor in Chromium, nor in VLC, nor via 
> fswebcam or luvcview.
>   The firmware for the uvideo driver has been installed. 

There's currently no support for isochronous transfers on xhci(4).  Some
code is there but it has to be debugged and enabled.

Re: gdb hangs on exiting a running program

2018-03-19 Thread Martin Pieuchot

Thanks for the report.

On 19/03/18(Mon) 09:49, Theo Buehler wrote:
> This is a regression that came with the TOCTOU race fix in kern_sig.c 1.216:
> https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/kern/kern_sig.c#rev1.216
> [...] 
> Now gdb just hangs there and does nothing instead of exiting as
> expected.  It doesn't react to ^C but one can easily kill it with
> ^Z and then kill %%.

What happens is that the program stays stopped.  Or, to be more precise,
it re-enters the SSTOP'd state after ptrace(PT_KILL...) has been issued
by gdb(1).
The problem comes from the fact that CURSIG() is now called twice in
userret().  That means that issignal() is also called twice.  The fix
is to treat SIGKILL as special if the process is currently traced.

That's also what NetBSD is doing, so I synced our comment with theirs,
without a typo.

ok?

Index: kern/kern_sig.c
===
RCS file: /cvs/src/sys/kern/kern_sig.c,v
retrieving revision 1.216
diff -u -p -r1.216 kern_sig.c
--- kern/kern_sig.c 26 Feb 2018 13:33:25 -  1.216
+++ kern/kern_sig.c 19 Mar 2018 11:25:34 -
@@ -1167,11 +1167,13 @@ issignal(struct proc *p)
(pr->ps_flags & PS_TRACED) == 0)
continue;
 
-   if ((pr->ps_flags & (PS_TRACED | PS_PPWAIT)) == PS_TRACED) {
-   /*
-* If traced, always stop, and stay
-* stopped until released by the debugger.
-*/
+   /*
+* If traced, always stop, and stay stopped until released
+* by the debugger.  If our parent process is waiting for
+* us, don't hang as we could deadlock.
+*/
+   if (((pr->ps_flags & (PS_TRACED | PS_PPWAIT)) == PS_TRACED) &&
+   signum != SIGKILL) {
p->p_xstat = signum;
 
if (dolock)
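
The condition changed by the diff above can be tried outside the kernel.
What follows is only a minimal userland sketch: the DEMO_* flag values and
the helper name should_stop_for_debugger() are made up for illustration,
and only the boolean expression mirrors the diff.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Hypothetical flag values, for illustration only (not the kernel's). */
#define DEMO_PS_TRACED	0x1u
#define DEMO_PS_PPWAIT	0x2u

/*
 * Mirrors the condition in the diff: a traced process whose parent is
 * not waiting for it stops for the debugger, unless the signal is
 * SIGKILL.
 */
static bool
should_stop_for_debugger(unsigned int ps_flags, int signum)
{
	return ((ps_flags & (DEMO_PS_TRACED | DEMO_PS_PPWAIT)) ==
	    DEMO_PS_TRACED) && signum != SIGKILL;
}
```

With this check a traced process still stops on, say, SIGINT, but a
SIGKILL falls through to normal delivery instead of parking the process
in the stopped state again.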

Re: vmctl stop + tcpdump results in netlock panic

2018-03-19 Thread Martin Pieuchot

On 19/03/18(Mon) 15:58, Stefan Sperling wrote:
> The following will trigger "panic: rw_enter: netlock locking against myself":

The solution is to call bpfdetach() outside of the NET_LOCK(); it should
not need it.  The diff below does that.  Does it work for you?

Index: net/if.c
===
RCS file: /cvs/src/sys/net/if.c,v
retrieving revision 1.548
diff -u -p -r1.548 if.c
--- net/if.c2 Mar 2018 15:52:11 -   1.548
+++ net/if.c19 Mar 2018 15:22:17 -
@@ -1028,6 +1028,10 @@ if_detach(struct ifnet *ifp)
/* Other CPUs must not have a reference before we start destroying. */
if_idxmap_remove(ifp);
 
+#if NBPFILTER > 0
+   bpfdetach(ifp);
+#endif
+
NET_LOCK();
s = splnet();
ifp->if_qstart = if_detached_qstart;
@@ -1041,9 +1045,6 @@ if_detach(struct ifnet *ifp)
/* Remove the link state task */
task_del(net_tq(ifp->if_index), &ifp->if_linkstatetask);
 
-#if NBPFILTER > 0
-   bpfdetach(ifp);
-#endif
rti_delete(ifp);
 #if NETHER > 0 && defined(NFSCLIENT)
if (ifp->if_index == revarp_ifidx)
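
Why moving the call helps can be shown with a toy model.  This is only a
sketch under assumptions: the toy_* names are hypothetical, the lock is a
single-threaded stand-in for a non-recursive rwlock, and the helper is
assumed to end up taking the same lock, which is what the "locking
against myself" panic suggests.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy non-recursive lock; "owner" is a hypothetical thread id, 0 if free. */
struct toy_lock {
	int owner;
};

/* Returns false where the kernel would panic "locking against myself". */
static bool
toy_enter(struct toy_lock *l, int self)
{
	if (l->owner == self)
		return false;
	assert(l->owner == 0);	/* single-threaded demo: never contended */
	l->owner = self;
	return true;
}

static void
toy_exit(struct toy_lock *l, int self)
{
	assert(l->owner == self);
	l->owner = 0;
}

/* Stand-in for a detach routine that takes the lock itself. */
static bool
toy_detach_helper(struct toy_lock *l, int self)
{
	if (!toy_enter(l, self))
		return false;
	toy_exit(l, self);
	return true;
}

/* Old ordering: helper called with the lock already held. */
static bool
detach_inside(struct toy_lock *l, int self)
{
	bool ok;

	toy_enter(l, self);
	ok = toy_detach_helper(l, self);
	toy_exit(l, self);
	return ok;
}

/* New ordering: helper called before the lock is taken. */
static bool
detach_outside(struct toy_lock *l, int self)
{
	bool ok = toy_detach_helper(l, self);

	toy_enter(l, self);
	toy_exit(l, self);
	return ok;
}
```

In the model detach_inside() reports the self-lock while detach_outside()
succeeds, which is the reordering the diff performs in if_detach().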

Re: gdb hangs on exiting a running program

2018-03-20 Thread Martin Pieuchot

On 19/03/18(Mon) 15:38, Visa Hankala wrote:
> On Mon, Mar 19, 2018 at 12:27:10PM +0100, Martin Pieuchot wrote:
> > Thanks for the report.
> > 
> > On 19/03/18(Mon) 09:49, Theo Buehler wrote:
> > > This is a regression that came with the TOCTOU race fix in kern_sig.c 
> > > 1.216:
> > > https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/kern/kern_sig.c#rev1.216
> > > [...] 
> > > Now gdb just hangs there and does nothing instead of exiting as
> > > expected.  It doesn't react to ^C but one can easily kill it with
> > > ^Z and then kill %%.
> > 
> > What happens is that the program stays stopped.  Or, to be more precise,
> > it re-enters the SSTOP'd state after ptrace(PT_KILL...) has been issued
> > by gdb(1).
> > The problem comes from the fact that CURSIG() is now called twice in
> > userret().  That means that issignal() is also called twice.  The fix
> > is to treat SIGKILL as special if the process is currently traced.
> 
> As an alternative, the double call of issignal() could be avoided.

I like this.  But I still think that we should handle SIGKILL correctly
in CURSIG().  However your fix seems safer for release.

> CURSIG(p) evaluates to zero if p->p_siglist is zero, or eventually
> issignal(p) returns zero if there are no unmasked signals (that is,
> if (p->p_siglist & ~p->p_sigmask) == 0).

But if the process is being traced issignal() is always called.  Does
that mean that the `PS_TRACED' check is useless because issignal() also
starts with  if (p->p_siglist & ~p->p_sigmask) == 0?

I'd prefer it if you could use an (inline) function with an explicit
name like hassignal() or unmaskedsignal().

> Index: kern/kern_sig.c
> ===
> RCS file: src/sys/kern/kern_sig.c,v
> retrieving revision 1.216
> diff -u -p -r1.216 kern_sig.c
> --- kern/kern_sig.c   26 Feb 2018 13:33:25 -  1.216
> +++ kern/kern_sig.c   19 Mar 2018 15:28:33 -
> @@ -1833,7 +1833,7 @@ userret(struct proc *p)
>   KERNEL_UNLOCK();
>   }
>  
> - if (CURSIG(p) != 0) {
> + if ((p->p_siglist & ~p->p_sigmask) != 0) {
>   KERNEL_LOCK();
>   while ((signum = CURSIG(p)) != 0)
>   postsig(p, signum);
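
For reference, the kind of named helper suggested above could look like
the following.  It is only a userland sketch: struct demo_proc is a
stand-in holding the two fields used by the check, and unmaskedsignal()
is one of the names floated in this thread, not an existing kernel
function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the two fields of struct proc used by the check. */
struct demo_proc {
	uint32_t p_siglist;	/* pending signals, one bit each */
	uint32_t p_sigmask;	/* blocked signals, one bit each */
};

/* Hypothetical helper: true if some pending signal is not masked. */
static inline bool
unmaskedsignal(const struct demo_proc *p)
{
	return (p->p_siglist & ~p->p_sigmask) != 0;
}
```

The helper only looks at the two bit masks, so unlike CURSIG() it has no
side effects and can be evaluated before taking the kernel lock.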

Re: NFS socket use after free during reboot

2018-03-20 Thread Martin Pieuchot

On 08/03/18(Thu) 23:16, Alexander Bluhm wrote:
> Hi,
> 
> When rebooting the NFS client while the NFS file system is actively
> used, the kernel crashes.  The socket at 0xd73c2d9c is filled with
> dead beef, so it is a use after free.  It is an i386 kernel built
> today.

There are multiple known issues with umounting a busy NFS client.
These issues were previously masked by the "remount read-only"
logic at shutdown.

> root@ot2:.../~# find /mount >/dev/null & sleep 5; reboot -q
> [1] 9698
> syncing disks... uvm_fault(0xd72afc7c, 0x1ff11000, 0, 1) -> e
> kernel: page fault trap, code=0
> Stopped at  sblock+0x12:movl0x4(%eax),%eax
> ddb{0}> trace
> sblock(d73c2d9c,d73c2df0,1) at sblock+0x12
> soreceive(d73c2d9c,0,f548d818,f548d884,0,f548d804,0) at soreceive+0x271
> nfs_receive(d7471f7c,f548d87c,f548d884) at nfs_receive+0xb1
> nfs_reply(d7471f7c) at nfs_reply+0x62
> nfs_request(d6d1f3c4,10,f548d970) at nfs_request+0x24d
> nfs_readdirrpc(d6d1f3c4,f548d9f8,d7499120,f548d9ec) at nfs_readdirrpc+0x1dc
> nfs_readdir(f548dab0) at nfs_readdir+0x227
> VOP_READDIR(d6d1f3c4,f548daf8,d7499120,f548daec) at VOP_READDIR+0x42
> sys_getdents(d71372dc,f548db68,f548db60) at sys_getdents+0x118
> syscall() at syscall+0x204
> --- syscall (number 0) ---

Your trace shows two things.  First of all, the userland thread doing
getdents(2) is getting scheduled after nfs_unmount() has freed the
socket.  Secondly, it shows that such a thread has no way to know that
the socket is no longer valid.

My previous attempt to fix this problem, by preventing all reconnects as
soon as nfs_unmount() has been called, only moved the panic to a
different layer because NFS nodes don't have proper locking.

So here's a diff to add locking to NFS nodes.  I couldn't reproduce the
panic above with it.  So I'd be interested if you could try it.  Note
that I didn't do much testing in write mode, so I'd suggest exporting
your "/mount" as 'ro' at first.  Diskless setups are also probably
broken.

Index: nfs/nfs_node.c
===
RCS file: /cvs/src/sys/nfs/nfs_node.c,v
retrieving revision 1.65
diff -u -p -r1.65 nfs_node.c
--- nfs/nfs_node.c  27 Sep 2016 01:37:38 -  1.65
+++ nfs/nfs_node.c  20 Mar 2018 12:31:40 -
@@ -58,8 +58,6 @@
 struct pool nfs_node_pool;
 extern int prtactive;
 
-struct rwlock nfs_hashlock = RWLOCK_INITIALIZER("nfshshlk");
-
 /* XXX */
 extern struct vops nfs_vops;
 
@@ -98,12 +96,10 @@ nfs_nget(struct mount *mnt, nfsfh_t *fh,
nmp = VFSTONFS(mnt);
 
 loop:
-   rw_enter_write(&nfs_hashlock);
find.n_fhp = fh;
find.n_fhsize = fhsize;
np = RBT_FIND(nfs_nodetree, &nmp->nm_ntree, &find);
if (np != NULL) {
-   rw_exit_write(&nfs_hashlock);
vp = NFSTOV(np);
error = vget(vp, LK_EXCLUSIVE, p);
if (error)
@@ -120,25 +116,28 @@ loop:
 * to see if this nfsnode has been added while we did not hold
 * the lock.
 */
-   rw_exit_write(&nfs_hashlock);
error = getnewvnode(VT_NFS, mnt, &nfs_vops, &nvp);
/* note that we don't have this vnode set up completely yet */
-   rw_enter_write(&nfs_hashlock);
if (error) {
*npp = NULL;
-   rw_exit_write(&nfs_hashlock);
return (error);
}
nvp->v_flag |= VLARVAL;
-   np = RBT_FIND(nfs_nodetree, &nmp->nm_ntree, &find);
-   if (np != NULL) {
+   np = pool_get(&nfs_node_pool, PR_WAITOK | PR_ZERO);
+   /*
+* getnewvnode() and pool_get() can sleep, check for race.
+*/
+   if (RBT_FIND(nfs_nodetree, &nmp->nm_ntree, &find) != NULL) {
+   pool_put(&nfs_node_pool, np);
vgone(nvp);
-   rw_exit_write(&nfs_hashlock);
goto loop;
}
 
vp = nvp;
-   np = pool_get(&nfs_node_pool, PR_WAITOK | PR_ZERO);
+#ifdef VFSLCKDEBUG
+   vp->v_flag |= VLOCKSWORK;
+#endif
+   rrw_init_flags(&np->n_lock, "nfsnode", RWL_DUPOK | RWL_IS_VNODE);
vp->v_data = np;
/* we now have an nfsnode on this vnode */
vp->v_flag &= ~VLARVAL;
@@ -159,10 +158,11 @@ loop:
np->n_fhp = &np->n_fh;
bcopy(fh, np->n_fhp, fhsize);
np->n_fhsize = fhsize;
+   /* lock the nfsnode, then put it on the rbtree */
+   rrw_enter(&np->n_lock, RW_WRITE);
np2 = RBT_INSERT(nfs_nodetree, &nmp->nm_ntree, np);
KASSERT(np2 == NULL);
np->n_accstamp = -1;
-   rw_exit(&nfs_hashlock);
*npp = np;
 
return (0);
@@ -201,9 +201,10 @@ nfs_inactive(void *v)
 * Remove the silly file that was rename'd earlier
 */
nfs_vinvalbuf(ap->a_vp, 0, sp->s_cred, curproc);
+   vn_lock(sp->s_dvp, LK_EXCLUSIVE | LK_RETRY, curproc);
nfs_removeit(sp);
crfree(sp->s_cred);
-   vrele(sp->s_dvp);
+
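
The "allocate first, then check for a race" step in nfs_nget() above can
be illustrated in userland.  This is only a sketch with made-up demo_*
names: a small array stands in for the RB tree, and a hook called from
the allocator plays the role of another thread running while we sleep.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_SLOTS 8

struct demo_node { int key; };

static struct demo_node *demo_table[DEMO_SLOTS];

/* Hook run while "sleeping" in the allocator; simulates another thread. */
static void (*demo_sleep_hook)(void);

static struct demo_node *
demo_find(int key)
{
	for (int i = 0; i < DEMO_SLOTS; i++)
		if (demo_table[i] != NULL && demo_table[i]->key == key)
			return demo_table[i];
	return NULL;
}

static void
demo_insert(struct demo_node *np)
{
	for (int i = 0; i < DEMO_SLOTS; i++)
		if (demo_table[i] == NULL) {
			demo_table[i] = np;
			return;
		}
	abort();	/* table full; not expected in the demo */
}

/* Allocation that may "sleep", letting the hook run once. */
static struct demo_node *
demo_alloc(int key)
{
	struct demo_node *np = calloc(1, sizeof(*np));

	if (np == NULL)
		abort();
	np->key = key;
	if (demo_sleep_hook != NULL) {
		void (*hook)(void) = demo_sleep_hook;

		demo_sleep_hook = NULL;
		hook();
	}
	return np;
}

/* Simulated racing thread: inserts key 42 behind our back. */
static void
demo_racer(void)
{
	struct demo_node *np = calloc(1, sizeof(*np));

	if (np == NULL)
		abort();
	np->key = 42;
	demo_insert(np);
}

static struct demo_node *
demo_get(int key)
{
	struct demo_node *np;

	for (;;) {
		if ((np = demo_find(key)) != NULL)
			return np;
		np = demo_alloc(key);		/* may sleep */
		if (demo_find(key) != NULL) {	/* lost the race */
			free(np);
			continue;
		}
		demo_insert(np);
		return np;
	}
}
```

Setting demo_sleep_hook to demo_racer before calling demo_get(42) makes
the second lookup find the racer's node, so the fresh allocation is
dropped and the loop retries, leaving a single node for the key.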

Re: gdb hangs on exiting a running program

2018-03-23 Thread Martin Pieuchot

On 20/03/18(Tue) 17:04, Visa Hankala wrote:
> On Tue, Mar 20, 2018 at 10:45:56AM +0100, Martin Pieuchot wrote:
> > On 19/03/18(Mon) 15:38, Visa Hankala wrote:
> > > On Mon, Mar 19, 2018 at 12:27:10PM +0100, Martin Pieuchot wrote:
> > > > Thanks for the report.
> > > > 
> > > > On 19/03/18(Mon) 09:49, Theo Buehler wrote:
> > > > > This is a regression that came with the TOCTOU race fix in kern_sig.c 
> > > > > 1.216:
> > > > > https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/kern/kern_sig.c#rev1.216
> > > > > [...] 
> > > > > Now gdb just hangs there and does nothing instead of exiting as
> > > > > expected.  It doesn't react to ^C but one can easily kill it with
> > > > > ^Z and then kill %%.
> > > > 
> > > > What happens is that the program stays stopped.  Or, to be more precise,
> > > > it re-enters the SSTOP'd state after ptrace(PT_KILL...) has been issued
> > > > by gdb(1).
> > > > The problem comes from the fact that CURSIG() is now called twice in
> > > > userret().  That means that issignal() is also called twice.  The fix
> > > > is to treat SIGKILL as special if the process is currently traced.
> > > 
> > > As an alternative, the double call of issignal() could be avoided.
> > 
> > I like this.  But I still think that we should handle SIGKILL correctly
> > in CURSIG().  However your fix seems safer for release.
> > 
> > > CURSIG(p) evaluates to zero if p->p_siglist is zero, or eventually
> > > issignal(p) returns zero if there are no unmasked signals (that is,
> > > if (p->p_siglist & ~p->p_sigmask) == 0).
> > 
> > But if the process is being traced issignal() is always called.  Does
> > that mean that the `PS_TRACED' check is useless because issignal() also
> > starts with  if (p->p_siglist & ~p->p_sigmask) == 0?
> 
> So it seems. The trace point is taken only if the signal mask allows
> signal delivery.
> 
> > I'd prefer it if you could use an (inline) function with an explicit
> > name like hassignal() or unmaskedsignal().
> 
> Updated diff:

I like it.  If the macro returned the mask of pending signals instead of
a boolean, we could use it in issignal().  But that can be for a later
change.

ok mpi@

> Index: kern/kern_sig.c
> ===
> RCS file: src/sys/kern/kern_sig.c,v
> retrieving revision 1.216
> diff -u -p -r1.216 kern_sig.c
> --- kern/kern_sig.c   26 Feb 2018 13:33:25 -  1.216
> +++ kern/kern_sig.c   20 Mar 2018 16:53:25 -
> @@ -1833,7 +1833,7 @@ userret(struct proc *p)
>   KERNEL_UNLOCK();
>   }
>  
> - if (CURSIG(p) != 0) {
> + if (SIGPENDING(p)) {
>   KERNEL_LOCK();
>   while ((signum = CURSIG(p)) != 0)
>   postsig(p, signum);
> Index: sys/signalvar.h
> ===
> RCS file: src/sys/sys/signalvar.h,v
> retrieving revision 1.29
> diff -u -p -r1.29 signalvar.h
> --- sys/signalvar.h   26 Feb 2018 13:33:25 -  1.29
> +++ sys/signalvar.h   20 Mar 2018 16:53:25 -
> @@ -66,6 +66,11 @@ struct sigacts {
>  #define  SIG_HOLD(void (*)(int))3
>  
>  /*
> + * Check if process p has an unmasked signal pending.
> + */
> +#define  SIGPENDING(p)   (((p)->p_siglist & ~(p)->p_sigmask) != 0)
> +
> +/*
>   * Determine signal that should be delivered to process p, the current
>   * process, 0 if none.  If there is a pending stop signal with default
>   * action, the process stops in issignal().
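
To make the "return the mask instead of a boolean" idea concrete, here is
a minimal userland sketch.  The DEMO_* macros, struct demo_proc and
demo_first_signal() are hypothetical and not part of the diff; the bit
numbering assumes signal N maps to bit N-1, as sigmask() does on BSD.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the two fields of struct proc used by the check. */
struct demo_proc {
	uint32_t p_siglist;	/* pending signals */
	uint32_t p_sigmask;	/* blocked signals */
};

/* Hypothetical variant of SIGPENDING() that yields the mask itself. */
#define DEMO_SIGPENDING_MASK(p)	((p)->p_siglist & ~(p)->p_sigmask)

/* Bit for signal number signum (1-based). */
#define DEMO_SIGBIT(signum)	(UINT32_C(1) << ((signum) - 1))

/* Lowest-numbered deliverable signal, or 0 if there is none. */
static int
demo_first_signal(const struct demo_proc *p)
{
	uint32_t mask = DEMO_SIGPENDING_MASK(p);
	int signum;

	for (signum = 1; signum <= 32; signum++)
		if (mask & DEMO_SIGBIT(signum))
			return signum;
	return 0;
}
```

The mask still works as a truth value in an if (), so a caller like
userret() would not need to change, while a caller that wants to pick a
signal can reuse the same expression.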

Re: NFS socket use after free during reboot

2018-03-26 Thread Martin Pieuchot

On 20/03/18(Tue) 20:09, Alexander Bluhm wrote:
> On Tue, Mar 20, 2018 at 02:24:40PM +0100, Martin Pieuchot wrote:
> > So here's a diff to add locking to NFS nodes.  I couldn't reproduce the
> > panic above with it.  So I'd be interested if you could try it.  Note
> > that I didn't do much testing in write mode, so I'd suggest exporting
> > your "/mount" as 'ro' at first.  Diskless setups are also probably
> > broken.
> 
> This diff fixes my reboot test case.  I was only using a read-only
> mount when I reported the panic.
> 
> But now the /usr/src/regress/sys/ffs/nfs test hangs in "nfsnode".

That's because I forgot to unlock the parent's vnode in nfs_remove(); the
diff below fixes that.

Index: nfs/nfs_node.c
===
RCS file: /cvs/src/sys/nfs/nfs_node.c,v
retrieving revision 1.65
diff -u -p -r1.65 nfs_node.c
--- nfs/nfs_node.c  27 Sep 2016 01:37:38 -  1.65
+++ nfs/nfs_node.c  20 Mar 2018 12:31:40 -
@@ -58,8 +58,6 @@
 struct pool nfs_node_pool;
 extern int prtactive;
 
-struct rwlock nfs_hashlock = RWLOCK_INITIALIZER("nfshshlk");
-
 /* XXX */
 extern struct vops nfs_vops;
 
@@ -98,12 +96,10 @@ nfs_nget(struct mount *mnt, nfsfh_t *fh,
nmp = VFSTONFS(mnt);
 
 loop:
-   rw_enter_write(&nfs_hashlock);
find.n_fhp = fh;
find.n_fhsize = fhsize;
np = RBT_FIND(nfs_nodetree, &nmp->nm_ntree, &find);
if (np != NULL) {
-   rw_exit_write(&nfs_hashlock);
vp = NFSTOV(np);
error = vget(vp, LK_EXCLUSIVE, p);
if (error)
@@ -120,25 +116,28 @@ loop:
 * to see if this nfsnode has been added while we did not hold
 * the lock.
 */
-   rw_exit_write(&nfs_hashlock);
error = getnewvnode(VT_NFS, mnt, &nfs_vops, &nvp);
/* note that we don't have this vnode set up completely yet */
-   rw_enter_write(&nfs_hashlock);
if (error) {
*npp = NULL;
-   rw_exit_write(&nfs_hashlock);
return (error);
}
nvp->v_flag |= VLARVAL;
-   np = RBT_FIND(nfs_nodetree, &nmp->nm_ntree, &find);
-   if (np != NULL) {
+   np = pool_get(&nfs_node_pool, PR_WAITOK | PR_ZERO);
+   /*
+* getnewvnode() and pool_get() can sleep, check for race.
+*/
+   if (RBT_FIND(nfs_nodetree, &nmp->nm_ntree, &find) != NULL) {
+   pool_put(&nfs_node_pool, np);
vgone(nvp);
-   rw_exit_write(&nfs_hashlock);
goto loop;
}
 
vp = nvp;
-   np = pool_get(&nfs_node_pool, PR_WAITOK | PR_ZERO);
+#ifdef VFSLCKDEBUG
+   vp->v_flag |= VLOCKSWORK;
+#endif
+   rrw_init_flags(&np->n_lock, "nfsnode", RWL_DUPOK | RWL_IS_VNODE);
vp->v_data = np;
/* we now have an nfsnode on this vnode */
vp->v_flag &= ~VLARVAL;
@@ -159,10 +158,11 @@ loop:
np->n_fhp = &np->n_fh;
bcopy(fh, np->n_fhp, fhsize);
np->n_fhsize = fhsize;
+   /* lock the nfsnode, then put it on the rbtree */
+   rrw_enter(&np->n_lock, RW_WRITE);
np2 = RBT_INSERT(nfs_nodetree, &nmp->nm_ntree, np);
KASSERT(np2 == NULL);
np->n_accstamp = -1;
-   rw_exit(&nfs_hashlock);
*npp = np;
 
return (0);
@@ -201,9 +201,10 @@ nfs_inactive(void *v)
 * Remove the silly file that was rename'd earlier
 */
nfs_vinvalbuf(ap->a_vp, 0, sp->s_cred, curproc);
+   vn_lock(sp->s_dvp, LK_EXCLUSIVE | LK_RETRY, curproc);
nfs_removeit(sp);
crfree(sp->s_cred);
-   vrele(sp->s_dvp);
+   vput(sp->s_dvp);
free(sp, M_NFSREQ, sizeof(*sp));
}
np->n_flag &= (NMODIFIED | NFLUSHINPROG | NFLUSHWANT);
@@ -239,9 +240,7 @@ nfs_reclaim(void *v)
ap->a_vp);
 #endif
nmp = VFSTONFS(vp->v_mount);
-   rw_enter_write(&nfs_hashlock);
RBT_REMOVE(nfs_nodetree, &nmp->nm_ntree, np);
-   rw_exit_write(&nfs_hashlock);
 
if (np->n_rcred)
crfree(np->n_rcred);
Index: nfs/nfs_vfsops.c
===
RCS file: /cvs/src/sys/nfs/nfs_vfsops.c,v
retrieving revision 1.116
diff -u -p -r1.116 nfs_vfsops.c
--- nfs/nfs_vfsops.c10 Feb 2018 05:24:23 -  1.116
+++ nfs/nfs_vfsops.c20 Mar 2018 10:27:24 -
@@ -178,7 +178,7 @@ nfs_statfs(struct mount *mp, struct stat
copy_statfs_info(sbp, mp);
m_freem(info.nmi_mrep);
 nfsmout: 
-   vrele(vp);
+   vput(vp);
crfree(cred);
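
The vrele() to vput() changes matter once the nodes really are locked.  A
toy model makes the difference visible; this is a sketch with made-up
toy_* names, assuming only the usual semantics that vput() unlocks the
vnode and then drops the reference, while vrele() only drops the
reference.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy vnode: a use count and a single lock bit. */
struct toy_vnode {
	int usecount;
	bool locked;
};

/* Drop a reference; the lock state is left untouched. */
static void
toy_vrele(struct toy_vnode *vp)
{
	assert(vp->usecount > 0);
	vp->usecount--;
}

/* Unlock, then drop the reference. */
static void
toy_vput(struct toy_vnode *vp)
{
	assert(vp->locked);
	vp->locked = false;
	toy_vrele(vp);
}
```

Calling toy_vrele() on a locked vnode leaves it locked with no reference
left, so the next locker would sleep forever, which matches a test
hanging in "nfsnode".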

modesetting driver broke video(1)

2018-03-29 Thread Martin Pieuchot

Since we switched to the modesetting driver by default, the supported
XvImage formats no longer include YUY2 or UYVY, which are expected by
video(1).  Using the following Xorg.conf makes video(1) work again.

Section "Device"
  Identifier "Device0"
  Driver "intel"
EndSection

Attached are the outputs of xvinfo(1) with the modesetting driver and
the intel driver.
X-Video Extension version 2.2
screen #0
  Adaptor #0: "GLAMOR Textured Video"
number of ports: 16
port base: 96
operations supported: PutImage 
supported visuals:
  depth 24, visualID 0x21
number of attributes: 5
  "XV_BRIGHTNESS" (range -1000 to 1000)
  client settable attribute
  client gettable attribute (current value is 0)
  "XV_CONTRAST" (range -1000 to 1000)
  client settable attribute
  client gettable attribute (current value is 0)
  "XV_SATURATION" (range -1000 to 1000)
  client settable attribute
  client gettable attribute (current value is 0)
  "XV_HUE" (range -1000 to 1000)
  client settable attribute
  client gettable attribute (current value is 0)
  "XV_COLORSPACE" (range 0 to 1)
  client settable attribute
  client gettable attribute (current value is 0)
maximum XvImage size: 8192 x 8192
Number of image formats: 2
  id: 0x32315659 (YV12)
guid: 59563132--0010-8000-00aa00389b71
bits per pixel: 12
number of planes: 3
type: YUV (planar)
  id: 0x30323449 (I420)
guid: 49343230--0010-8000-00aa00389b71
bits per pixel: 12
number of planes: 3
type: YUV (planar)
X-Video Extension version 2.2
screen #0
  Adaptor #0: "Intel(R) Textured Video"
number of ports: 16
port base: 75
operations supported: PutImage 
supported visuals:
  depth 24, visualID 0x20
number of attributes: 1
  "XV_SYNC_TO_VBLANK" (range -1 to 1)
  client settable attribute
  client gettable attribute (current value is 1)
maximum XvImage size: 16384 x 16384
Number of image formats: 5
  id: 0x32595559 (YUY2)
guid: 59555932--0010-8000-00aa00389b71
bits per pixel: 16
number of planes: 1
type: YUV (packed)
  id: 0x32315659 (YV12)
guid: 59563132--0010-8000-00aa00389b71
bits per pixel: 12
number of planes: 3
type: YUV (planar)
  id: 0x30323449 (I420)
guid: 49343230--0010-8000-00aa00389b71
bits per pixel: 12
number of planes: 3
type: YUV (planar)
  id: 0x59565955 (UYVY)
guid: 55595659--0010-8000-00aa00389b71
bits per pixel: 16
number of planes: 1
type: YUV (packed)
  id: 0x434d5658 (XVMC)
guid: 58564d43--0010-8000-00aa00389b71
bits per pixel: 12
number of planes: 3
type: YUV (planar)
  Adaptor #1: "Intel(R) Video Sprite"
number of ports: 1
port base: 91
operations supported: PutImage 
supported visuals:
  depth 24, visualID 0x20
number of attributes: 2
  "XV_COLORKEY" (range 0 to 16777215)
  client settable attribute
  client gettable attribute (current value is 66046)
  "XV_ALWAYS_ON_TOP" (range 0 to 1)
  client settable attribute
  client gettable attribute (current value is 0)
maximum XvImage size: 8192 x 8192
Number of image formats: 3
  id: 0x32595559 (YUY2)
guid: 59555932--0010-8000-00aa00389b71
bits per pixel: 16
number of planes: 1
type: YUV (packed)
  id: 0x59565955 (UYVY)
guid: 55595659--0010-8000-00aa00389b71
bits per pixel: 16
number of planes: 1
type: YUV (packed)
  id: 0x18424752
guid: 50415353-5448-524f-5547-485247423234
bits per pixel: 32
number of planes: 1
type: RGB (packed)
depth: 24
red, green, blue masks: 0xff, 0xff00, 0xff
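
The image format ids in the xvinfo(1) output are FourCC codes: four ASCII
characters packed into a 32-bit value, least significant byte first.  A
minimal sketch to decode them (the helper name fourcc_to_str() is made up
for illustration):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Decode a FourCC id into a 4-character, NUL-terminated string. */
static void
fourcc_to_str(uint32_t id, char out[5])
{
	for (int i = 0; i < 4; i++)
		out[i] = (char)((id >> (8 * i)) & 0xff);
	out[4] = '\0';
}
```

For example, 0x32595559 decodes to "YUY2" and 0x59565955 to "UYVY", the
two packed formats that only show up in the intel driver's list.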

Re: Kernel Panic on 6.2 amd64 when run0 RT3070 based device is attached during boot

2018-04-02 Thread Martin Pieuchot

gh-speed, mmc high-speed, dma
> pchb2 at pci0 dev 24 function 0 "AMD AMD64 16h Link Cfg" rev 0x00
> pchb3 at pci0 dev 24 function 1 "AMD AMD64 16h Address Map" rev 0x00
> pchb4 at pci0 dev 24 function 2 "AMD AMD64 16h DRAM Cfg" rev 0x00
> km0 at pci0 dev 24 function 3 "AMD AMD64 16h Misc Cfg" rev 0x00
> pchb5 at pci0 dev 24 function 4 "AMD AMD64 16h CPU Power" rev 0x00
> pchb6 at pci0 dev 24 function 5 vendor "AMD", unknown product 0x1535 rev
> 0x00
> usb3 at ohci0: USB revision 1.0
> uhub3 at usb3 configuration 1 interface 0 "AMD OHCI root hub" rev
> 1.00/1.00 addr 1
> usb4 at ohci1: USB revision 1.0
> uhub4 at usb4 configuration 1 interface 0 "AMD OHCI root hub" rev
> 1.00/1.00 addr 1
> isa0 at pcib0
> isadma0 at isa0
> com0 at isa0 port 0x3f8/8 irq 4: ns16550a, 16 byte fifo
> pckbc0 at isa0 port 0x60/5 irq 1 irq 12
> pcppi0 at isa0 port 0x61
> spkr0 at pcppi0
> vmm0 at mainbus0: SVM/RVI
> sdmmc0: can't enable card
> axen0 at uhub0 port 1 configuration 1 interface 0 "ASIX Elec. Corp.
> AX88179" rev 3.00/1.00 addr 2
> axen0: AX88179, address xx:xx:xx:xx:xx:xx
> rgephy0 at axen0 phy 3: RTL8169S/8110S/8211 PHY, rev. 5
> uhidev0 at uhub3 port 1 configuration 1 interface 0 "Dell Dell Smart
> Card Reader Keyboard" rev 2.00/1.00 addr 2
> uhidev0: iclass 3/1
> ukbd0 at uhidev0: 8 variable keys, 6 key codes
> wskbd0 at ukbd0: console keyboard, using wsdisplay0
> ugen0 at uhub3 port 1 configuration 1 "Dell Dell Smart Card Reader
> Keyboard" rev 2.00/1.00 addr 2
> vscsi0 at root
> scsibus2 at vscsi0: 256 targets
> softraid0 at root
> scsibus3 at softraid0: 256 targets
> softraid0: sd1 was not shutdown properly
> sd1 at scsibus3 targ 1 lun 0:  SCSI2 0/direct fixed
> sd1: 476937MB, 512 bytes/sector, 976767473 sectors
> root on sd1a (9990ff6713f15d12.a) swap on sd1b dump on sd1b
> WARNING: / was not properly unmounted
> --
> 
> Denis
> 
> On 1/25/2018 5:34 PM, Martin Pieuchot wrote:
> > Hello Denis,
> > 
> > On 25/01/18(Thu) 17:16, Denis wrote:
> >> Finally catch kernel panic in the middle of run adapter work.
> > 
> > Could you please set ddb.panic to 1?
> > 
> > It's hard to figure out what's wrong in your reports because as soon as
> > your machine tries to reboot it panics, panics and panics again.  So we
> > can't tell what is the first (real) problem.
> > 
> > And please stop cross posting.  bugs@ is enough for such problems :)
> > 
> > Thanks,
> > Martin
> >

Re: amd64/machdep knob: forceukb forcing wrong encoding.

2018-04-10 Thread Martin Pieuchot

On 05/02/18(Mon) 18:31, Artturi Alm wrote:
> On Mon, Feb 05, 2018 at 02:51:48PM +0100, Martin Pieuchot wrote:
> > On 04/02/18(Sun) 11:28, Artturi Alm wrote:
> > > Hi,
> > > 
> > > machdep.forceukbd=1 feels broken to me, as i use "sv", and it doesn't 
> > > respect
> > > /etc/kbdtype.
> > 
> > If you unplug/replug your USB keyboard after having booted does it
> > respect /etc/kbdtype?
> 
> Yes, no issues when machdep.forceukbd=0, and i do that unplug/replug-dance
> "in software" several times a day, as i use the same mouse+keyboard
> on my VM for games.

Diff below fixes the problem.  Turns out that the layout configured with
kbd(8) is stored in the mux.  But the value of the mux wasn't read for
console keyboard since it is supposed to attach first.

Index: dev/wscons/wskbd.c
===
RCS file: /cvs/src/sys/dev/wscons/wskbd.c,v
retrieving revision 1.90
diff -u -p -r1.90 wskbd.c
--- dev/wscons/wskbd.c  19 Feb 2018 08:59:52 -  1.90
+++ dev/wscons/wskbd.c  27 Mar 2018 11:35:51 -
@@ -373,21 +373,11 @@ wskbd_attach(struct device *parent, stru
 #endif
 #if NWSMUX > 0
mux = sc->sc_base.me_dv.dv_cfdata->wskbddevcf_mux;
-   if (ap->console) {
-   /* Ignore mux for console; it always goes to the console mux. */
-   /* printf(" (mux %d ignored for console)", mux); */
-   mux = -1;
-   }
if (mux >= 0) {
printf(" mux %d", mux);
wsmux_sc = wsmux_getmux(mux);
} else
wsmux_sc = NULL;
-#else
-#if 0  /* not worth keeping, especially since the default value is not -1... */
-   if (sc->sc_base.me_dv.dv_cfdata->wskbddevcf_mux >= 0)
-   printf(" (mux ignored)");
-#endif
 #endif /* NWSMUX > 0 */
 
if (ap->console) {
@@ -462,7 +452,8 @@ wskbd_attach(struct device *parent, stru
printf("\n");
 
 #if NWSMUX > 0
-   if (wsmux_sc != NULL) {
+   /* Ignore mux for console; it always goes to the console mux. */
+   if (wsmux_sc != NULL && ap->console == 0) {
error = wsmux_attach_sc(wsmux_sc, &sc->sc_base);
if (error)
printf("%s: attach error=%d\n",
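
The behaviour change can be summarised with a small model.  This is only
a sketch: struct mux_decision and the decide_*() helpers are hypothetical
and assume the mux lookup succeeds whenever the configured index is not
negative.

```c
#include <assert.h>
#include <stdbool.h>

/* What wskbd_attach() ends up doing with the configured mux. */
struct mux_decision {
	bool lookup;	/* mux looked up, so its layout can be read */
	bool attach;	/* keyboard attached to that mux */
};

/* Before the diff: a console keyboard forgets its mux entirely. */
static struct mux_decision
decide_old(bool console, int mux)
{
	struct mux_decision d;

	if (console)
		mux = -1;
	d.lookup = (mux >= 0);
	d.attach = d.lookup;
	return d;
}

/* After the diff: the mux is still looked up, only the attach is skipped. */
static struct mux_decision
decide_new(bool console, int mux)
{
	struct mux_decision d;

	d.lookup = (mux >= 0);
	d.attach = d.lookup && !console;
	return d;
}
```

For a console keyboard configured on mux 1, decide_old() yields no lookup
at all, while decide_new() keeps the lookup and only skips the attach,
which is what lets the layout stored in the mux be honoured.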

Re: amd64/machdep knob: forceukb forcing wrong encoding.

2018-04-10 Thread Martin Pieuchot

On 10/04/18(Tue) 11:57, Mark Kettenis wrote:
> > Date: Tue, 27 Mar 2018 13:40:02 +0200
> > From: Martin Pieuchot 
> > 
> > On 05/02/18(Mon) 18:31, Artturi Alm wrote:
> > > On Mon, Feb 05, 2018 at 02:51:48PM +0100, Martin Pieuchot wrote:
> > > > On 04/02/18(Sun) 11:28, Artturi Alm wrote:
> > > > > Hi,
> > > > > 
> > > > > machdep.forceukbd=1 feels broken to me, as i use "sv", and it doesn't 
> > > > > respect
> > > > > /etc/kbdtype.
> > > > 
> > > > If you unplug/replug your USB keyboard after having booted does it
> > > > respect /etc/kbdtype?
> > > 
> > > Yes, no issues when machdep.forceukbd=0, and i do that unplug/replug-dance
> > > "in software" several times a day, as i use the same mouse+keyboard
> > > on my VM for games.
> > 
> > Diff below fixes the problem.  Turns out that the layout configured with
> > kbd(8) is stored in the mux.  But the value of the mux wasn't read for
> > console keyboard since it is supposed to attach first.
> > 
> > Index: dev/wscons/wskbd.c
> > ===
> > RCS file: /cvs/src/sys/dev/wscons/wskbd.c,v
> > retrieving revision 1.90
> > diff -u -p -r1.90 wskbd.c
> > --- dev/wscons/wskbd.c  19 Feb 2018 08:59:52 -  1.90
> > +++ dev/wscons/wskbd.c  27 Mar 2018 11:35:51 -
> > @@ -373,21 +373,11 @@ wskbd_attach(struct device *parent, stru
> >  #endif
> >  #if NWSMUX > 0
> > mux = sc->sc_base.me_dv.dv_cfdata->wskbddevcf_mux;
> > -   if (ap->console) {
> > -   /* Ignore mux for console; it always goes to the console mux. */
> > -   /* printf(" (mux %d ignored for console)", mux); */
> > -   mux = -1;
> > -   }
> > if (mux >= 0) {
> > printf(" mux %d", mux);
> 
> Should this printf be skipped for the console?

I don't mind.  If we go this way, here's a diff.

Index: dev/wscons/wskbd.c
===
RCS file: /cvs/src/sys/dev/wscons/wskbd.c,v
retrieving revision 1.90
diff -u -p -r1.90 wskbd.c
--- dev/wscons/wskbd.c  19 Feb 2018 08:59:52 -  1.90
+++ dev/wscons/wskbd.c  10 Apr 2018 10:37:53 -
@@ -362,7 +362,7 @@ wskbd_attach(struct device *parent, stru
struct wskbddev_attach_args *ap = aux;
kbd_t layout;
 #if NWSMUX > 0
-   struct wsmux_softc *wsmux_sc;
+   struct wsmux_softc *wsmux_sc = NULL;
int mux, error;
 #endif
 
@@ -373,21 +373,8 @@ wskbd_attach(struct device *parent, stru
 #endif
 #if NWSMUX > 0
mux = sc->sc_base.me_dv.dv_cfdata->wskbddevcf_mux;
-   if (ap->console) {
-   /* Ignore mux for console; it always goes to the console mux. */
-   /* printf(" (mux %d ignored for console)", mux); */
-   mux = -1;
-   }
-   if (mux >= 0) {
-   printf(" mux %d", mux);
+   if (mux >= 0)
wsmux_sc = wsmux_getmux(mux);
-   } else
-   wsmux_sc = NULL;
-#else
-#if 0  /* not worth keeping, especially since the default value is not -1... */
-   if (sc->sc_base.me_dv.dv_cfdata->wskbddevcf_mux >= 0)
-   printf(" (mux ignored)");
-#endif
 #endif /* NWSMUX > 0 */
 
if (ap->console) {
@@ -459,14 +446,14 @@ wskbd_attach(struct device *parent, stru
printf(", using %s", sc->sc_displaydv->dv_xname);
 #endif
}
-   printf("\n");
 
 #if NWSMUX > 0
-   if (wsmux_sc != NULL) {
+   /* Ignore mux for console; it always goes to the console mux. */
+   if (wsmux_sc != NULL && ap->console == 0) {
+   printf(" mux %d", mux);
error = wsmux_attach_sc(wsmux_sc, &sc->sc_base);
if (error)
-   printf("%s: attach error=%d\n",
-   sc->sc_base.me_dv.dv_xname, error);
+   printf(": attach error=%d", error);
 
/*
 * Try and set this encoding as the mux default if it
@@ -479,6 +466,7 @@ wskbd_attach(struct device *parent, stru
wsmux_set_layout(wsmux_sc, layout);
}
 #endif
+   printf("\n");
 
 #if NWSDISPLAY > 0 && NWSMUX == 0
if (ap->console == 0) {

Re: Thunar dies and dumps core

2018-04-11 Thread Martin Pieuchot

On 10/04/18(Tue) 19:49, sudhir kumar lal wrote:
> Hi,
> 
>     I use CWM and snapshot of OpenBSD 6.3 and thunar crashes a lot on my
> system too. But it opens files on my system nicely, it only crashes when i
> use Shift+Delete to delete a file. then it core dumps and dies almost every
> time!

It's due to a race in the kqueue(2) backend.  Here's a diff for devel/glib2
that should improve the situation.

I'm going to submit the diff below, commit 1124732 upstream.

Index: Makefile
===
RCS file: /cvs/ports/devel/glib2/Makefile,v
retrieving revision 1.270
diff -u -p -r1.270 Makefile
--- Makefile20 Feb 2018 16:59:19 -  1.270
+++ Makefile11 Apr 2018 14:21:00 -
@@ -9,7 +9,7 @@ COMMENT=general-purpose utility librar
 GNOME_PROJECT= glib
 GNOME_VERSION= 2.54.3
 PKGNAME=   ${DISTNAME:S/glib/glib2/}
-REVISION=  1
+REVISION=  2
 
 CATEGORIES=devel
 
Index: patches/patch-00_kqueue_fix
===
RCS file: patches/patch-00_kqueue_fix
diff -N patches/patch-00_kqueue_fix
--- /dev/null   1 Jan 1970 00:00:00 -
+++ patches/patch-00_kqueue_fix 11 Apr 2018 14:26:44 -
@@ -0,0 +1,2060 @@
+commit aa39a0557c679fc345b0ba72a87c33152eb8ebcd
+Author: Martin Pieuchot 
+Date:   Tue Feb 20 16:57:00 2018 +
+
+kqueue: Multiple fixes and simplifications
+
+ - Stop using a custom thread for listening to kqueue(2) events.  Instead
+   call kevent(2) in non blocking mode in a monitor callback.  Under the
+   hood poll(2) is used to figure out if new events are available.
+
+ - Do not use a socketpair with a custom protocol requiring 2 supplementary
+   context switches per event to communicate between multiple threads.  Calling
+   kevent(2), in non blocking mode, to add/remove events is fine from any
+   context.
+
+ - Add kqueue(2) events without the EV_ONESHOT flag.  This removes a race
+   where some notifications were lost because events had to be re-added for
+   every new notification.
+
+ - Get rid of the global hash table and its associated lock and races.  Use
+   the 'cookie' argument of kevent(2) to pass the associated descriptor when
+   registering an event.
+
+ - Fix _kh_file_appeared_cb() by properly passing a monitor instead of a
+   source to g_file_monitor_emit_event().
+
+ - Properly refcount sources.
+
+ - Remove a lot of abstraction making it harder to fix the remaining issues.
+
+https://bugzilla.gnome.org/show_bug.cgi?id=739424
+
+diff --git gio/kqueue/Makefile.am gio/kqueue/Makefile.am
+index d5657d7e4..24e9724e5 100644
+--- gio/kqueue/Makefile.am
++++ gio/kqueue/Makefile.am
+@@ -4,19 +4,9 @@ noinst_LTLIBRARIES += libkqueue.la
+ 
+ libkqueue_la_SOURCES = \
+gkqueuefilemonitor.c \
+-   gkqueuefilemonitor.h \
+kqueue-helper.c \
+kqueue-helper.h \
+-   kqueue-thread.c \
+-   kqueue-thread.h \
+-   kqueue-sub.c \
+-   kqueue-sub.h \
+kqueue-missing.c \
+-   kqueue-missing.h \
+-   kqueue-utils.c \
+-   kqueue-utils.h \
+-   kqueue-exclusions.c \
+-   kqueue-exclusions.h \
+dep-list.c \
+dep-list.h \
+$(NULL)
+diff --git gio/kqueue/gkqueuefilemonitor.c gio/kqueue/gkqueuefilemonitor.c
+index 78b749637..deed8b1e1 100644
+--- gio/kqueue/gkqueuefilemonitor.c
++++ gio/kqueue/gkqueuefilemonitor.c
+@@ -22,33 +22,73 @@
+ 
+ #include "config.h"
+ 
+-#include "gkqueuefilemonitor.h"
+-#include "kqueue-helper.h"
+-#include "kqueue-exclusions.h"
++#include 
++#include 
++#include 
++#include 
++#include 
++
++#include 
++#include 
++#include 
++
++#include 
++#include 
++#include 
++#include 
+ #include 
+ #include 
+-#include 
++#include 
++#include "glib-private.h"
++
++#include "kqueue-helper.h"
++#include "dep-list.h"
++
++G_LOCK_DEFINE_STATIC (kq_lock);
++static GSource   *kq_source;
++static int  kq_queue = -1;
++
++#define G_TYPE_KQUEUE_FILE_MONITOR(g_kqueue_file_monitor_get_type ())
++#define G_KQUEUE_FILE_MONITOR(inst)   (G_TYPE_CHECK_INSTANCE_CAST ((inst), \
++  G_TYPE_KQUEUE_FILE_MONITOR, GKqueueFileMonitor))
+ 
++typedef GLocalFileMonitorClass GKqueueFileMonitorClass;
+ 
+-struct _GKqueueFileMonitor
++typedef struct
+ {
+   GLocalFileMonitor parent_instance;
+ 
+   kqueue_sub *sub;
+-
++#ifndef O_EVTONLY
+   GFileMonitor *fallback;
+   GFile *fbfile;
+-};
++#endif
++} GKqueueFileMonitor;
++
++GType g_kqueue_file_monitor_get_type (void);
++G_DEFINE_TYPE_WITH_CODE (GKqueueFileMonitor, g_kqueue_file_monitor, G_TYPE_LOCAL_FILE_MONITOR,
++  g_io_extension_point_implement (G_LOCAL_FILE_MONITOR_EXTENSION_POINT_NAME,
++

Re: ddb(4): p[rint] man page example vs. result.

2018-05-09 Thread Martin Pieuchot

On 09/05/18(Wed) 07:48, Artturi Alm wrote:
> On Tue, May 08, 2018 at 01:44:39AM +0300, Artturi Alm wrote:

No bug is irrelevant to fix.  But working with you is hard, really
hard.  You never explain what the problem is.  Reading your email is
an exercise in frustration because you can do some good work but you
fail to communicate.

> > (manual "copypaste"):
> > nc2k4hp# sysctl ddb.trigger=1
> > Stopped at  db_enter+0x4:   popl%ebp
> > ddb{0}> print/x "eax = " $eax "\necx = " $ecx "\n"
> > 3
> > ddb{0}> c
> > ddb.trigger: 0 -> 1
> > 
> > so, for reasons yet unknown to me, p[rint] doesn't seem to work at all
> > like described in the man page, tested on i386.

What does not work?  What does the man page describe?  Do you expect us to
read the man page, then look at your mail again, then try to understand
what is not working? 

> > Should it work? I hope it would.

What should work?  Why do you hope?  Maybe the manpage should be fixed?

> Does feel like waste of time to go any further fixing this, if this is
> yet another bug too irrelevant for anyone to ack for, so _any_ input
> here would be great.

Like I said, no bug is irrelevant, but if the one finding the bug, you
in this case, is not willing to properly explain the problem, then
better not send an email at all ;)

Re: ddb(4): p[rint] man page example vs. result.

2018-05-09 Thread Martin Pieuchot

On 09/05/18(Wed) 12:13, Artturi Alm wrote:
> On Wed, May 09, 2018 at 10:23:41AM +0200, Martin Pieuchot wrote:
> > On 09/05/18(Wed) 07:48, Artturi Alm wrote:
> > > On Tue, May 08, 2018 at 01:44:39AM +0300, Artturi Alm wrote:
> > 
> > 
> > No bug are irrelevant to fix.  But working with you is hard, really
> > hard.  You never explain what the problem is.  Reading your email is
> > an exercise in frustration because you can do some good work but you
> > fail to communicate.
> > 
> > > > (manual "copypaste"):
> > > > nc2k4hp# sysctl ddb.trigger=1
> > > > Stopped at  db_enter+0x4:   popl%ebp
> > > > ddb{0}> print/x "eax = " $eax "\necx = " $ecx "\n"
> > > > 3
> > > > ddb{0}> c
> > > > ddb.trigger: 0 -> 1
> > > > 
> > > > so, for reasons yet unknown to me, p[rint] doesn't seem to work at all
> > > > like described in the man page, tested on i386.
> > 
> > What do no work?  What does the man page describe?  Do you expect us to
> > read the man page, then look at your mail again, then try to understand
> > what is not working? 
> > 
> 
> For example,
> 
>   print/x "eax = " $eax "\necx = " $ecx "\n"
> 
> will print something like this:
> 
>   eax = xx
>   ecx = yy
> 
> Now I did install 5.0 into a VM, and there the result for above example
> would of have been just "Ambiguous", and I'm guessing now that this
> has not been working as in the example since import.
> My fix is limited to producing output just like in the example, but
> input requires more, as it needs escapes for everything not a-z,A-Z,0-9.
> 
> > > > Should it work? I hope it would.
> > 
> > What should work?  Why do you hope?  Maybe the manpage should be fixed?
> > 
> 
> Multiple [addr] arguments to p[rint], including support for strings,
> and i hope so because i would find it useful while testing/writing/porting
> drivers. Maybe, I do like "show struct", and have more than just
> the filtering diff for it, but it doesn't really work for the ad hoc
> usecases p[rint] seems so excellent for.
> 
> > > Does feel like waste of time to go any further fixing this, if this is
> > > yet another bug too irrelevant for anyone to ack for, so _any_ input
> > > here would be great.
> > 
> > Like I said, no bug are irrelevant but if the one finding the bug, you
> > in that case, is not willing to properly explain the problem, then
> > better not send an email at all ;)
> 
> Will try in the future.

Thanks for the explanation!

> haven't tested the diff below yet, but compared to previous, it should
> have working /modifierS.

IMHO we should just amend the man page and keep ddb(4) code simple.

Re: 6.3 amd64 panic: kernel diagnostic assertion in nd6.c

2018-05-10 Thread Martin Pieuchot

On 08/05/18(Tue) 22:26, Michael-John Turner wrote:
> [...]
> ndp info overwritten for fe80:d::b408:97aa:a658:760e by 40:85:1b:ab:69:d5 on 
> vlan41
> ndp info overwritten for fe80:c::b408:97aa:a658:760e by c8:00:44:93:05:62 on 
> vlan40

Could you post your routing table so we can understand which ND entries
are overwritten and if it is normal?

Re: (bug || timewaste)usr.bin/ctfconv: should vlen be 0 for CTF_K_ARRAYs ?

2018-05-13 Thread Martin Pieuchot

On 13/05/18(Sun) 05:36, Artturi Alm wrote:
> Hi,
> 
> 
> I was looking at fixing my code for ctf pprinting arrays in ddb(4),
> and came across ctf in section 5 man pages for freebsd with google,
> which lead me to wondering about this, and even think about possibility
> of an bug here, since the ctf(5)[0] mostly matches what i've seen so
> far in OpenBSD otherwise(didn't see direct asserts/ifs yet to make
> sure CTF_K_ARRAY is always handled in the ctf_stype short form thought).
> 
> In it, under "Type Encoding" vlen is described like:
>   +o   The length of the variable data
> 
> and under "Encoding of Arrays" has this:
> "Arrays, which are of type CTF_K_ARRAY, have no variable length arguments."
> 
> so the above doesn't hold currently, should it?

You can check this yourself by comparing our output with the CTF
generated by devel/ctftools.  If you find out we do not generate the
same data as the reference, then it's a bug.
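
To make the question concrete, here is a minimal sketch of how the 16-bit
type info word is packed, assuming the CTF v2 layout (5 bits of kind, 1 root
bit, 10 bits of vlen).  The SK_* names and the numeric kind values are
assumptions for illustration and should be checked against sys/ctf.h;
sk_vlen_for() only mirrors what the proposed change would compute, it does
not say which behaviour the reference tools actually have.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed CTF v2 info word layout: kind:5 | isroot:1 | vlen:10. */
#define SK_KIND_ARRAY	4	/* assumed value of CTF_K_ARRAY */
#define SK_KIND_STRUCT	6	/* assumed value of CTF_K_STRUCT */
#define SK_MAX_VLEN	0x03ff	/* largest vlen that fits in 10 bits */

/* Pack kind, root flag and vlen into one info word. */
static uint16_t
sk_type_info(unsigned int kind, int isroot, unsigned int vlen)
{
	assert(vlen <= SK_MAX_VLEN);
	return (uint16_t)((kind << 11) | (isroot ? 0x0400 : 0) | vlen);
}

/* Extract the kind from an info word. */
static unsigned int
sk_info_kind(uint16_t info)
{
	return (info & 0xf800) >> 11;
}

/* Extract the vlen from an info word. */
static unsigned int
sk_info_vlen(uint16_t info)
{
	return info & SK_MAX_VLEN;
}

/* vlen as the proposed change would set it: 0 for arrays, nelems otherwise. */
static unsigned int
sk_vlen_for(unsigned int kind, unsigned int nelems)
{
	return (kind != SK_KIND_ARRAY) ? nelems : 0;
}
```

With this, an array of 16 elements would carry vlen 0 in its info word while
a struct with 3 members would carry vlen 3, which is the difference to look
for when comparing against the output of devel/ctftools.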

> 
> While nearly on-topic, is there any definitive docs for CTF?
> + typofix for making up the use of bugs@; sorry:)
> 
> -Artturi
> 
> [0] https://www.freebsd.org/cgi/man.cgi?query=ctf
> 
> 
> diff --git usr.bin/ctfconv/generate.c usr.bin/ctfconv/generate.c
> index e19094fe231..299c0d12eb6 100644
> --- usr.bin/ctfconv/generate.c
> +++ usr.bin/ctfconv/generate.c
> @@ -183,7 +183,7 @@ imcs_add_type(struct imcs *imcs, struct itype *it)
>  
>   assert(it->it_type != CTF_K_UNKNOWN && it->it_type != CTF_K_FORWARD);
>  
> - vlen = it->it_nelems;
> + vlen = it->it_type != CTF_K_ARRAY ? it->it_nelems : 0;
>   size = it->it_size;
>   kind = it->it_type;
>   root = 0;
> diff --git usr.bin/ctfconv/itype.h usr.bin/ctfconv/itype.h
> index 408a2140558..c4878f2783e 100644
> --- usr.bin/ctfconv/itype.h
> +++ usr.bin/ctfconv/itype.h
> @@ -36,7 +36,7 @@ struct itype {
>   TAILQ_ENTRY(itype)   it_symb;   /* itype: global queue of symbol */
>   RB_ENTRY(itype)  it_node;   /* itype: per-type tree of types */
>  
> - SIMPLEQ_HEAD(, itref)it_refs;   /* itpye: backpointing refs */
> + SIMPLEQ_HEAD(, itref)it_refs;   /* itype: backpointing refs */
>  
>   TAILQ_HEAD(, imember)it_members;/* itype: members of struct/union */
>  
>

Re: 6.3 amd64 panic: kernel diagnostic assertion in nd6.c

2018-05-14 Thread Martin Pieuchot

On 13/05/18(Sun) 23:16, Michael-John Turner wrote:
> Hi,
> 
> On Thu, May 10, 2018 at 05:13:17PM +0200, Alexander Bluhm wrote:
> > When an IPv6 neigbor discovery timeout occurs, the kernel tries to
> > remove the NDP entry.  It is stored in the routing table.  The
> > problem is that this NDP route suddenly has a locally configured
> > address.
> 
> Did you perhaps spot anything in the files I made available? The crashes
> have continued daily, I'm guessing when the problematic entry in the NDP
> table expires. I've tried tweaking various settings and have removed some of
> the unusual parts of my setup (moving some of the subnets which shared an
> interface onto their own VLANs, for example), but nothing has helped :(
> Same panic in the same location.
> 
> Happy to provide any further information that you think may help diagnose
> the problem.
> 
> Thanks in advance :)

Could you try the diff below and as soon as you see the message in the
dmesg, get the output of 'route -n show -inet6' and send us both?

Index: nd6.c
===
RCS file: /cvs/src/sys/netinet6/nd6.c,v
retrieving revision 1.224
diff -u -p -r1.224 nd6.c
--- nd6.c   2 May 2018 07:19:45 -   1.224
+++ nd6.c   14 May 2018 13:12:03 -
@@ -722,7 +722,16 @@ nd6_free(struct rtentry *rt)
}
}
 
-   KASSERT(!ISSET(rt->rt_flags, RTF_LOCAL));
+   if (ISSET(rt->rt_flags, RTF_LOCAL)) {
+   char ip[INET6_ADDRSTRLEN];
+
+   printf("%s: called for %s on %s\n", __func__,
+   inet_ntop(AF_INET6, &satosin6(rt_key(rt))->sin6_addr, ip,
+   sizeof(ip)),
+   ifp->if_xname);
+   if_put(ifp);
+   return;
+   }
nd6_invalidate(rt);
 
/*

firefox 60.0 / "modesetting" / pledge

2018-05-15 Thread Martin Pieuchot

After upgrading to the latest packaged version of firefox, browsing became
once again unusable.  This time the problem seems due to rendering, as
switching back to the "intel" driver made the rendering of the pages
normal again.  With the "modesetting" driver it takes multiple seconds
and scrolling is not smooth.

$ pkg_info -q|grep firefox
firefox-60.0

On top of that, trying to download a file using the "save as" menu
results in a pledge problem:

firefox[35328]: pledge "getpw", syscall 33
firefox[35328]: pledge "stdio", syscall 87
firefox[91987]: pledge "getpw", syscall 33

Workaround:

$ cat /etc/X11/xorg.conf 
Section "Device"
  Identifier "Device0"
  Driver "intel"
EndSection


OpenBSD 6.3-current (GENERIC.MP) #38: Wed May  9 17:38:06 MDT 2018
dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 8238301184 (7856MB)
avail mem = 7980589056 (7610MB)
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 2.7 @ 0xccbfd000 (65 entries)
bios0: vendor LENOVO version "N14ET26W (1.04 )" date 01/23/2015
bios0: LENOVO 20BS006BGE
acpi0 at bios0: rev 2
acpi0: sleep states S0 S3 S4 S5
acpi0: tables DSDT FACP SLIC ASF! HPET ECDT APIC MCFG SSDT SSDT SSDT SSDT SSDT 
SSDT SSDT SSDT SSDT SSDT PCCT SSDT UEFI MSDM BATB FPDT UEFI DMAR
acpi0: wakeup devices LID_(S4) SLPB(S3) IGBE(S4) EXP2(S4) XHCI(S3) EHC1(S3)
acpitimer0 at acpi0: 3579545 Hz, 24 bits
acpihpet0 at acpi0: 14318179 Hz
acpiec0 at acpi0
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz, 2295.09 MHz
cpu0: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,RDSEED,ADX,SMAP,PT,IBRS,IBPB,STIBP,SENSOR,ARAT,MELTDOWN
cpu0: 256KB 64b/line 8-way L2 cache
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges
cpu0: apic clock running at 99MHz
cpu0: mwait min=64, max=64, C-substates=0.2.1.2.4.1.1.1, IBE
cpu1 at mainbus0: apid 1 (application processor)
cpu1: Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz, 2294.70 MHz
cpu1: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,RDSEED,ADX,SMAP,PT,IBRS,IBPB,STIBP,SENSOR,ARAT,MELTDOWN
cpu1: 256KB 64b/line 8-way L2 cache
cpu1: smt 1, core 0, package 0
cpu2 at mainbus0: apid 2 (application processor)
cpu2: Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz, 2294.70 MHz
cpu2: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,RDSEED,ADX,SMAP,PT,IBRS,IBPB,STIBP,SENSOR,ARAT,MELTDOWN
cpu2: 256KB 64b/line 8-way L2 cache
cpu2: smt 0, core 1, package 0
cpu3 at mainbus0: apid 3 (application processor)
cpu3: Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz, 2294.70 MHz
cpu3: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,RDSEED,ADX,SMAP,PT,IBRS,IBPB,STIBP,SENSOR,ARAT,MELTDOWN
cpu3: 256KB 64b/line 8-way L2 cache
cpu3: smt 1, core 1, package 0
ioapic0 at mainbus0: apid 2 pa 0xfec0, version 20, 40 pins
acpimcfg0 at acpi0 addr 0xf800, bus 0-63
acpiprt0 at acpi0: bus 0 (PCI0)
acpiprt1 at acpi0: bus -1 (PEG_)
acpiprt2 at acpi0: bus 3 (EXP1)
acpiprt3 at acpi0: bus 4 (EXP2)
acpiprt4 at acpi0: bus -1 (EXP3)
acpiprt5 at acpi0: bus -1 (EXP6)
acpicpu0 at acpi0: C3(200@233 mwait.1@0x40), C2(200@148 mwait.1@0x33), 
C1(1000@1 mwait.1), PSS
acpicpu1 at acpi0: C3(200@233 mwait.1@0x40), C2(200@148 mwait.1@0x33), 
C1(1000@1 mwait.1), PSS
acpicpu2 at acpi0: C3(200@233 mwait.1@0x40), C2(200@148 mwait.1@0x33), 
C1(1000@1 mwait.1), PSS
acpicpu3 at acpi0: C3(200@233 mwait.1@0x40), C2(200@148 mwait.1@0x33), 
C1(1000@1 mwait.1), PSS
acpipwrres0 at acpi0: PUBS, resource for XHCI, EHC1
acpipwrres1 at acpi0: NVP3, resource for PEG_
acpipwrres2 at acpi0: NVP2, resource for PEG_
acpitz0 at acpi0: critical temperature is 128 degC
acpib

Re: 6.3 just died (not for the first time)

2018-05-16 Thread Martin Pieuchot

On 16/05/18(Wed) 08:06, Harald Dunkel wrote:
> Hi folks,

Thanks for the report.
 
> hopefully its allowed to repost this message here:
> 
> One gateway running 6.3 ran into the debugger last night. Last words:
> 
> login: kernel: protection fault trap, code=0
> Stopped at  export_sa+0x5c: movl0(%rcx),%ecx
> ddb{0}> show panic
> the kernel did not panic
> ddb{0}> trace
> export_sa(10,800033445e70) at export_sa+0x5c
> pfkeyv2_expire(813d4c00,813d4c00) at pfkeyv2_expire+0x14e
> tdb_timeout(800033446020) at tdb_timeout+0x39
> softclock_thread(0) at softclock_thread+0xc6
> end trace frame: 0x0, count: -4
> ddb{0}> show registers
> rdi   0x800033445e98
> rsi   0x813d4c00
> rbp   0x800033445e70
> rbx   0x800033445e98
> rdx   0x81abdff0cpu_info_full_primary+0x1ff0
> rcx   0xdeadbeefdeadbeef
^^
That means that the TDB has already been freed.  This is possible
because the timeout sleeps on the NET_LOCK().  Diff below should prevent
that by introducing a tdb_reaper() function like we do in other parts of
the stack.
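
The flag-checking half of the idea can be sketched in user space, assuming a
pthread mutex in place of NET_LOCK() and a hypothetical SK_TIMER flag in
place of the TDBF_* flags.  The point is only that the flag is tested after
the lock has been acquired, so a cancellation that happened while the
callback was sleeping on the lock is noticed.  It does not model the
deferred free that tdb_reaper() is meant to provide.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define SK_TIMER	0x1	/* hypothetical "timer armed" flag */

struct sk_obj {
	unsigned int	flags;
	int		expired;	/* how many times expiry actually ran */
};

static pthread_mutex_t sk_lock = PTHREAD_MUTEX_INITIALIZER;

/* Timeout callback: take the lock first, then decide whether to act. */
static void
sk_timeout(struct sk_obj *o)
{
	assert(o != NULL);

	pthread_mutex_lock(&sk_lock);
	if (o->flags & SK_TIMER) {
		o->flags &= ~SK_TIMER;
		o->expired++;
	}
	pthread_mutex_unlock(&sk_lock);
}

/* Teardown path: disarm the timer under the same lock. */
static void
sk_cancel(struct sk_obj *o)
{
	assert(o != NULL);

	pthread_mutex_lock(&sk_lock);
	o->flags &= ~SK_TIMER;
	pthread_mutex_unlock(&sk_lock);
}
```

If sk_cancel() runs while sk_timeout() is blocked on the lock, the callback
sees the cleared flag and does nothing, instead of acting on state that was
only valid before it went to sleep.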

Index: netinet/ip_ipsp.c
===
RCS file: /cvs/src/sys/netinet/ip_ipsp.c,v
retrieving revision 1.229
diff -u -p -r1.229 ip_ipsp.c
--- netinet/ip_ipsp.c   6 Nov 2017 15:12:43 -   1.229
+++ netinet/ip_ipsp.c   16 May 2018 08:17:59 -
@@ -79,10 +79,11 @@ void tdb_hashstats(void);
 #endif
 
 void   tdb_rehash(void);
-void   tdb_timeout(void *v);
-void   tdb_firstuse(void *v);
-void   tdb_soft_timeout(void *v);
-void   tdb_soft_firstuse(void *v);
+void   tdb_reaper(void *);
+void   tdb_timeout(void *);
+void   tdb_firstuse(void *);
+void   tdb_soft_timeout(void *);
+void   tdb_soft_firstuse(void *);
 inttdb_hash(u_int, u_int32_t, union sockaddr_union *, u_int8_t);
 
 int ipsec_in_use = 0;
@@ -541,14 +542,13 @@ tdb_timeout(void *v)
 {
struct tdb *tdb = v;
 
-   if (!(tdb->tdb_flags & TDBF_TIMER))
-   return;
-
NET_LOCK();
-   /* If it's an "invalid" TDB do a silent expiration. */
-   if (!(tdb->tdb_flags & TDBF_INVALID))
-   pfkeyv2_expire(tdb, SADB_EXT_LIFETIME_HARD);
-   tdb_delete(tdb);
+   if (tdb->tdb_flags & TDBF_TIMER) {
+   /* If it's an "invalid" TDB do a silent expiration. */
+   if (!(tdb->tdb_flags & TDBF_INVALID))
+   pfkeyv2_expire(tdb, SADB_EXT_LIFETIME_HARD);
+   tdb_delete(tdb);
+   }
NET_UNLOCK();
 }
 
@@ -557,14 +557,13 @@ tdb_firstuse(void *v)
 {
struct tdb *tdb = v;
 
-   if (!(tdb->tdb_flags & TDBF_SOFT_FIRSTUSE))
-   return;
-
NET_LOCK();
-   /* If the TDB hasn't been used, don't renew it. */
-   if (tdb->tdb_first_use != 0)
-   pfkeyv2_expire(tdb, SADB_EXT_LIFETIME_HARD);
-   tdb_delete(tdb);
+   if (tdb->tdb_flags & TDBF_SOFT_FIRSTUSE) {
+   /* If the TDB hasn't been used, don't renew it. */
+   if (tdb->tdb_first_use != 0)
+   pfkeyv2_expire(tdb, SADB_EXT_LIFETIME_HARD);
+   tdb_delete(tdb);
+   }
NET_UNLOCK();
 }
 
@@ -573,13 +572,12 @@ tdb_soft_timeout(void *v)
 {
struct tdb *tdb = v;
 
-   if (!(tdb->tdb_flags & TDBF_SOFT_TIMER))
-   return;
-
NET_LOCK();
-   /* Soft expirations. */
-   pfkeyv2_expire(tdb, SADB_EXT_LIFETIME_SOFT);
-   tdb->tdb_flags &= ~TDBF_SOFT_TIMER;
+   if (tdb->tdb_flags & TDBF_SOFT_TIMER) {
+   /* Soft expirations. */
+   pfkeyv2_expire(tdb, SADB_EXT_LIFETIME_SOFT);
+   tdb->tdb_flags &= ~TDBF_SOFT_TIMER;
+   }
NET_UNLOCK();
 }
 
@@ -588,14 +586,13 @@ tdb_soft_firstuse(void *v)
 {
struct tdb *tdb = v;
 
-   if (!(tdb->tdb_flags & TDBF_SOFT_FIRSTUSE))
-   return;
-
NET_LOCK();
-   /* If the TDB hasn't been used, don't renew it. */
-   if (tdb->tdb_first_use != 0)
-   pfkeyv2_expire(tdb, SADB_EXT_LIFETIME_SOFT);
-   tdb->tdb_flags &= ~TDBF_SOFT_FIRSTUSE;
+   if (tdb->tdb_flags & TDBF_SOFT_FIRSTUSE) {
+   /* If the TDB hasn't been used, don't renew it. */
+   if (tdb->tdb_first_use != 0)
+   pfkeyv2_expire(tdb, SADB_EXT_LIFETIME_SOFT);
+   tdb->tdb_flags &= ~TDBF_SOFT_FIRSTUSE;
+   }
NET_UNLOCK();
 }
 
@@ -841,14 +838,6 @@ tdb_free(struct tdb *tdbp)
ipo->ipo_last_searched = 0; /* Force a re-search. */
}
 
-   /* Remove expiration timeouts. */
-   tdbp->tdb_flags &= ~(TDBF_FIRSTUSE | TDBF_SOFT_FIRSTUSE | TDBF_TIMER |
-   TDBF_SOFT_TIMER);
-   timeout_del(&tdbp->tdb_timer_tmo);
-   timeout_del(&tdbp->tdb_first_tmo);

Re: protection fault after fatfingering address

2018-05-21 Thread Martin Pieuchot

On 20/05/18(Sun) 21:10, Alexander Bluhm wrote:
> On Sun, May 20, 2018 at 07:24:05AM +0200, p...@centroid.eu wrote:
> > http://centroid.eu/private/p523.jpg
> 
> ml_enqueue+0x11
> /usr/src/sys/kern/uipc_mbuf.c:1498
> *   33a1:   48 89 71 08 mov%rsi,0x8(%rcx)
> 33a5:   eb 07   jmp33ae 
> 
>   1492  void
>   1493  ml_enqueue(struct mbuf_list *ml, struct mbuf *m)
>   1494  {
>   1495  if (ml->ml_tail == NULL)
>   1496  ml->ml_head = ml->ml_tail = m;
>   1497  else {
> * 1498  ml->ml_tail->m_nextpkt = m;
>   1499  ml->ml_tail = m;
>   1500  }
>   1501  
>   1502  m->m_nextpkt = NULL;
>   1503  ml->ml_len++;
>   1504  }
> 
> arpresolve+0x1bf
> /usr/src/sys/netinet/if_ether.c:383
>  954:   4c 89 ffmov%r15,%rdi
>  957:   4c 89 e6mov%r12,%rsi
>  95a:   e8 00 00 00 00  callq  95f 
> /usr/src/sys/netinet/if_ether.c:384
> *95f:   83 04 25 00 00 00 00addl   $0x1,0x0
> 
>373  la = (struct llinfo_arp *)rt->rt_llinfo;
>374  KASSERT(la != NULL);
>375  if (la_hold_total < LA_HOLD_TOTAL && la_hold_total < nmbclust 
> / 
> 64) {
>376  struct mbuf *mh;
>377  
>378  if (ml_len(&la->la_ml) >= LA_HOLD_QUEUE) {
>379  mh = ml_dequeue(&la->la_ml);
>380  la_hold_total--;
>381  m_freem(mh);
>382  }
> *  383  ml_enqueue(&la->la_ml, m);
>384  la_hold_total++;
>385  } else {
>386  la_hold_total -= ml_purge(&la->la_ml);
>387  m_freem(m);
>388  }
> 
> So the kernel crashes when it accesses the mbuf_list in the struct
> llinfo_arp.
> 
> > route change default -inet6 2001:db8:0:40::300
> 
> As the address families of the route is messed up, I guess that the
> cast in line 373 is wrong.  The data structure is a llinfo_nd6 and
> not a llinfo_arp.
> 
> I could not reproduce the crash, but my kernel accepts an IPv6
> gateway for the IPv4 default route.  This kernel diff prevents that
> user land can add or change such routes.
> 
> root@v74:.../~# route change default -inet6 fdd7:e83e:66bc:74::1234
> change net default: gateway fdd7:e83e:66bc:74::1234: Address family not 
> supported by protocol family

Are you sure this change won't introduce a regression with L2 route
entries?  These entries generally have an Ethernet address as gateway.

In any case it would be nice to add this problem to the route regression
test.
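
To illustrate the concern, here is a minimal sketch of a family check that
lets link-layer gateways through.  The sk_family enum is a stand-in for
AF_INET, AF_INET6 and AF_LINK, and sk_check_gateway() is a hypothetical
helper, not a kernel function.  Whether such an exemption is what rtsock.c
needs is exactly the open question, so this is one possible shape and not a
proposed fix.

```c
#include <errno.h>
#include <stddef.h>

/* Stand-ins for AF_INET, AF_INET6 and AF_LINK (illustration only). */
enum sk_family { SK_INET, SK_INET6, SK_LINK };

/*
 * Return 0 if the destination/gateway pairing is acceptable,
 * EAFNOSUPPORT otherwise.  A NULL pointer means "not supplied".
 */
static int
sk_check_gateway(const enum sk_family *dst, const enum sk_family *gw)
{
	if (dst == NULL || gw == NULL)
		return 0;
	/* L2 entries carry a link-layer address as gateway. */
	if (*gw == SK_LINK)
		return 0;
	if (*dst != *gw)
		return EAFNOSUPPORT;
	return 0;
}
```

An IPv4 destination with an IPv6 gateway is rejected here, while an IPv6
destination with a link-layer gateway is still accepted.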

> Index: net/rtsock.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/rtsock.c,v
> retrieving revision 1.265
> diff -u -p -r1.265 rtsock.c
> --- net/rtsock.c  14 May 2018 07:33:59 -  1.265
> +++ net/rtsock.c  20 May 2018 19:02:08 -
> @@ -718,6 +718,14 @@ route_output(struct mbuf *m, struct sock
>   info.rti_flags |= RTF_LLINFO;
>   }
>  
> + if (info.rti_info[RTAX_DST] != NULL &&
> + info.rti_info[RTAX_GATEWAY] != NULL &&
> + info.rti_info[RTAX_DST]->sa_family !=
> + info.rti_info[RTAX_GATEWAY]->sa_family) {
> + error = EAFNOSUPPORT;
> + goto fail;
> + }
> +
>   /*
>* Validate RTM_PROPOSAL and pass it along or error out.
>*/
>

Re: 6.3 amd64 panic: kernel diagnostic assertion in nd6.c

2018-05-21 Thread Martin Pieuchot

On 17/05/18(Thu) 21:30, Michael-John Turner wrote:
> On Mon, May 14, 2018 at 03:13:12PM +0200, Martin Pieuchot wrote:
> > Could you try the diff below and as soon as you see the message in the
> > dmesg, get the output of 'route -n show -inet6' and send us both?
> 
> It's happened a few times since applying the patch but I've finally managed
> to get the route output at the right moment, as requested. The messages in
> dmesg_ndp_issue.txt have been flooding the message buffer for the last ~40
> minutes or so.
> 
> As the text may wrap a bit oddly if posted to the list, I've placed the
> files here:
> http://dl.rsx11.net/misc/dmesg_ndp_issue.txt
> http://dl.rsx11.net/misc/ndp_ndp_issue.txt
> http://dl.rsx11.net/misc/netstat_ndp_issue.txt
> http://dl.rsx11.net/misc/route_ndp_issue.txt
> 
> Any ideas what could be causing the problem?

No, because you didn't send your dmesg.  I need the full dmesg; the
important part from your original message was:

ndp info overwritten for fe80:d::b408:97aa:a658:760e by 40:85:1b:ab:69:d5 on 
vlan41
ndp info overwritten for fe80:c::b408:97aa:a658:760e by c8:00:44:93:05:62 on 
vlan40

I need the corresponding information for the output you provided above.

I'm guessing the in-kernel state machine tries to overwrite a RTF_LOCAL
address and that should not happen.

Race between dup2(2) and accept(2)

2018-05-21 Thread Martin Pieuchot

If a process exit(3)s while one of its threads is blocking in accept(2)
and the half-opened descriptor has already been dup'ed, we get the
following panic:

panic: closef: count (1) < 2
Stopped at  db_enter+0x5:   popq%rbp
TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
*204115  80020  0  0x10030x80K dup2_accept
db_enter() at db_enter+0x5
panic() at panic+0x120
closef(ff000583d948,8000a020) at closef+0x145
doaccept(1e0,8000a020,1e,839a03c0,7f7c9d58,bc7efe80dd509fa1)
 at doaccept+0x2a3
syscall(1) at syscall+0x31d
Xsyscall_untramp(6,0,0,0,0,1e) at Xsyscall_untramp+0xc0
end of kernel

A test for this problem can be found here:
  https://marc.info/?l=openbsd-tech&m=152637351632752&w=2

Diff below prevents the problem by returning EBUSY in dup2(2) & friends
like Linux does when trying to dup a half-opened file.  I'd like to
reuse this logic to keep the future locking simple, ok?

Index: sys/kern/kern_descrip.c
===
RCS file: /cvs/src/sys/kern/kern_descrip.c,v
retrieving revision 1.158
diff -u -p -r1.158 kern_descrip.c
--- sys/kern/kern_descrip.c 8 May 2018 09:03:58 -   1.158
+++ sys/kern/kern_descrip.c 21 May 2018 12:12:50 -
@@ -634,13 +634,14 @@ finishdup(struct proc *p, struct file *f
return (EDEADLK);
}
 
-   /*
-* Don't fd_getfile here. We want to closef LARVAL files and
-* closef can deal with that.
-*/
oldfp = fdp->fd_ofiles[new];
-   if (oldfp != NULL)
+   if (oldfp != NULL) {
+   if (!FILE_IS_USABLE(oldfp)) {
+   FRELE(fp, p);
+   return (EBUSY);
+   }
FREF(oldfp);
+   }
 
fdp->fd_ofiles[new] = fp;
fdp->fd_ofileflags[new] = fdp->fd_ofileflags[old] & ~UF_EXCLOSE;
Index: lib/libc/sys//dup.2
===
RCS file: /cvs/src/lib/libc/sys/dup.2,v
retrieving revision 1.18
diff -u -p -r1.18 dup.2
--- lib/libc/sys//dup.2 10 Dec 2014 19:46:48 -  1.18
+++ lib/libc/sys//dup.2 21 May 2018 12:12:38 -
@@ -157,6 +157,10 @@ is not a valid active descriptor or
 is negative or greater than or equal to the process's
 .Dv RLIMIT_NOFILE
 limit.
+.It Bq Er EBUSY
+A race condition with
+.Xr accept 2
+has been detected.
 .It Bq Er EINTR
 An interrupt was received.
 .It Bq Er EIO
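
If dup2(2) starts returning EBUSY as proposed above, a caller that wants to
ride out the race could retry a bounded number of times.  The helper name
sk_dup2_retry() and the retry limit are assumptions for illustration;
whether retrying is appropriate at all depends on the program.

```c
#include <errno.h>
#include <unistd.h>

#define SK_MAX_RETRIES	16	/* arbitrary bound, not taken from any source */

/*
 * Call dup2(2), retrying while it fails with EBUSY.
 * Returns the new descriptor, or -1 with errno set.
 */
static int
sk_dup2_retry(int oldfd, int newfd)
{
	int i, r;

	for (i = 0; i < SK_MAX_RETRIES; i++) {
		r = dup2(oldfd, newfd);
		if (r != -1 || errno != EBUSY)
			return r;
	}
	return -1;	/* still EBUSY after all attempts */
}
```

Any other error, EBADF for instance, is returned to the caller on the first
attempt without retrying.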

Re: protection fault trap with OpenBSD 6.3

2018-05-29 Thread Martin Pieuchot

On 28/05/18(Mon) 22:24, Marc Peters wrote:
> Hi List,
> 
> i am having issues with OpenBSD 6.3, latest patches as of today applied. We 
> are using gif-tunnels between our datacenters, transport encryption and 
> OpenBGPD to announce the prefixes between the datacenters. The boxes also 
> have isakmpd tunnels on a carp interface to AWS and GCP. The setup is working 
> fine with existing 6.1 boxes and there's no problem in pushing/receiving 
> several 100MBit/s (according to observium snmpd data, which gets constantly 
> collected). Switching the traffic to the 6.3 hosts, we get a freeze on one of 
> the boxes after about 45 minutes of transferring traffic (all IPv4 traffic in 
> our case for now):

This has been fixed in -current.

Re: bsd.mp hits witness panic under vmm (single CPU)

2018-06-08 Thread Martin Pieuchot

On 07/06/18(Thu) 19:22, Philip Guenther wrote:
> On Thu, 7 Jun 2018, Mike Larkin wrote:
> > Is this a panic inside the guest in vmm, or is this the host panicing when
> > you're doing something while a VM is running in vmm on that host?
> > 
> > Can't really tell from the trace here...
> 
> This was a guest panicing.  visa@ thinks this is the same intr_legacy8 
> panic as reported previously.

It is.  This is not a new issue.  We know legacy interrupts are not
mpsafe.

Re: Assertion failure when adding point-to-point routes to interfaces in rdomain with deleted loopback

2018-06-14 Thread Martin Pieuchot

On 06/06/18(Wed) 16:21, multiplexd wrote:
> >Synopsis:  Assertion failure when adding point-to-point routes to 
> >interfaces in rdomain with deleted loopback
> >Category:  Reliability
> >Environment:
> System  : OpenBSD 6.3
> Details : OpenBSD 6.3 (GENERIC) #3: Thu May 17 23:54:13 CEST 2018
>  
> r...@syspatch-63-amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC
> 
> Architecture: OpenBSD.amd64
> Machine : amd64 (see Description)
> >Description:
> 
> Adding a route to a point-to-point interface such as gre(4) or tun(4) where 
> the interface is in a
> non-default rdomain and the loopback device for the given rdomain has been 
> destroyed will trigger a
> kernel assertion failure, causing a system crash.
> 
> This issue has been observed and reproduced on both an amd64 system (virtual 
> machine on a Debian 9
> host) and a macppc system (iBook G4).
> 
> >How-To-Repeat:
> 
> 1) Create a new loopback device in a non-default rdomain. Example:
> 
> # ifconfig lo2 rdomain 2
> 
> 2) The following two steps can be performed in any order.
>   2a) Create a point-to-point interface. The following example creates a new 
> tun(4) interface,
>   though this has also been reproduced with a gre(4) interface.
> 
> # ifconfig tun0 rdomain 2
> 
>   2b) Delete the loopback device associated with the rdomain.
> 
> # ifconfig lo2 -rdomain destroy
> 
> 3) Add a route to the point-to-point interface, e.g.
> 
> # ifconfig tun0 inet 192.168.200.1 192.168.200.2
> 
>The system will crash and drop to a ddb(4) prompt.
> 
> An example session is shown below:
> 
> bsd00# ifconfig lo2 rdomain 2
> bsd00# ifconfig tun0 rdomain 2
> bsd00# ifconfig lo2 -rdomain destroy
> bsd00# ifconfig tun0 inet 192.168.200.1 192.168.200.2
> panic: kernel diagnostic assertion "lo0ifp != NULL" failed: file 
> "/usr/src/sys/net/if.c", line 1483

Thanks for the report, could you try the diff below?

Index: net/if.c
===
RCS file: /cvs/src/sys/net/if.c,v
retrieving revision 1.554
diff -u -p -r1.554 if.c
--- net/if.c30 May 2018 22:20:41 -  1.554
+++ net/if.c14 Jun 2018 12:36:20 -
@@ -1765,9 +1765,11 @@ if_setrdomain(struct ifnet *ifp, int rdo
if (rdomain != rtable_l2(rdomain))
return (EINVAL);
 
-   /* remove all routing entries when switching domains */
-   /* XXX this is a bit ugly */
if (rdomain != ifp->if_rdomain) {
+   if ((ifp->if_flags & IFF_LOOPBACK) &&
+   (ifp->if_index == rtable_loindex(ifp->if_rdomain)))
+   return (EPERM);
+
s = splnet();
/*
 * We are tearing down the world.
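
For readers following along, the check added by this diff can be modelled in
userland as below.  This is only a sketch: the struct, the flag value and the
model_rtable_loindex() lookup are stand-ins invented for the example, not the
kernel definitions.  Only the shape of the test mirrors the diff.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative stand-ins, not the kernel's definitions. */
#define MODEL_IFF_LOOPBACK	0x8
#define MODEL_MAX_RDOMAIN	4

struct model_ifnet {
	unsigned int	if_index;
	unsigned int	if_rdomain;
	int		if_flags;
};

/* Index of the loopback interface bound to each rdomain (0 = none). */
static unsigned int model_loindex[MODEL_MAX_RDOMAIN];

static unsigned int
model_rtable_loindex(unsigned int rdomain)
{
	return (rdomain < MODEL_MAX_RDOMAIN) ? model_loindex[rdomain] : 0;
}

/*
 * Refuse to move an interface out of its rdomain when it is the
 * loopback that the rdomain depends on; otherwise perform the move.
 */
static int
model_setrdomain(struct model_ifnet *ifp, unsigned int rdomain)
{
	if (rdomain >= MODEL_MAX_RDOMAIN)
		return (EINVAL);

	if (rdomain != ifp->if_rdomain) {
		if ((ifp->if_flags & MODEL_IFF_LOOPBACK) &&
		    ifp->if_index == model_rtable_loindex(ifp->if_rdomain))
			return (EPERM);
		ifp->if_rdomain = rdomain;
	}
	return (0);
}
```

With lo2 registered as the loopback of rdomain 2, moving lo2 away returns
EPERM while moving tun0 still succeeds, which is the behaviour the report
asked for.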

Re: Assertion failure when adding point-to-point routes to interfaces in rdomain with deleted loopback

2018-06-18 Thread Martin Pieuchot

On 16/06/18(Sat) 23:31, multiplexd wrote:
> [...] 
> As a supplementary question, is it intended that (non-default) rdomains 
> cannot be "deleted" at runtime after they have been created?

Let's say that deletion hasn't been implemented.

Re: kernel_lock not locked

2018-07-01 Thread Martin Pieuchot

On 28/06/18(Thu) 14:53, Visa Hankala wrote:
> On Wed, Jun 27, 2018 at 08:46:04PM +0200, Landry Breuil wrote:
> > On Wed, Jun 27, 2018 at 05:37:54PM +0100, Laurence Tratt wrote:
> > > >Synopsis:kernel_lock not locked
> > > >Category:kernel
> > > >Environment:
> > >   System  : OpenBSD 6.3
> > >   Details : OpenBSD 6.3-current (GENERIC.MP) #55: Mon Jun 25 23:01:52 
> > > MDT 2018
> > >
> > > dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> > > 
> > >   Architecture: OpenBSD.amd64
> > >   Machine : amd64
> > > >Description:
> > >   I just hit the following kernel panic (a locking error in sched_bsd.c):
> > > 
> > >   https://imagebin.ca/v/46kV6Tfqe1sc
> > > 
> > >   I can hit this repeatedly by gdb'ing the new quodlibet 4.1.0 update that
> > >   Stuart just pushed to ports. It crashes at load; exactly at the point I
> > >   quit gdb the kernel panics. Here's the userland trace I get just before
> > >   the kernel panic occurs:
> > 
> > Fwiw, i've hit a similar panic (kernel_lock not locked) this weekend (on an 
> > up
> > to date kernel) when using egdb on ... firefox, of course.
> 
> There is a locking bug that gets triggered when a traced and stopped
> multithreaded process is forced to exit. When the bug hits, a thread
> calls exit1() with the kernel locked recursively:
> 
> sched_exit
> exit1
> single_thread_check
> single_thread_set
> issignal  <-- KERNEL_LOCK()
> userret  <-- KERNEL_LOCK()
> syscall
> Xsyscall_untramp
> 
> sched_exit() assumes that a single KERNEL_UNLOCK() releases the lock
> completely. However, the assumption is wrong in the above case.
> sched_exit() switches to the CPU's idle thread, which in turn calls
> mi_switch(). Then, mi_switch() tries to release the kernel lock (which
> is bound to the CPU, and which should not be locked in the first place).
> That causes a panic with WITNESS because WITNESS had associated the lock
> with the exiting thread and the lock is not found in the idle thread's
> lock list. That is why the panic's stack trace looks peculiar:
> 
> panic
> witness_unlock
> ___mp_release_all
> mi_switch
> sched_idle
> 
> Without WITNESS, the system would hang soon instead.
> 
> The bug can be fixed by making sched_exit() release the kernel lock
> completely. That would also make exit1() more agnostic with regard to
> the state of the lock. As an alternative, issignal() could avoid the
> recursive locking.
> 
> Comments? OK?

Thanks for your analysis.  So this is a regression introduced by the fix
for the previous TOCTOU race.

The kernel is currently grabbing the KERNEL_LOCK() in userret() to
serialize access to `ps_sigact'.  In the future we'll want to use finer
locks.  So my question is which fix goes in that direction?  The one
you posted or not grabbing the KERNEL_LOCK() in userret()?

If it doesn't matter, then I believe you should commit your fix, it is
ok mpi@.

> Index: kern/kern_sched.c
> ===
> RCS file: src/sys/kern/kern_sched.c,v
> retrieving revision 1.48
> diff -u -p -r1.48 kern_sched.c
> --- kern/kern_sched.c 19 Jun 2018 19:29:52 -  1.48
> +++ kern/kern_sched.c 28 Jun 2018 13:47:28 -
> @@ -218,8 +218,11 @@ sched_exit(struct proc *p)
>  
>   LIST_INSERT_HEAD(&spc->spc_deadproc, p, p_hash);
>  
> +#ifdef MULTIPROCESSOR
>   /* This process no longer needs to hold the kernel lock. */
> - KERNEL_UNLOCK();
> + KERNEL_ASSERT_LOCKED();
> + __mp_release_all(&kernel_lock);
> +#endif
>  
>   SCHED_LOCK(s);
>   idle = spc->spc_idleproc;
>
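
To make the difference between the two calls concrete, here is a toy model of
a recursive lock.  It is a sketch only: the struct and function names are made
up, and it is single-threaded, so it shows the bookkeeping and nothing about
real SMP behaviour.

```c
#include <assert.h>

/* Toy recursive lock; field names are illustrative only. */
struct model_mplock {
	int	owner;	/* CPU id holding the lock, -1 if free */
	int	depth;	/* recursion depth for the owner */
};

static void
model_lock(struct model_mplock *l, int cpu)
{
	/* Single-threaded model: the lock is free or already ours. */
	assert(l->owner == -1 || l->owner == cpu);
	l->owner = cpu;
	l->depth++;
}

/* Drop one level of recursion, as a single unlock call would. */
static void
model_unlock(struct model_mplock *l)
{
	assert(l->depth > 0);
	if (--l->depth == 0)
		l->owner = -1;
}

/* Drop every level at once and return how many were held. */
static int
model_release_all(struct model_mplock *l)
{
	int held = l->depth;

	l->depth = 0;
	l->owner = -1;
	return (held);
}
```

After two model_lock() calls, one model_unlock() leaves the lock owned with
depth 1, which corresponds to the state the idle thread inherits in the trace
above.  model_release_all() leaves it free whatever the depth was.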

Re: panic: vinvalbuf: dirty bufs

2018-07-07 Thread Martin Pieuchot

On 06/07/18(Fri) 12:49, Alexander Bluhm wrote:
> On Mon, May 07, 2018 at 05:21:19PM +0200, Alexander Bluhm wrote:
> > panic: vinvalbuf: dirty bufs
> 
> At least I know what is going on here.
> 
> vinvalbuf() calls ffs_fsync() to write all dirty buffers of the
> mount point to disk.
> 
> if ((error = VOP_FSYNC(vp, cred, MNT_WAIT, p)) != 0)
> return (error);
> 
> ffs_fsync() does this successfully and verifies that there are no
> dirty blocks left.
> 
> if (!LIST_EMPTY(&vp->v_dirtyblkhd)) {
> 
> But then it calls ufs_update() to write the inode to disk.  It waits
> until the disk operation has finished.
> 
> return (UFS_UPDATE(VTOI(vp), ap->a_waitfor == MNT_WAIT));
> 
> My test is still running a cp -r and rm -rf operating on the file
> system.  While bread() or bwrite() sleeps in the unmount process,
> the rm process inserts a new dirty block into the vnode's list.

So we might need a barrier or a delayed free to fix this problem.

It would be nice to know where the 'cp' and 'rm' processes are blocking
when the 'unmount' process goes to sleep.  You could put a breakpoint
before UFS_UPDATE() and use 'ps /up 0t$PID' to get this information.

Another interesting piece of information is whether at least one of the
two processes already has a reference to `i_devvp'.

Re: Kernel panic: "kernel page fault", "uvm_fault(...)", "x86_ipi_db(...)"

2018-07-20 Thread Martin Pieuchot

On 20/07/18(Fri) 03:12, Mike Larkin wrote:
> On Wed, Jul 18, 2018 at 11:34:41PM +, Romain wrote:
> > > I'm wondering if this is due to the fact that we detach usb(4) devices on 
> > > suspend. Looks like this may be trying to process a timeout that 
> > > corresponds 
> > > to a device that is no longer attached. Maybe the urtwn(4)? 

Well the device is detaching just after re-attaching.  So it must be
something different.  But I agree with your assumption that it is
related to urtwn(4).

The problem seems to be a use-after-free of a timeout.  The question is
which timeout?  Is it in urtwn(4)?  In ic/rtwn.c?  In the wireless stack? 
In the network stack?

Our timeout_add(9) interface is simple but doesn't help to debug such
issues.

Re: Kernel panic: "kernel page fault", "uvm_fault(...)", "x86_ipi_db(...)"

2018-07-20 Thread Martin Pieuchot

On 20/07/18(Fri) 14:32, Theo de Raadt wrote:
> Martin Pieuchot  wrote:
> 
> > On 20/07/18(Fri) 03:12, Mike Larkin wrote:
> > > On Wed, Jul 18, 2018 at 11:34:41PM +, Romain wrote:
> > > > > I'm wondering if this is due to the fact that we detach usb(4) 
> > > > > devices on 
> > > > > suspend. Looks like this may be trying to process a timeout that 
> > > > > corresponds 
> > > > > to a device that is no longer attached. Maybe the urtwn(4)? 
> > 
> > Well the device is detaching just after re-attaching.  So it must be
> > something different.  But I agree with your assumption that it is
> > related to urtwn(4).
> > 
> > The problem seems to be a use-after-free of a timeout.  The question is
> > which timeout?  Is it in urtwn(4)?  In ic/rtwn.c?  In the wireless stack? 
> > In the network stack?
> > 
> > Our timeout_add(9) interface is simple but doesn't help to debug such
> > issue.
> 
> Is it a timeout not removed during detach?

It might be that, or a timeout re-added after being removed because
there's a race somewhere...

That's not the only place where we have such a problem.  If somebody has
an idea or a floating diff to ease timeout debugging, now is the moment
to speak (:
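
One possible shape for such a debugging aid, as a userland sketch: tag every
timeout with its owner and, at detach time, count what is still armed.
Nothing below exists in timeout(9); all names (dbg_timeout, dbg_timeout_add,
dbg_timeout_leaks and so on) are hypothetical and only illustrate the idea.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical debug record; not part of timeout(9). */
struct dbg_timeout {
	const char	*owner;		/* e.g. the device that armed it */
	int		 pending;	/* 1 while armed */
};

#define DBG_MAX_TIMEOUTS	8
static struct dbg_timeout *dbg_armed[DBG_MAX_TIMEOUTS];

/* Arm a timeout and remember it; returns -1 if the table is full. */
static int
dbg_timeout_add(struct dbg_timeout *to)
{
	size_t i;

	if (to->pending)
		return (0);	/* re-arming an armed timeout is fine */
	for (i = 0; i < DBG_MAX_TIMEOUTS; i++) {
		if (dbg_armed[i] == NULL) {
			dbg_armed[i] = to;
			to->pending = 1;
			return (0);
		}
	}
	return (-1);
}

/* Disarm a timeout and forget it. */
static void
dbg_timeout_del(struct dbg_timeout *to)
{
	size_t i;

	for (i = 0; i < DBG_MAX_TIMEOUTS; i++)
		if (dbg_armed[i] == to)
			dbg_armed[i] = NULL;
	to->pending = 0;
}

/* Count timeouts still armed by `owner'; meant to be checked at detach. */
static int
dbg_timeout_leaks(const char *owner)
{
	size_t i;
	int n = 0;

	for (i = 0; i < DBG_MAX_TIMEOUTS; i++)
		if (dbg_armed[i] != NULL &&
		    strcmp(dbg_armed[i]->owner, owner) == 0)
			n++;
	return (n);
}
```

A detach routine could then assert that dbg_timeout_leaks() returns 0 for its
device before freeing its softc, which would at least name the driver whose
timeout outlived it.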

Re: uaudio device works on usb2 port; fails on usb3 port

2020-08-19 Thread Martin Pieuchot

On 18/08/20(Tue) 18:53, Marcus Glocker wrote:
> On Wed, 12 Aug 2020 21:39:15 +0200
> Marcus Glocker  wrote:
> 
> > jmc was so nice to send me his trouble device over to do some further
> > investigations.  Just some updates on what I've noticed today:
> > 
> > - The issue isn't specific to xhci(4).  I also see the same issue on
> >   some of my ehci(4) machines when attaching this device.
> > 
> > - It seems like the device gets in to an 'corrupted state' after
> >   running a couple of control transfer against it.  Initially they
> >   work fine, with smaller and larger transfer sizes, and at one point
> >   the device hangs up and doesn't recover until re-attaching it.
> > While on some ehci(4) machines the uhidev(4) attach works fine, after
> >   running lsusb against the device, I see transfer errors coming up
> >   again;  On xhci(4) namely XHCI_CODE_TXERR.
> > 
> > - Attaching an USB 2.0 hub doesn't make any difference, no matter if
> >   attached to an xhci(4) or an ehci(4) controller.
> > 
> > Not sure what is going wrong with this little beast ...
> 
> OK, I give up :-)  Following my summary report.
> 
> This device seems to have issues with two control request types:
> 
> - UR_GET_STATUS, not called for this device from the kernel in the
>   default code path.  But e.g. 'lsusb -v' will call it.
> 
> - UR_SET_IDLE, as called in uhidev_attach().
> 
> UR_GET_STATUS will stall the device for good on *all* controller
> drivers.

Does this also happen when the device attaches as ugen(4)?  If so, that
would rule out concurrency issues that might happen when using lsusb(1)
while other transfers are in flight.  To test, you need to disable the
currently attaching driver in ukc.

> UR_SET_IDLE works only on ehci(4) - Don't ask me why.
> On all the other controller drivers the following UR_GET_REPORT request
> will fail, stalling the device as well.  I tried all kind of things to
> get the UR_SET_IDLE request working on xhci(4), but without any luck.

Does the device respond to GET_IDLE?

Is it a timing problem?  How much time does the device need to be idle?
Does introducing a delay before and/or after usbd_set_idle() change the
behavior? 

Did you try passing a non-0 duration parameter to the SET_IDLE command?

Taking a step back, why does a uaudio(4) device need a UR_SET_IDLE?  This
tells the device to only respond to IN interrupt transfers when new
events occur, right?  Do all devices attaching to uhidev(4) want this
behavior?
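
For reference, this is roughly what goes on the wire for the two HID idle
requests, following the HID class specification.  The helper names and the
struct are made up for this sketch and are not the usbd_*() API.  The duration
is expressed in 4 ms units, and 0 means "report only on change".

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a USB control setup packet (host-order values). */
struct hid_setup {
	uint8_t		bmRequestType;
	uint8_t		bRequest;
	uint16_t	wValue;
	uint16_t	wIndex;
	uint16_t	wLength;
};

#define HID_REQ_GET_IDLE	0x02
#define HID_REQ_SET_IDLE	0x0a

/* SET_IDLE: duration in the high byte of wValue, report id in the low. */
static struct hid_setup
hid_set_idle_setup(uint8_t ifaceno, uint8_t duration, uint8_t report_id)
{
	struct hid_setup s;

	s.bmRequestType = 0x21;	/* host-to-device, class, interface */
	s.bRequest = HID_REQ_SET_IDLE;
	s.wValue = (uint16_t)((duration << 8) | report_id);
	s.wIndex = ifaceno;
	s.wLength = 0;
	return (s);
}

/* GET_IDLE: the device answers with one byte, the current idle rate. */
static struct hid_setup
hid_get_idle_setup(uint8_t ifaceno, uint8_t report_id)
{
	struct hid_setup s;

	s.bmRequestType = 0xa1;	/* device-to-host, class, interface */
	s.bRequest = HID_REQ_GET_IDLE;
	s.wValue = report_id;
	s.wIndex = ifaceno;
	s.wLength = 1;
	return (s);
}

/* Convert a wanted idle period in ms to 4 ms units, clamped to a byte. */
static uint8_t
hid_idle_duration(unsigned int ms)
{
	unsigned int units = (ms + 3) / 4;

	return (uint8_t)(units > 255 ? 255 : units);
}
```

A non-zero test would be, for example, hid_set_idle_setup(ifaceno,
hid_idle_duration(500), 0), which asks the device to repeat its report every
500 ms instead of staying silent until something changes.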

> The good news is that when we skip the UR_SET_IDLE request on xhci(4),
> the following UR_GET_REPORT request works, and isoc transfers also work
> perfectly fine.  You can use the device for audio streaming.
> 
> Therefore the only thing I can offer is a quirk to skip the
> UR_SET_IDLE request when attaching this device.  On ehci(4) the
> device continues to work as before with this quirk.  Therefore I
> didn't include any code to only apply the quirk on non-ehci
> controllers.
> 
> I know it's not a nice solution, but at least it makes this device
> usable on xhci(4) while not impacting other things.

Maybe it is a step towards a real solution.  Should usbd_set_idle()
stay in uhidev(4) or, if it doesn't make sense for all devices, should
we move it into child drivers like ukbd(4), etc?

> If anyone is OK with that and has no better idea how to fix it, I'm
> happy to commit.
> 
> Cheers,
> Marcus
> 
> 
> Index: uhidev.c
> ===
> RCS file: /cvs/src/sys/dev/usb/uhidev.c,v
> retrieving revision 1.80
> diff -u -p -u -p -r1.80 uhidev.c
> --- uhidev.c  31 Jul 2020 10:49:33 -  1.80
> +++ uhidev.c  18 Aug 2020 13:36:13 -
> @@ -151,7 +151,8 @@ uhidev_attach(struct device *parent, str
>   sc->sc_ifaceno = uaa->ifaceno;
>   id = usbd_get_interface_descriptor(sc->sc_iface);
>  
> - usbd_set_idle(sc->sc_udev, sc->sc_ifaceno, 0, 0);
> + if (!(usbd_get_quirks(uaa->device)->uq_flags & UQ_NO_SET_IDLE))
> + usbd_set_idle(sc->sc_udev, sc->sc_ifaceno, 0, 0);
>  
>   sc->sc_iep_addr = sc->sc_oep_addr = -1;
>   for (i = 0; i < id->bNumEndpoints; i++) {
> Index: usb_quirks.c
> ===
> RCS file: /cvs/src/sys/dev/usb/usb_quirks.c,v
> retrieving revision 1.76
> diff -u -p -u -p -r1.76 usb_quirks.c
> --- usb_quirks.c  5 Jan 2020 00:54:13 -   1.76
> +++ usb_quirks.c  18 Aug 2020 13:36:13 -
> @@ -52,6 +52,7 @@ const struct usbd_quirk_entry {
>   u_int16_t bcdDevice;
>   struct usbd_quirks quirks;
>  } usb_quirks[] = {
> + { USB_VENDOR_MICROCHIP, USB_PRODUCT_MICROCHIP_SOUNDKEY, ANY, {
> UQ_NO_SET_IDLE }}, { USB_VENDOR_KYE, USB_PRODUCT_KYE_NICHE,
> 0x100, { UQ_NO_SET_PROTO}}, { USB_VENDOR_INSIDEOUT,
> USB_PRODUCT_INSIDEOUT_EDGEPORT4, 0x094, { UQ_SWAP_UNICODE}},
> Index: usb_quirks.h
> ===
> RCS file: /cvs/src/sys/dev/usb/usb_quirks.h,v
> retrieving r

Re: uaudio device works on usb2 port; fails on usb3 port

2020-08-23 Thread Martin Pieuchot

On 21/08/20(Fri) 11:46, Marcus Glocker wrote:
> On Wed, 19 Aug 2020 20:31:05 +0200
> Marcus Glocker  wrote:
> 
> > On Wed, Aug 19, 2020 at 01:21:35PM +0200, Marcus Glocker wrote:
> > 
> > > On Wed, 19 Aug 2020 12:02:23 +0200
> > > Martin Pieuchot  wrote:
> > >   
> > > > On 18/08/20(Tue) 18:53, Marcus Glocker wrote:  
> > > > > On Wed, 12 Aug 2020 21:39:15 +0200
> > > > > Marcus Glocker  wrote:
> > > > > 
> > > > > > jmc was so nice to send me his trouble device over to do some
> > > > > > further investigations.  Just some updates on what I've
> > > > > > noticed today:
> > > > > > 
> > > > > > - The issue isn't specific to xhci(4).  I also see the same
> > > > > > issue on some of my ehci(4) machines when attaching this
> > > > > > device.
> > > > > > 
> > > > > > - It seems like the device gets in to an 'corrupted state'
> > > > > > after running a couple of control transfer against it.
> > > > > > Initially they work fine, with smaller and larger transfer
> > > > > > sizes, and at one point the device hangs up and doesn't
> > > > > > recover until re-attaching it. While on some ehci(4) machines
> > > > > > the uhidev(4) attach works fine, after running lsusb against
> > > > > > the device, I see transfer errors coming up again;  On
> > > > > > xhci(4) namely XHCI_CODE_TXERR.
> > > > > > 
> > > > > > - Attaching an USB 2.0 hub doesn't make any difference, no
> > > > > > matter if attached to an xhci(4) or an ehci(4) controller.
> > > > > > 
> > > > > > Not sure what is going wrong with this little beast ...
> > > > > 
> > > > > OK, I give up :-)  Following my summary report.
> > > > > 
> > > > > This device seems to have issues with two control request types:
> > > > > 
> > > > > - UR_GET_STATUS, not called for this device from the kernel
> > > > > in the default code path.  But e.g. 'lsusb -v' will call it.
> > > > > 
> > > > > - UR_SET_IDLE, as called in uhidev_attach().
> > > > > 
> > > > > UR_GET_STATUS will stall the device for good on *all* controller
> > > > > drivers.
> > > > 
> > > > Does this also happen when the device attaches as ugen(4)?  If yes
> > > > that would rules out concurrency issues that might happen when
> > > > using lsusb(1) while other transfers are in fly.  To test you
> > > > need to disable the current attaching driver in ukc.  
> > > 
> > > Yes, it does also happen when attaching the device to ugen(4).
> > > But honestly, I was playing around yesterday evening a bit further
> > > with this device, and I noticed that the device also stalls with
> > > lsusb when I remove the get status and get report request in the
> > > lsusb code.
> > > 
> > > Therefore I need to correct my statement, saying instead that *some*
> > > request in lsusb makes the device stall as well.  What I just found
> > > in the lsusb ChangeLog:
> > > 
> > > Added (somewhat dummy) Set_Protocol and Set_Idle requests to
> > > stream dumping setup.
> > > 
> > > I'll try to confirm if the stall really happens there.  At least
> > > that would be in line with our findings in the kernel.  
> > 
> > OK, I've tracked the two lsusb requests down finally which also stall
> > this device beside our set idle call in the kernel.
> > 
> > UR_GET_DESCRIPTOR, UDESC_DEVICE_QUALIFIER:
> > 
> > ret = usb_control_msg(fd, LIBUSB_ENDPOINT_IN |
> > LIBUSB_REQUEST_TYPE_STANDARD | LIBUSB_RECIPIENT_DEVICE,
> > LIBUSB_REQUEST_GET_DESCRIPTOR,
> > USB_DT_DEBUG << 8, 0,
> > buf, sizeof buf, CTRL_TIMEOUT);
> > 
> > UR_GET_DESCRIPTOR, UDESC_DEBUG:
> > 
> > ret = usb_control_msg(fd, LIBUSB_ENDPOINT_IN |
> > LIBUSB_REQUEST_TYPE_STANDARD | LIBUSB_RECIPIENT_DEVICE,
> > LIBUSB_REQUEST_GET_DESCRIPTOR,
> > USB_DT_DEBUG << 8, 0,
> > buf, sizeof buf, CTRL_TIMEOUT);
> > 
> > When you comment those two control requests out, lsusb -v runs
> > through.
> > 
> > If I wouldn't know better, I would say that this device

Re: VPS crash to kernel panic on boot

2020-11-26 Thread Martin Pieuchot

On 25/11/20(Wed) 19:41, AIsha Tammy wrote:
> Replicable bug that has happened from sysupgrading to snapshot.
> VPS was working perfectly until this sysupgrade.
> 
> VPS boots - drops to kernel panic ddb
> 
> Seems to be some mutex issue?
> Had to manually copy information cuz weird web console, so my apologies
> if this isn't enough information.

What is the date of the snapshots?  If you can reproduce this could you
give us the output of the "trace" command?

Thanks,
Martin

Re: kernel panic when removing interface

2020-11-26 Thread Martin Pieuchot

On 24/11/20(Tue) 09:23, Pierre Emeriaud wrote:
> > Trying to use mgre(4), I found what looks like a reliable way to crash
> > the kernel which might be of interest.
> >
> > This machine is a one-month-old-current fairly light router, with inet
> > default within rdomain 1. I will upgrade to a more recent snap
> > shortly.
> 
> I just upgraded to OpenBSD 6.8-current (GENERIC) #181: Mon Nov 23
> 20:55:15 MST 2020 and the same thing happens with vlan(4):
> 
> $ doas ifconfig vlan12 inet 192.0.2.1/24 parent vio0 vnetid 12
> $ ifconfig vlan
> vlan12: flags=8843 mtu 1500
> lladdr 02:00:00:ef:3d:d7
> index 8 priority 0 llprio 3
> encap: vnetid 12 parent vio0 txprio packet rxprio outer
> groups: vlan
> media: Ethernet autoselect
> status: active
> inet 192.0.2.1 netmask 0xff00 broadcast 192.0.2.255
> 
> $ doas route -T1 add 192.0.2.2/32 -link -iface vlan12

I wonder if the problem isn't in the validation of these parameters.

Should we accept a L2 (-link) entry on a routing table which isn't the
routing domain?  If so why does the entry persist in the ARP cache?

Can you reproduce the problem if you don't specify -T1?

> add host 192.0.2.2/32: gateway vlan12
> 
> $ route -T1 -n show -inet
> DestinationGatewayFlags   Refs  Use   Mtu  Prio Iface
> 192.0.2.2  link#8 UHLS   00 - 8 vlan12
> 
> $ route -n show -inet
> Internet:
> DestinationGatewayFlags   Refs  Use   Mtu  Prio Iface
> 192.0.2/24 192.0.2.1  UCn00 - 4 vlan12
> 192.0.2.1  02:00:00:ef:3d:d7  UHLl   00 - 1 vlan12
> 192.0.2.255192.0.2.1  UHb00 - 1 vlan12
> 
> $ doas ifconfig vlan12 down
> $ doas ifconfig vlan12 destroy
> 
> $ route -T1 -n show -inet
> DestinationGatewayFlags   Refs  Use   Mtu  Prio Iface
> 192.0.2.2  link#8 UHLS   00 - 8 (null)
> 
> $ doas route -T1 del 192.0.2.2/32
> 
> login: panic: kernel diagnostic assertion "ifp != NULL" failed: file
> "/usr/src/sys/net/rtsock.c", line 975
> Stopped at  db_enter+0x10:  popq%rbp
> TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
> *189431  84402  00x13  00  route
> db_enter() at db_enter+0x10
> panic(81dcc1d7) at panic+0x12a
> __assert(81e32678,81e40e69,3cf,81d9f5fd) at 
> __assert+0x
> 2b
> rtm_output(80071480,8e77ce80,8e77cdd8,40,1) at 
> rtm_outp
> ut+0x7ee
> route_output(fd801ef36c00,fd801af0d698,0,0) at route_output+0x3c3
> route_usrreq(fd801af0d698,9,fd801ef36c00,0,0,8e720540) at 
> route
> _usrreq+0x21a
> sosend(fd801af0d698,0,8e77d0d8,0,0,0) at sosend+0x35b
> dofilewritev(8e720540,3,8e77d0d8,0,8e77d1b0) at 
> dofilew
> ritev+0x14d
> sys_write(8e720540,8e77d150,8e77d1b0) at 
> sys_write+0x51
> 
> syscall(8e77d220) at syscall+0x315
> Xsyscall() at Xsyscall+0x128
> end of kernel
> end trace frame: 0x7f7d35b0, count: 4
> https://www.openbsd.org/ddb.html describes the minimum info required in bug
> reports.  Insufficient info makes it difficult to find and fix bugs.
> ddb>
>

Re: VPS crash to kernel panic on boot

2020-11-26 Thread Martin Pieuchot

On 26/11/20(Thu) 09:21, AIsha Tammy wrote:
> On 11/26/20 6:51 AM, Martin Pieuchot wrote:
> > On 25/11/20(Wed) 19:41, AIsha Tammy wrote:
> >> Replicable bug that has happened from sysupgrading to snapshot.
> >> VPS was working perfectly until this sysupgrade.
> >>
> >> VPS boots - drops to kernel panic ddb
> >>
> >> Seems to be some mutex issue?
> >> Had to manually copy information cuz weird web console, so my apologies
> >> if this isn't enough information.
> > What is the date of the snapshots?  If you can reproduce this could you
> > give us the output of the "trace" command?
> >
> > Thanks,
> > Martin
> >
> 
> Yes, reproducible crashes on multiple reboots.

Thanks, the diff below should fix it, could you test it?

Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.151
diff -u -p -r1.151 uvm_page.c
--- uvm/uvm_page.c  24 Nov 2020 13:49:09 -  1.151
+++ uvm/uvm_page.c  26 Nov 2020 17:17:55 -
@@ -180,7 +180,7 @@ uvm_page_init(vaddr_t *kvm_startp, vaddr
TAILQ_INIT(&uvm.page_active);
TAILQ_INIT(&uvm.page_inactive_swp);
TAILQ_INIT(&uvm.page_inactive_obj);
-   mtx_init(&uvm.pageqlock, IPL_NONE);
+   mtx_init(&uvm.pageqlock, IPL_VM);
mtx_init(&uvm.fpageqlock, IPL_VM);
uvm_pmr_init();

Re: kernel panic when removing interface

2020-11-27 Thread Martin Pieuchot

On 26/11/20(Thu) 20:38, Pierre Emeriaud wrote:
> Hello Martin
> 
> Le jeu. 26 nov. 2020 à 14:27, Martin Pieuchot  a écrit :
> >
> > >
> > > $ doas route -T1 add 192.0.2.2/32 -link -iface vlan12
> >
> > I wonder if the problem isn't in the validation of these parameters.
> >
> > Should we accept a L2 (-link) entry on a routing table which isn't the
> > routing domain?  If so why does the entry persist in the ARP cache?
> 
> Which arp entry are you referring to? The one from the route I added?

Yes.  In the kernel ARP entries are represented as route entries.  So
when you add a "-link" route it is an ARP entry.

> > Can you reproduce the problem if you don't specify T1?
> 
> No. The routes are correctly removed when the interface is destroyed.
> It only crashes when the routes are added to another (non-empty if
> that matters) rdomain, but again, this was a silly mistake on my side.

Still, silly mistakes should be prevented and not crash the kernel ;)

> I reported it as it might be of interest to fix this for the sake of
> it, but it causes almost no harm.

It is, I guess a fix should go in net/rtsock.c to prevent adding a "-link"
entry to a routing table different from ifp->if_rdomain.

> PS: I've managed to crash my first router just by waiting a few
> seconds - no need to remove the route - same thing as the second
> router:
> ddb> show panic
> kernel diagnostic assertion "ifp != NULL" failed: file 
> "/usr/src/sys/netinet/if
> _ether.c", line 718
> 
> ddb> trace
> db_enter() at db_enter+0x10
> panic(81dc761f) at panic+0x12a
> __assert(81e321c2,81db9f2b,2ce,81d9e429) at 
> __assert+0x
> 2b
> arp_rtrequest(fd800baa10a8,fd800baa10a8,fd801aa63dc0) at 
> arp_rtrequ
> est
> arptimer(8216a090) at arptimer+0x67
> softclock_thread(8000ea40) at softclock_thread+0x13f
> end trace frame: 0x0, count: -6

Re: kernel panic when removing interface

2020-11-27 Thread Martin Pieuchot

On 27/11/20(Fri) 15:47, Denis Fondras wrote:
> > It is, I guess a fix should go in net/rtsock.c to prevent adding "-link"
> > entry on routing table different from ifp->if_rdomain.
> > 
> 
> I came up with this, which is more radical.

Which is not exactly what we want.  This will prevent adding any route
on a routing table different from rdomain.

What needs to be enforced is a check on requests coming from userland
that try to insert a "-link" route.  Such a check would have the benefit
of documenting that L2 entries should only be inserted in the rdomain
table of an interface.

> Index: route.c
> ===
> RCS file: /cvs/src/sys/net/route.c,v
> retrieving revision 1.397
> diff -u -p -r1.397 route.c
> --- route.c   29 Oct 2020 21:15:27 -  1.397
> +++ route.c   27 Nov 2020 09:39:53 -
> @@ -865,6 +865,8 @@ rtrequest(int req, struct rt_addrinfo *i
>   return (EINVAL);
>   ifa = info->rti_ifa;
>   ifp = ifa->ifa_ifp;
> + if (tableid != ifp->if_rdomain)
> + return (EINVAL);
>   if (prio == 0)
>   prio = ifp->if_priority + RTP_STATIC;
>  
>
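
A sketch of the narrower check, modelled in userland.  Everything here is made
up for illustration: the struct layouts, the function name, and
MODEL_RTF_LLINFO, which stands in for whatever marks a request as an L2
entry.  The real check would live in net/rtsock.c and might test something
different.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Placeholder for "this request creates an L2 entry". */
#define MODEL_RTF_LLINFO	0x400

struct model_ifnet {
	unsigned int	if_rdomain;
};

struct model_rtreq {
	int				 flags;
	unsigned int			 tableid;
	const struct model_ifnet	*ifp;
};

/* Validate a route insertion request coming from userland. */
static int
model_validate_route_add(const struct model_rtreq *req)
{
	if (req->ifp == NULL)
		return (EINVAL);

	/* L2 entries only belong in the interface's own rdomain table. */
	if ((req->flags & MODEL_RTF_LLINFO) &&
	    req->tableid != req->ifp->if_rdomain)
		return (EINVAL);

	/* Any other route may still go into any table. */
	return (0);
}
```

Unlike the quoted diff, this rejects only the L2 case: a plain route on table 1
through an interface in rdomain 0 is still accepted.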

Re: 6.8 GENERIC MP#1 Kernel panic on ASUS VivoBook S510U

2020-12-21 Thread Martin Pieuchot

Thanks for the report.

On 21/12/20(Mon) 17:00, Aning wrote:
> It's the second mail i try to send to mailing list. After 12 hours i still 
> can't view the first one on marc.info
> It have 15 photo attachments, but all mail was less than 25 mg. Often 
> protonmail responds when email wasn't received, but not this time.
> I hope this gives me excuse to upload screen photos onto mega.co.nz, sorry i 
> have not established my own email service and ftp yet.
> 
> Anyway here all the screen photos of ddb: 
> https://mega.nz/folder/9cwCzLIL#CymzilZEOzuA9ugLPKiVeA

It seems that sleep_finish() is called with a mutex held.  If you can
hit this panic again, could you try to type "ps /o" after getting the
"trace"?

From the output it is not clear which thread is running and since the
trace stops (starts) at sleep_finish(), I can't figure out which code
path we're dealing with.

Re: top over SSH runaway after network drop

2020-12-25 Thread Martin Pieuchot

Hello,

On 24/12/20(Thu) 12:35, th...@liquidbinary.com wrote:
> >Synopsis:If network drops while running top over SSH, runaway process
> >Category:minor, poor handling of failure mode
> >Environment:
>   System  : OpenBSD 6.7
>   Details : OpenBSD 6.7 (GENERIC) #5: Wed Oct 28 00:25:20 MDT 2020
>
> t...@syspatch-67-amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC
> 
>   Architecture: OpenBSD.amd64
>   Machine : amd64
> >Description:
>   If I SSH into any of various amd64 OpenBSD servers, virtual or 
> physical, 
> if m running a monitoring process like top, or multitail -f, on a remote 
> machine 
> over SSH and the network drops or client machine disconnects, the server 
> process 
> consumes nearly 100% of CPU and does not stop itself.  I can log back in and 
> kill the process, but until I do I have a CPU being consumed.  This affects 
> performance, possibly costing money on a virtual server.  This behavior is 
> years old.

Did you try to reproduce this bug on -current?  Is it still there?

If it is, could you please ktrace(1) the program consuming 100% of CPU
before killing it?  Then add the kdump(1) output to this bug report so
we have an idea of what it is doing and hopefully what needs to be
fixed.

Thanks for your report.

firefox pledge violation

2021-02-19 Thread Martin Pieuchot

Firefox from -current, tab crashes, kernel says:

firefox[86270]: pledge "", syscall 289

Trace is:

#0  shmget () at /tmp/-:3
#1  0x0b38d9347d7b in ?? () from /usr/X11R6/lib/modules/dri/swrast_dri.so
#2  0x0b38d994ac4b in ?? () from /usr/X11R6/lib/modules/dri/swrast_dri.so
#3  0x0b38d8c79eb0 in ?? () from /usr/X11R6/lib/modules/dri/swrast_dri.so
#4  0x0b38d8c7aa2b in ?? () from /usr/X11R6/lib/modules/dri/swrast_dri.so
#5  0x0b38d8ce44ed in ?? () from /usr/X11R6/lib/modules/dri/swrast_dri.so
#6  0x0b38d8ce553e in ?? () from /usr/X11R6/lib/modules/dri/swrast_dri.so
#7  0x0b38d8c7bfa1 in ?? () from /usr/X11R6/lib/modules/dri/swrast_dri.so
#8  0x0b38d925495a in ?? () from /usr/X11R6/lib/modules/dri/swrast_dri.so
#9  0x0b3808396dea in drisw_bind_context (context=0xb37ef35a600, 
old=, draw=, read=)
at /usr/xenocara/lib/mesa/mk/libGL/../../src/glx/drisw_glx.c:394
#10 0x0b380839b30e in MakeContextCurrent (dpy=0xb38afd6, 
draw=14680067, read=14680067, gc_user=0xb37ef35a600)
at /usr/xenocara/lib/mesa/mk/libGL/../../src/glx/glxcurrent.c:220
#11 0x0b38c7109b3a in mozilla::gl::GLContextGLX::MakeCurrentImpl() const ()
   from /usr/local/lib/firefox/libxul.so.99.0
#12 0x0b38c7113f1a in mozilla::gl::GLContext::InitImpl() ()
   from /usr/local/lib/firefox/libxul.so.99.0
#13 0x0b38c7113e58 in mozilla::gl::GLContext::Init() ()
   from /usr/local/lib/firefox/libxul.so.99.0
#14 0x0b38c7109aab in mozilla::gl::GLContextGLX::Init() ()
   from /usr/local/lib/firefox/libxul.so.99.0
#15 0x0b38c71098e5 in 
mozilla::gl::GLContextGLX::CreateGLContext(mozilla::gl::GLContextDesc const&, 
_XDisplay*, unsigned long, __GLXFBConfigRec*, bool, gfxXlibSurface*) () from 
/usr/local/lib/firefox/libxul.so.99.0
#16 0x0b38c710a8bc in 
mozilla::gl::GLContextProviderGLX::CreateHeadless(mozilla::gl::GLContextCreateDesc
 const&, nsTSubstring*) ()
   from /usr/local/lib/firefox/libxul.so.99.0
#17 0x0b38c80d977b in mozilla::WebGLContext::CreateAndInitGL(bool, 
std::__1::vector >*) () from 
/usr/local/lib/firefox/libxul.so.99.0
#18 0x0b38c80da009 in 
mozilla::WebGLContext::Create(mozilla::HostWebGLContext&, 
mozilla::webgl::InitContextDesc const&, mozilla::webgl::InitContextResult*)
() from /usr/local/lib/firefox/libxul.so.99.0
#19 0x0b38c80699c1 in 
mozilla::ClientWebGLContext::CreateHostContext(mozilla::avec2 
const&) () from /usr/local/lib/firefox/libxul.so.99.0
#20 0x0b38c806c502 in mozilla::ClientWebGLContext::SetDimensions(int, int)
() from /usr/local/lib/firefox/libxul.so.99.0
#21 0x0b38c80677d7 in 
mozilla::dom::CanvasRenderingContextHelper::UpdateContext(JSContext*, 
JS::Handle, mozilla::ErrorResult&) ()
   from /usr/local/lib/firefox/libxul.so.99.0
#22 0x0b38c8067579 in 
mozilla::dom::CanvasRenderingContextHelper::GetContext(JSContext*, 
nsTSubstring const&, JS::Handle, mozilla::ErrorResult&) () 
from /usr/local/lib/firefox/libxul.so.99.0
#23 0x0b38c7f48113 in mozilla::dom::HTMLCanvasElement_Binding::getContext(JS
Context*, JS::Handle, void*, JSJitMethodCallArgs const&) ()
   from /usr/local/lib/firefox/libxul.so.99.0
#24 0x0b38c80034cc in bool 
mozilla::dom::binding_detail::GenericMethod(JSContext*, unsigned int, 
JS::Value*) ()
   from /usr/local/lib/firefox/libxul.so.99.0
#25 0x0b38ca7695e5 in js::InternalCallOrConstruct(JSContext*, JS::CallArgs 
const&, js::MaybeConstruct, js::CallReason) ()
   from /usr/local/lib/firefox/libxul.so.99.0
#26 0x0b38ca765cbb in Interpret(JSContext*, js::RunState&) ()
   from /usr/local/lib/firefox/libxul.so.99.0
#27 0x0b38ca75c022 in js::RunScript(JSContext*, js::RunState&) ()
   from /usr/local/lib/firefox/libxul.so.99.0
#28 0x0b38ca7696ec in js::InternalCallOrConstruct(JSContext*, JS::CallArgs 
const&, js::MaybeConstruct, js::CallReason) ()
   from /usr/local/lib/firefox/libxul.so.99.0
#29 0x0b38ca769e2a in js::Call(JSContext*, JS::Handle, 
JS::Handle, js::AnyInvokeArgs const&, JS::MutableHandle, 
js::CallReason) () from /usr/local/lib/firefox/libxul.so.99.0
#30 0x0b38cad6ae6d in js::jit::InvokeFunction(JSContext*, 
JS::Handle, bool, bool, unsigned int, JS::Value*, 
JS::MutableHandle) ()
   from /usr/local/lib/firefox/libxul.so.99.0
#31 0x0b38cad6b20a in js::jit::InvokeFromInterpreterStub(JSContext*, 
js::jit::InterpreterStubExitFrameLayout*) ()
   from /usr/local/lib/firefox/libxul.so.99.0

Re: panic: uao_fin_swhash_elt: can't allocate entry

2021-02-23 Thread Martin Pieuchot

On 22/02/21(Mon) 13:48, Stuart Henderson wrote:
> Not much information on this but it's an unusual one so I thought I'd
> post in case it's of interest to anyone. (Re-typed from a screen photo,
> it's remote and used by non-technical people, this is all I have).
> 
> panic: uao_fin_swhash_elt: can't allocate entry
> Stopped at db_enter+0x10: popq %rbp
> TID   PID UID PRFLAGS PFLAGS  CPU COMMAND
> 38724523522   10010x100   0   sh
> *428940   98261   0   0x14000 0x200   1K  pagedaemon
> db_enter+0x10
> panic+0x12a
> uao_set_swslot(fd80c1ecc980,150,1f4d1) at uao_set_swslot+0x1a1
> uvmpd_scan_inactive(82188790) at uvmpd_scan_inactive+0x537
> uvmpd_scan+0x9f
> uvm_pageout(800053d0) at uvm_pageout+0x375
> end trace frame 0x0, count: 9

If it happens again, could you include the output of "show uvmexp" and
"show all pools"?

Re: panic: uao_fin_swhash_elt: can't allocate entry

2021-02-23 Thread Martin Pieuchot

On 23/02/21(Tue) 07:53, Jonathan Matthew wrote:
> On Mon, Feb 22, 2021 at 01:48:01PM +, Stuart Henderson wrote:
> > Not much information on this but it's an unusual one so I thought I'd
> > post in case it's of interest to anyone. (Re-typed from a screen photo,
> > it's remote and used by non-technical people, this is all I have).
> > 
> > panic: uao_fin_swhash_elt: can't allocate entry
> 
> uao_find_swhash_elt():
> 
> /* allocate a new entry for the bucket and init/insert it in */
> elt = pool_get(&uao_swhash_elt_pool, PR_NOWAIT | PR_ZERO);
> /*
>  * XXX We cannot sleep here as the hash table might disappear
>  * from under our feet.  And we run the risk of deadlocking
>  * the pagedeamon.  In fact this code will only be called by
>  * the pagedaemon and allocation will only fail if we
>  * exhausted the pagedeamon reserve.  In that case we're
>  * doomed anyway, so panic.
>  */
> if (elt == NULL)
> panic("%s: can't allocate entry", __func__);
> 
> so it sounds like the machine was so out of memory it couldn't swap.

Another hypothesis would be a kind of deadlock.  Showing "ps", "all pools"
and "uvmexp" would help get a better understanding.
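
The comment quoted above describes an allocation that must never sleep,
so the caller has to cope with a NULL return.  Here is a minimal userland
sketch of that pattern.  The toy_pool names are hypothetical and this is
not the kernel pool(9) API, only an illustration of "get or fail at once":

```c
#include <stddef.h>
#include <string.h>

#define TOY_POOL_NITEMS	4	/* arbitrary size for the sketch */

struct toy_elt {
	int tag;
	int count;
};

struct toy_pool {
	struct toy_elt items[TOY_POOL_NITEMS];
	int used[TOY_POOL_NITEMS];
};

void
toy_pool_init(struct toy_pool *pp)
{
	memset(pp, 0, sizeof(*pp));
}

/*
 * Hand out a zeroed item without ever waiting.  Returns NULL at once
 * when every item is in use, so the caller has to deal with failure.
 */
struct toy_elt *
toy_pool_get_nowait(struct toy_pool *pp)
{
	size_t i;

	for (i = 0; i < TOY_POOL_NITEMS; i++) {
		if (!pp->used[i]) {
			pp->used[i] = 1;
			memset(&pp->items[i], 0, sizeof(pp->items[i]));
			return &pp->items[i];
		}
	}
	return NULL;
}

/* Return an item to the pool so that a later request can succeed. */
void
toy_pool_put(struct toy_pool *pp, struct toy_elt *elt)
{
	pp->used[elt - pp->items] = 0;
}
```

In the kernel code the failure branch is the panic seen in the report;
in this sketch the caller simply sees NULL once the pool is exhausted.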

libunwind & static+no-pie binaries

2019-11-08 Thread Martin Pieuchot

The test program below, provided by robert@, blows up when compiled with
"-static" and "-no-pie":

  $ c++ -no-pie -static e.cc && ./a.out 
  Segmentation fault (core dumped)

  #0  libunwind::EHHeaderParser::decodeEHHdr 
(addressSpace=..., ehHdrStart=4211204, ehHdrEnd=4876, ehHdrInfo=...)
at /usr/src/lib/libcxxabi/../libunwind/src/EHHeaderParser.hpp:60
  #1  libunwind::LocalAddressSpace::findUnwindSections(unsigned long, 
libunwind::UnwindInfoSections&)::{lambda(dl_phdr_info*, unsigned long, 
void*)#1}::operator()(dl_phdr_info*, unsigned long, void*) const 
(this=, pinfo=0x24a058 <_static_phdr_info>, data=)
  at /usr/src/lib/libcxxabi/../libunwind/src/AddressSpace.hpp:598
  #2  0x002110c4 in libunwind::LocalAddressSpace::findUnwindSections 
(this=, targetAddr=, info=...)
  at /usr/src/lib/libcxxabi/../libunwind/src/AddressSpace.hpp:538
  #3  libunwind::UnwindCursor::setInfoBasedOnIPRegister (this=0x7f7f8a08, 
isReturnAddress=)
  at /usr/src/lib/libcxxabi/../libunwind/src/UnwindCursor.hpp:1827
  #4  0x002103ee in unw_init_local (cursor=0x7f7f8a08, 
context=) at 
/usr/src/lib/libcxxabi/../libunwind/src/libunwind.cpp:82
  #5  0x0020fd8c in unwind_phase1 (uc=0x20f600 
<__cxxabiv1::exception_cleanup_func(_Unwind_Reason_Code, _Unwind_Exception*)>, 
  cursor=0x247f88 +16>, exception_object=0x2749dfe60)
  at /usr/src/lib/libcxxabi/../libunwind/src/UnwindLevel1.c:39
  #6  _Unwind_RaiseException (exception_object=0x2749dfe60) at 
/usr/src/lib/libcxxabi/../libunwind/src/UnwindLevel1.c:357
  #7  0x0020f5f3 in __cxa_throw (thrown_object=0x2749dfe80, 
tinfo=0x2475c8 , dest=)
  at /usr/src/lib/libcxxabi/src/cxa_exception.cpp:281
  #8  0x0020c38f in division(int, int) ()
  #9  0x0020c410 in main ()

That means it's currently impossible to profile C++ binaries on OpenBSD,
which is what we need :o)


#include <cstdint>
using namespace std;

double division(int a, int b) {
  if (b == 0) {
    throw "Division by zero condition!";
  }
  return (a/b);
}

int main () {
  int x = 50;
  int y = 0;
  double z = 0;

  for (uint64_t n = 40; n > 0; n--) {
    try {
      z = division(x, y);
    } catch (const char* msg) {
    }
  }

  return 0;
}

Signal & half stopped process

2019-11-19 Thread Martin Pieuchot

When debugging a multi-threaded process with egdb(1), exiting the
debugger generally results in this:

  PID      TID PRI NICE  SIZE   RES STATE     WAIT      TIME    CPU COMMAND
15448   242044  10    0   64M  179M idle      fsleep    0:11  0.00% soffice.bin
15448   251679  10    0   64M  179M stop/2    -         0:00  0.00% soffice.bin
15448   367261   2    0   64M  179M stop/3    -         0:00  0.00% soffice.bin
15448   203267   2    0   64M  179M stop/0    -         0:00  0.00% soffice.bin
15448   128499  10    0   64M  179M stop/1    -         0:00  0.00% soffice.bin
15448   369455   2    0   64M  179M stop/1    -         0:00  0.00% soffice.bin

One or many threads are still in 'stop'.  I need to manually send a SIGCONT
for the process to exit. 

Any idea?

Re: Signal & half stopped process

2019-11-19 Thread Martin Pieuchot

On 19/11/19(Tue) 11:22, Martin Pieuchot wrote:
> When debugging a multi-threaded process with egdb(1), exiting the
> debugger generally result in this:
> 
>   PID      TID PRI NICE  SIZE   RES STATE     WAIT      TIME    CPU COMMAND
> 15448   242044  10    0   64M  179M idle      fsleep    0:11  0.00% soffice.bin
> 15448   251679  10    0   64M  179M stop/2    -         0:00  0.00% soffice.bin
> 15448   367261   2    0   64M  179M stop/3    -         0:00  0.00% soffice.bin
> 15448   203267   2    0   64M  179M stop/0    -         0:00  0.00% soffice.bin
> 15448   128499  10    0   64M  179M stop/1    -         0:00  0.00% soffice.bin
> 15448   369455   2    0   64M  179M stop/1    -         0:00  0.00% soffice.bin
> 
> One or many threads are still in 'stop'.  I need to manually send a SIGCONT
> for the process to exit. 
> 
> Any idea?

After reading the kernel ptrace(2) and signal code, it seems to me that
PT_DETACH doesn't correctly handle multi-threaded processes that are in
SSTOP.

This makes me wonder if `p_xstat' shouldn't be moved to "struct process".
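
For reference, the manual workaround (sending SIGCONT to a stopped
process) can be sketched in userland as below.  This is only an
illustration with a single-threaded child that stops itself; it does not
reproduce the multi-threaded PT_DETACH problem, and the function name is
made up for the example:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>

/*
 * Fork a child that stops itself, then resume it with SIGCONT.
 * Returns the child's exit status, or -1 on error.
 */
int
resume_stopped_child(void)
{
	pid_t pid;
	int status;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		raise(SIGSTOP);	/* child parks itself in the stopped state */
		_exit(7);	/* only reached once SIGCONT is delivered */
	}

	/* Wait until the child is reported as stopped. */
	if (waitpid(pid, &status, WUNTRACED) == -1 || !WIFSTOPPED(status))
		return -1;

	/* The manual step from the report: send SIGCONT. */
	if (kill(pid, SIGCONT) == -1)
		return -1;

	/* The child can now run to completion. */
	if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

From a shell the equivalent step is kill(1) with -CONT on the PID shown
by top(1).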

Re: USB removal kernel panic

2020-01-15 Thread Martin Pieuchot

Thanks for the report.

> ddb{0}> 
> memcpy(80165000,fd804da1f728,8d8,80165000,b5bd47118ed5c95a,80165000) at memcpy+0x15
> uvideo_vs_cb(fd80778f2870,801667d8,0) at uvideo_vs_cb+0x8b
> usb_transfer_complete(fd80778f2870) at usb_transfer_complete+0x20f
> xhci_event_dequeue(800af000) at xhci_event_dequeue+0x103
> xhci_softintr(800af000) at xhci_softintr+0x2d
> softintr_dispatch(1) at softintr_dispatch+0xf2
> Xsoftnet(0,819c05e0,0,18041969,80,a) at Xsoftnet+0x1f
> Xspllower(0,0,c7ef80837208d4cc,8159c000,81983ee1,708000) at Xspllower+0x19
> free(8159c000,2,708000) at free+0x160
> uvideo_detach(80165000,1) at uvideo_detach+0x71
> config_detach(80165000,1) at config_detach+0x152
> usbd_detach(80137500,80086d00) at usbd_detach+0x5a
> uhub_port_connect(80086d00,4,2a0,286) at uhub_port_connect+0x68
> uhub_explore(800a9500) at uhub_explore+0x23d
> usb_explore(800a9400) at usb_explore+0x12b
> usb_task_thread(80001f8efb30) at usb_task_thread+0x10b
> end trace frame: 0x0, count: -16
> ddb{0}> 
> memcpy(80165000,fd804da1f728,8d8,80165000,b5bd47118

It seems that the pipes aren't closed when uvideo_detach() is called.
This is similar to the recent race fixed in uhidev(4).  It would be
great to find a generic way of handling this situation.

uhidev_detach() calls vdevgone() for example...

Re: USB removal kernel panic

2020-01-15 Thread Martin Pieuchot

On 15/01/20(Wed) 20:26, Vadim Zhukov wrote:
> I have a diff or two for that, will send when I'll come home.

After discussing the issue with Peter Stuge, we figured out that
the free should happen *after* calling config_detach() for the child
device (video(4)).

When video(4) is detached it will call:

vdevgone()->videoclose()->uvideo_close()

This last function sleeps until all I/O is finished or cancelled, as
part of usbd_close_pipe(9).
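
The ordering argument can be illustrated with a small userland sketch.
All toy_* names are hypothetical stand-ins, not the uvideo(4) or video(4)
code: the point is only that whatever stops the completion callbacks must
run before the buffer they write to is freed.

```c
#include <stdlib.h>
#include <string.h>

struct toy_cam {
	unsigned char *frame;	/* buffer filled by the completion callback */
	size_t framelen;
	int pipe_open;		/* nonzero while transfers may still complete */
};

int
toy_attach(struct toy_cam *sc, size_t len)
{
	sc->frame = malloc(len);
	if (sc->frame == NULL)
		return -1;
	sc->framelen = len;
	sc->pipe_open = 1;
	return 0;
}

/* Completion callback: copies data in as long as the pipe is open. */
int
toy_xfer_done(struct toy_cam *sc, const unsigned char *data, size_t len)
{
	if (!sc->pipe_open || sc->frame == NULL)
		return -1;
	if (len > sc->framelen)
		len = sc->framelen;
	memcpy(sc->frame, data, len);
	return 0;
}

/* Stand-in for detaching the child device, which closes the pipe. */
void
toy_close_pipe(struct toy_cam *sc)
{
	sc->pipe_open = 0;
}

void
toy_detach(struct toy_cam *sc)
{
	toy_close_pipe(sc);	/* 1. no callback can run after this */
	free(sc->frame);	/* 2. only now is it safe to free */
	sc->frame = NULL;
	sc->framelen = 0;
}
```

Swapping the two steps in toy_detach() gives the situation in the trace:
a callback doing memcpy() into memory that was already freed.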

Diff below should fix the issue.

Index: dev/video.c
===
RCS file: /cvs/src/sys/dev/video.c,v
retrieving revision 1.42
diff -u -p -r1.42 video.c
--- dev/video.c 6 Oct 2019 17:13:10 -   1.42
+++ dev/video.c 15 Jan 2020 19:11:20 -
@@ -463,9 +463,6 @@ videodetach(struct device *self, int fla
struct video_softc *sc = (struct video_softc *)self;
int maj, mn;
 
-   if (sc->sc_fbuffer != NULL)
-   free(sc->sc_fbuffer, M_DEVBUF, sc->sc_fbufferlen);
-
/* locate the major number */
for (maj = 0; maj < nchrdev; maj++)
if (cdevsw[maj].d_open == videoopen)
@@ -474,6 +471,8 @@ videodetach(struct device *self, int fla
/* Nuke the vnodes for any open instances (calls close). */
mn = self->dv_unit;
vdevgone(maj, mn, mn, VCHR);
+
+   free(sc->sc_fbuffer, M_DEVBUF, sc->sc_fbufferlen);
 
return (0);
 }
Index: dev/usb/uvideo.c
===
RCS file: /cvs/src/sys/dev/usb/uvideo.c,v
retrieving revision 1.205
diff -u -p -r1.205 uvideo.c
--- dev/usb/uvideo.c14 Oct 2019 09:20:48 -  1.205
+++ dev/usb/uvideo.c15 Jan 2020 19:09:48 -
@@ -644,10 +644,10 @@ uvideo_detach(struct device *self, int f
/* Wait for outstanding requests to complete */
usbd_delay_ms(sc->sc_udev, UVIDEO_NFRAMES_MAX);
 
-   uvideo_vs_free_frame(sc);
-
if (sc->sc_videodev != NULL)
rv = config_detach(sc->sc_videodev, flags);
+
+   uvideo_vs_free_frame(sc);
 
return (rv);
 }

make(1) regression

2020-01-29 Thread Martin Pieuchot

The diff below enables a ptrace(2) regress test coming from NetBSD.

With usr.bin/make built since -D2020-01-14, that includes -current, it
complains during the last test:

make: Child (52049) not in table?
FAILED

That results in a failing test; however, the syscall correctly reports
EBUSY.

Should I commit this first to help you look at the issue?

Index: Makefile
===
RCS file: /cvs/src/regress/lib/libc/sys/Makefile,v
retrieving revision 1.2
diff -u -p -r1.2 Makefile
--- Makefile13 Jan 2020 17:06:56 -  1.2
+++ Makefile14 Jan 2020 16:01:50 -
@@ -30,8 +30,8 @@ PROGS +=  t_access t_bind t_chroot t_cloc
 PROGS +=   t_getgroups t_getitimer t_getlogin t_getpid t_getrusage
 PROGS +=   t_getsid t_getsockname t_gettimeofday t_kill t_link t_listen
 PROGS +=   t_mkdir t_mknod t_msgctl t_msgget t_msgsnd t_msync t_pipe
-PROGS +=   t_poll t_revoke t_select t_sendrecv t_setuid t_socketpair
-PROGS +=   t_sigaction t_truncate t_umask t_write
+PROGS +=   t_poll t_ptrace t_revoke t_select t_sendrecv t_setuid
+PROGS +=   t_socketpair t_sigaction t_truncate t_umask t_write
 
 # failing tests
 .if 0
@@ -40,7 +40,6 @@ PROGS +=  t_mlock
 PROGS +=   t_mmap
 PROGS +=   t_msgrcv
 PROGS +=   t_pipe2
-PROGS +=   t_ptrace
 PROGS +=   t_stat
 PROGS +=   t_syscall
 PROGS +=   t_unlink
@@ -57,8 +56,9 @@ setup-t_truncate:
${SUDO} touch truncate_test.root_owned
${SUDO} chown root:wheel truncate_test.root_owned
 
-run-t_chroot: cleanup-t_chroot
-cleanup-t_chroot:
+run-t_chroot: cleanup-dir
+run-t_ptrace: cleanup-dir
+cleanup-dir:
${SUDO} rm -rf dir
 
 CLEANFILES =   access dummy mmap truncate_test.root_owned
@@ -100,3 +100,5 @@ run-${PROG}-$n:
 .endif
 
 .include 
+
+clean: cleanup-dir
Index: README
===
RCS file: /cvs/src/regress/lib/libc/sys/README,v
retrieving revision 1.2
diff -u -p -r1.2 README
--- README  22 Nov 2019 15:59:53 -  1.2
+++ README  28 Nov 2019 17:13:08 -
@@ -18,6 +18,7 @@ t_getrusage   - no expected fail, PR kern/
 t_mknod- remove tests for unsupported file types
 t_msgget   - remove msgget_limit test
 t_poll - remove pollts_* tests
+t_ptrace   - change EPERM -> EINVAL for PT_ATTACH of a parent 
 t_revoke   - remove basic tests, revoke only on ttys supported
 t_select   - remove sigset_t struct as it is int on OpenBSD
 
@@ -26,7 +27,6 @@ t_mlock   - wrong errno, succeeds where n
 t_mmap - ENOTBLK on test NetBSD is skipping, remove mmap_va0 test
 t_msgrcv   - msgrcv(id, &r, 3 - 1, 0x41, 004000) != -1
 t_pipe2- closefrom(4) == -1, remove F_GETNOSIGPIPE and nosigpipe test
-t_ptrace   - ptrace(0, 0, ((void *)0), 0) != -1
 t_stat - invalid GID with doas
 t_syscall  - SIGSEGV
 t_unlink   - wrong errno according to POSIX
Index: macros.h
===
RCS file: /cvs/src/regress/lib/libc/sys/macros.h,v
retrieving revision 1.1.1.1
diff -u -p -r1.1.1.1 macros.h
--- macros.h19 Nov 2019 19:57:03 -  1.1.1.1
+++ macros.h29 Jan 2020 12:45:56 -
@@ -9,6 +9,7 @@
 
 #include 
 #include 
+#include 
 
 #define __RCSID(str)
 #define __COPYRIGHT(str)
@@ -26,17 +27,26 @@ int sysctlbyname(char *, void *, size_t 
 int
 sysctlbyname(char* s, void *oldp, size_t *oldlenp, void *newp, size_t newlen)
 {
-   int ktc;
-   if (strcmp(s, "kern.timecounter.hardware") == 0)
-   ktc = KERN_TIMECOUNTER_HARDWARE;
-   else if (strcmp(s, "kern.timecounter.choice") == 0)
-   ktc = KERN_TIMECOUNTER_CHOICE;
+int mib[3], miblen;
 
-int mib[3];
mib[0] = CTL_KERN;
-   mib[1] = KERN_TIMECOUNTER;
-   mib[2] = ktc;
-return sysctl(mib, 3, oldp, oldlenp, newp, newlen);
+   if (strcmp(s, "kern.timecounter.hardware") == 0) {
+   mib[1] = KERN_TIMECOUNTER;
+   mib[2] = KERN_TIMECOUNTER_HARDWARE;
+   miblen = 3;
+   } else if (strcmp(s, "kern.timecounter.choice") == 0) {
+   mib[1] = KERN_TIMECOUNTER;
+   mib[2] = KERN_TIMECOUNTER_CHOICE;
+   miblen = 3;
+   } else if (strcmp(s, "kern.securelevel") == 0) {
+   mib[1] = KERN_SECURELVL;
+   miblen = 2;
+   } else {
+   fprintf(stderr, "%s(): mib '%s' not supported\n", __func__, s);
+   return -42;
+   }
+
+return sysctl(mib, miblen, oldp, oldlenp, newp, newlen);
 }
 
 /* t_mlock.c */
Index: t_ptrace.c
===
RCS file: /cvs/src/regress/lib/libc/sys/t_ptrace.c,v
retrieving revision 1.1.1.1
diff -u -p -r1.1.1.1 t_ptrace.c
--- t_ptrace.c  19 Nov 2019 19:57:04 -  1.1.1.1
+++ t_ptrace.c  29 Jan 2020 12:54:05 -
@@ -171

Re: make(1) regression

2020-01-29 Thread Martin Pieuchot

On 29/01/20(Wed) 15:00, Marc Espie wrote:
> On Wed, Jan 29, 2020 at 02:04:06PM +0100, Martin Pieuchot wrote:
> > Diff below enables a ptrace(2) regress coming from NetBSD.
> > 
> > With usr.bin/make built since -D2020-01-14, that includes -current, it
> > complains during the last test:
> > 
> > make: Child (52049) not in table?
> > FAILED
> > 
> > That results in a failing test, however the syscall correctly reports
> > EBUSY.
> > 
> > Should I commit this first to help you look at the issue?
> 
> At first I thought forgetting to handle WIFSTOPPED might explain things.
> 
> But looking more closely, I think the changes in make just made a system 
> bug more apparent.

Indeed I can reproduce it.  Thanks for hunting that down!

i915/drm vs WITNESS

2020-02-12 Thread Martin Pieuchot

Some warnings reported by WITNESS:

witness: lock order reversal:
 1st 0x81332b38 &rq->lock (&rq->lock)
 2nd 0x806a0050 rcs0 (&timeline->lock)
lock order "&timeline->lock"(mutex) -> "&rq->lock"(mutex) first seen at:
#0  witness_checkorder+0x449
#1  mtx_enter+0x34
#2  __i915_request_submit+0x5b
#3  __execlists_submission_tasklet+0x1b9
#4  execlists_submit_request+0x1d1
#5  submit_notify+0x37
#6  __i915_sw_fence_complete+0x40
#7  i915_request_add+0x2d3
#8  i915_gem_init+0x2b9
#9  i915_driver_load+0x81b
#10 inteldrm_attachhook+0x2c
#11 config_process_deferred_mountroot+0x6b
#12 main+0x755
#13 longmode_hi+0x9c
lock order "&rq->lock"(mutex) -> "&timeline->lock"(mutex) first seen at:
#0  witness_checkorder+0x449
#1  mtx_enter+0x34
#2  execlists_submit_request+0x2a
#3  submit_notify+0x37
#4  __i915_sw_fence_complete+0x40
#5  dma_i915_sw_fence_wake+0x1d
#6  notify_ring+0x1a8
#7  gen8_gt_irq_handler+0xba
#8  gen8_irq_handler+0x114
#9  intr_handler+0x6e
#10 Xintr_ioapic_edge16_untramp+0x19f
#11 acpicpu_idle+0x1d2
#12 sched_idle+0x225
#13 proc_trampoline+0x1c


witness: lock order reversal:
 1st 0x81332678 &wqh->lock (&wqh->lock)
 2nd 0x806a0050 rcs0 (&timeline->lock)
lock order "&wqh->lock"(mutex) -> "&timeline->lock"(mutex) first seen at:
#0  witness_checkorder+0x449
#1  mtx_enter+0x34
#2  execlists_submit_request+0x2a
#3  submit_notify+0x37
#4  __i915_sw_fence_complete+0x40
#5  i915_sw_fence_wake+0x39
#6  __i915_sw_fence_complete+0x131
#7  dma_i915_sw_fence_wake+0x1d
#8  notify_ring+0x1a8
#9  gen8_gt_irq_handler+0xba
#10 gen8_irq_handler+0x114
#11 intr_handler+0x6e
#12 Xintr_ioapic_edge16_untramp+0x19f
#13 acpicpu_idle+0x1d2
#14 sched_idle+0x225
#15 proc_trampoline+0x1c


witness: acquiring duplicate lock of same type: "&wqh->lock"
 1st &wqh->lock
 2nd &wqh->lock
Starting stack trace...
witness_checkorder(81333980,9,0) at witness_checkorder+0x6ba
mtx_enter(81333970) at mtx_enter+0x34
__i915_sw_fence_complete(81333970,800022280270) at 
__i915_sw_fence_complete+0x58
i915_sw_fence_wake(813339c8,1,0,800022280270) at 
i915_sw_fence_wake+0x39
__i915_sw_fence_complete(81332668,0) at __i915_sw_fence_complete+0x131
dma_i915_sw_fence_wake(813322c8,81355b20) at 
dma_i915_sw_fence_wake+0x1d
notify_ring(80a75000) at notify_ring+0x1a8
gen8_gt_irq_handler(80154000,2,8000222803b0) at 
gen8_gt_irq_handler+0xba
gen8_irq_handler(0,80154078) at gen8_irq_handler+0x114
intr_handler(800022280450,8013fd00) at intr_handler+0x6e
Xintr_ioapic_edge16_untramp() at Xintr_ioapic_edge16_untramp+0x19f
acpicpu_idle() at acpicpu_idle+0x1d2
sched_idle(81e0) at sched_idle+0x225
end trace frame: 0x0, count: 244
End of stack trace.
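
For readers unfamiliar with these reports, here is a much simplified
sketch of the two checks shown above: a lock order reversal and a
duplicate acquisition of the same lock class.  The toy_witness names are
hypothetical, and the real WITNESS code tracks far more (for example
orderings implied through intermediate locks); this only records direct
pairs.

```c
#include <string.h>

#define TOY_NCLASSES	8	/* arbitrary limit for the sketch */

enum toy_verdict {
	TOY_OK = 0,
	TOY_REVERSAL,	/* the opposite order was recorded earlier */
	TOY_DUPLICATE,	/* same lock class acquired twice */
	TOY_BADARG
};

struct toy_witness {
	/* order[a][b] != 0: class b was acquired while class a was held */
	unsigned char order[TOY_NCLASSES][TOY_NCLASSES];
};

void
toy_witness_init(struct toy_witness *w)
{
	memset(w, 0, sizeof(*w));
}

/* Check, and if acceptable record, "wanted acquired while held is held". */
enum toy_verdict
toy_witness_acquire(struct toy_witness *w, int held, int wanted)
{
	if (held < 0 || held >= TOY_NCLASSES ||
	    wanted < 0 || wanted >= TOY_NCLASSES)
		return TOY_BADARG;
	if (held == wanted)
		return TOY_DUPLICATE;
	if (w->order[wanted][held])
		return TOY_REVERSAL;
	w->order[held][wanted] = 1;
	return TOY_OK;
}
```

With class 0 standing for "&rq->lock" and class 1 for "&timeline->lock",
recording (1, 0) first and then asking for (0, 1) yields TOY_REVERSAL,
which mirrors the first report.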

Re: NSD sendto issue

2020-02-17 Thread Martin Pieuchot

On 17/02/20(Mon) 14:55, Joerg Jung wrote:
> 
> > On 26. Sep 2019, at 15:02, Stuart Henderson  wrote:
> > On 2019/09/26 13:45, Stuart Henderson wrote:
> >> On 2019/09/26 11:16, Joerg Jung wrote:
> >>> Hi,
> >>> 
> >>> I run a few busy (~800 req/s) NSD servers which I upgraded 
> >>> to 6.5, all stock/default OpenBSD, e.g. I’ve not tweaked any 
> >>> sysctl values and nsd.conf matches the default as well, just 
> >>> added a few hundred zones.
> >>> 
> >>> Now, when I increase servers from default 1 to 2 in nsd.conf: 
> >>>   server-count: 2
> >>> it starts spamming my log with:
> >>>   nsd[62723]: sendto 1.2.3.4 failed: Resource temporarily unavailable
> >>> 
> >>> checking the source, server.c seems not to handle EAGAIN 
> >>> after sendto() and does not recover or retry, it just increases
> >>> txerr statistic count - so answer seems really lost :(
> >>> 
> >>> I tried higher debug level, as well as increasing socket buffers to: 
> >>>   net.inet.udp.recvspace= 65536
> >>>   net.inet.udp.sendspace=65636
> >>> but both didn’t help and netstat -s -p udp does show 
> >>>   0 dropped due to full socket buffers  
> >>> anyways. So, I don’t believe this is a socket buffer issue.
> >>> 
> >>> The same server-count: 2 setting worked fine with 6.3.
> >>> 
> >>> Any hints, insights, or pointers?
> >>> Does anyone else experience the same?
> >>> 
> >>> Thanks,
> >>> Regards,
> >>> Joerg
> >> 
> >> Maybe it's worth trying to track down further whether this is due to an
> >> NSD change or something else in the OS - cvs up -r OPENBSD_6_3 .. (be sure
> >> to use "make -f Makefile.bsd-wrapper [..]" when building).
> >> 
> > 
> > Or, following a comment from claudio@, try a kernel built with this:
> 
> FYI, I tried that diff and a few other things but neither did help. 

Did you ktrace(1) the problem?  How is sendto(2) called?  In particular,
is MSG_DONTWAIT passed, or is FNONBLOCK set on the file descriptor?  If
neither is the case, does that mean the kernel returns EWOULDBLOCK even
though userland said it is fine to block?
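
Besides ktrace(1), the two conditions can also be checked from userland.
The helper names below are made up for this sketch; O_NONBLOCK is, as
far as I know, the userland spelling of the flag the kernel calls
FNONBLOCK.  Send timeouts set with SO_SNDTIMEO are ignored here.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <fcntl.h>

/* Return 1 if O_NONBLOCK is set on fd, 0 if it is not, -1 on error. */
int
fd_is_nonblocking(int fd)
{
	int fl = fcntl(fd, F_GETFL);

	if (fl == -1)
		return -1;
	return (fl & O_NONBLOCK) ? 1 : 0;
}

/* Set O_NONBLOCK on fd.  Returns 0 on success, -1 on error. */
int
fd_set_nonblocking(int fd)
{
	int fl = fcntl(fd, F_GETFL);

	if (fl == -1)
		return -1;
	return fcntl(fd, F_SETFL, fl | O_NONBLOCK) == -1 ? -1 : 0;
}

/*
 * Return 1 if a send on fd with these per-call flags asked not to
 * block (so EAGAIN is an expected answer), 0 if the caller asked for
 * a blocking send, -1 on error.
 */
int
send_may_return_eagain(int fd, int msgflags)
{
	if (msgflags & MSG_DONTWAIT)
		return 1;
	return fd_is_nonblocking(fd);
}
```

If this returns 0 for the descriptor and flags nsd uses, yet sendto(2)
still fails with EAGAIN, the error does not come from a non-blocking
request made by userland.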


> 
> > Index: syscalls.master
> > ===
> > RCS file: /cvs/src/sys/kern/syscalls.master,v
> > retrieving revision 1.189
> > diff -u -p -r1.189 syscalls.master
> > --- syscalls.master 11 Jan 2019 18:46:30 -  1.189
> > +++ syscalls.master 26 Sep 2019 13:01:46 -
> > @@ -261,7 +261,7 @@
> > 130 OBSOL   oftruncate
> > 131 STD { int sys_flock(int fd, int how); }
> > 132 STD { int sys_mkfifo(const char *path, mode_t mode); }
> > -133STD NOLOCK  { ssize_t sys_sendto(int s, const void *buf, \
> > +133STD { ssize_t sys_sendto(int s, const void *buf, \
> > size_t len, int flags, const struct sockaddr *to, \
> > socklen_t tolen); }
> > 134 STD { int sys_shutdown(int s, int how); }
> > 
> > 
> > Run "make syscalls" in sys/kern before building.
>

Re: upd(4): force boolean indicator to be 0 or 1

2020-02-27 Thread Martin Pieuchot

On 27/02/20(Thu) 16:58, boudew...@indes.com wrote:
> >Synopsis:boolean indicators in sensorsd.conf(5) are too cumbersome
> >Category:system
> >Environment:
>   System  : OpenBSD 6.6
>   Details : OpenBSD 6.6 (GENERIC.MP) #372: Sat Oct 12 10:56:27 MDT 
> 2019
>
> dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> 
>   Architecture: OpenBSD.amd64
>   Machine : amd64
> >Description:
>   Some upd(4) devices use -1 for "On" and some use 1.  sysctl(8) and
> sensorsd(8) hide this detail from the user, which makes it difficult
> to define low and high values in sensorsd.conf(5).

Which device reports "-1" for which usage?  Is this from any
specification or is it a workaround for your device? 

The diff looks fine, although we could do something simpler; see below.

Index: upd.c
===
RCS file: /cvs/src/sys/dev/usb/upd.c,v
retrieving revision 1.26
diff -u -p -r1.26 upd.c
--- upd.c   8 Apr 2017 02:57:25 -   1.26
+++ upd.c   27 Feb 2020 16:25:24 -
@@ -425,7 +425,10 @@ upd_sensor_update(struct upd_softc *sc, 
}
 
hdata = hid_get_data(buf, len, &sensor->hitem.loc);
-   sensor->ksensor.value = hdata * adjust;
+   if (sensor->ksensor.type == SENSOR_INDICATOR)
+   sensor->ksensor.value = hdata ? 1 : 0;
+   else
+   sensor->ksensor.value = hdata * adjust;
sensor->ksensor.status = SENSOR_S_OK;
sensor->ksensor.flags &= ~SENSOR_FINVALID;
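
The effect of the change can be shown in isolation with a small
standalone function.  The toy_* names and types are made up for the
example and are not the upd(4) code; only the branch on the indicator
type is taken from the diff.

```c
#include <stdint.h>

enum toy_sensor_type {
	TOY_SENSOR_INDICATOR,	/* boolean sensor */
	TOY_SENSOR_OTHER	/* any scaled, non-boolean sensor */
};

/*
 * Compute the value to expose: indicators collapse to 0 or 1 whatever
 * non-zero value the device reported, everything else is scaled.
 */
int64_t
toy_sensor_value(enum toy_sensor_type type, int32_t hdata, int64_t adjust)
{
	if (type == TOY_SENSOR_INDICATOR)
		return hdata ? 1 : 0;
	return (int64_t)hdata * adjust;
}
```

A device reporting -1 and one reporting 1 for "On" both end up as 1, so
a single pair of limits in sensorsd.conf(5) covers both.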

Re: upd(4): force boolean indicator to be 0 or 1

2020-02-28 Thread Martin Pieuchot

On 28/02/20(Fri) 10:02, Boudewijn Dijkstra wrote:
> On Thu, 27 Feb 2020 17:30:34 +0100, Martin Pieuchot wrote:
> > On 27/02/20(Thu) 16:58, boudew...@indes.com wrote:
> > > >Synopsis:boolean indicators in sensorsd.conf(5) are too 
> > > >cumbersome
> > > >Category:system
> > > >Environment:
> > >   System  : OpenBSD 6.6
> > >   Details : OpenBSD 6.6 (GENERIC.MP) #372: Sat Oct 12 10:56:27
> > > MDT 2019
> > > dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> > > 
> > >   Architecture: OpenBSD.amd64
> > >   Machine : amd64
> > > >Description:
> > >   Some upd(4) devices use -1 for "On" and some use 1.  sysctl(8) and
> > > sensorsd(8) hide this detail from the user, which makes it difficult
> > > to define low and high values in sensorsd.conf(5).
> > 
> > Which device reports "-1" for which usage?  Is this from any
> > specification or is it a workaround for your device?
> 
> In the misc@ thread I linked it was reported that different devices use
> different values.  My device happens to report -1 for "On".  Given how
> sensorsd.conf currently works, it would be most convenient if 0 and 1 were
> the only possible values.

You're rephrasing your diff in words.  My question is: can there be any
drawback to this approach?  Did you check the spec?  Why is your UPS
returning -1 and not 1 in this case?  Is this the right place to fix the
bug?

> > Diff looks fine, although we could do simpler, see below.
> > 
> > Index: upd.c
> > ===
> > RCS file: /cvs/src/sys/dev/usb/upd.c,v
> > retrieving revision 1.26
> > diff -u -p -r1.26 upd.c
> > --- upd.c   8 Apr 2017 02:57:25 -   1.26
> > +++ upd.c   27 Feb 2020 16:25:24 -
> > @@ -425,7 +425,10 @@ upd_sensor_update(struct upd_softc *sc,
> > }
> > hdata = hid_get_data(buf, len, &sensor->hitem.loc);
> > -   sensor->ksensor.value = hdata * adjust;
> > +   if (sensor->ksensor.type == SENSOR_INDICATOR)
> > +   sensor->ksensor.value = hdata ? 1 : 0;
> > +   else
> > +   sensor->ksensor.value = hdata * adjust;
> > sensor->ksensor.status = SENSOR_S_OK;
> > sensor->ksensor.flags &= ~SENSOR_FINVALID;
> 
> Your diff is indeed simpler, but I thought it would be cleaner to not assign
> 'adjust' when it's not needed.

That's an improvement indeed, but it isn't related to the bug you're
trying to fix ;)

Re: upd(4): force boolean indicator to be 0 or 1

2020-02-28 Thread Martin Pieuchot

On 28/02/20(Fri) 12:34, Boudewijn Dijkstra wrote:
> On Fri, 28 Feb 2020 11:14:43 +0100, Martin Pieuchot wrote:
> > On 28/02/20(Fri) 10:02, Boudewijn Dijkstra wrote:
> > > On Thu, 27 Feb 2020 17:30:34 +0100, Martin Pieuchot wrote:
> > > > On 27/02/20(Thu) 16:58, boudew...@indes.com wrote:
> > > > > >Synopsis:boolean indicators in sensorsd.conf(5) are too 
> > > > > >cumbersome
> > > > > >Category:system
> > > > > >Environment:
> > > > >   System  : OpenBSD 6.6
> > > > >   Details : OpenBSD 6.6 (GENERIC.MP) #372: Sat Oct 12 10:56:27
> > > > > MDT 2019
> > > dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> > > > >
> > > > >   Architecture: OpenBSD.amd64
> > > > >   Machine : amd64
> > > > > >Description:
> > > > >   Some upd(4) devices use -1 for "On" and some use 1.  sysctl(8) 
> > > > > and
> > > > > sensorsd(8) hide this detail from the user, which makes it difficult
> > > > > to define low and high values in sensorsd.conf(5).
> > > >
> > > > Which device reports "-1" for which usage?  Is this from any
> > > > specification or is it a workaround for your device?
> > > 
> > > In the misc@ thread I linked it was reported that different devices use
> > > different values.  My device happens to report -1 for "On".  Given how
> > > sensorsd.conf currently works, it would be most convenient if 0 and
> > > 1 were the only possible values.
> > 
> > You're rephrasing your diff in words.  My question is: can there be any
> > drawback to this approach?
> 
> They're boolean indicators, I don't think there can be any.  sensorsd(8) is
> the only program in base that can use upd(4) sensors.  sysctl(8) already
> treats them as booleans.  Obviously some people will have to change their
> sensorsd.conf(5) if this goes in.
> 
> > Did you check the spec?
> 
> I checked "Universal Serial Bus Usage Tables for HID Power Devices"
> https://www.usb.org/sites/default/files/documents/pdcv10.pdf
> For every boolean indicator it specifies two allowed values: 0 and 1.
> 
> > Why is your UPS returning -1 and not 1 in this case?
> 
> No idea.  It's working fine.  Some further testing revealed that On==-1 (and
> Off==0) for all the indicators that I can easily toggle (Charging,
> Discharging, ACPresent).
> 
> > Is this the right place to fix the bug?
> 
> I think so. It's a violation of HID Power, not generic HID, so it should be
> fixed in a place where HID Power data is (first) interpreted.

Thanks for checking.  I committed the simpler diff.  If you have any
other improvements for upd(4) or any other part of the system, you're
welcome to submit them to tech@.

Thanks again,
Martin

Re: [macppc] GENERIC.MP panics under high load

2020-03-28 Thread Martin Pieuchot

On 27/03/20(Fri) 22:43, Charlene Wendling wrote:
> Hi,
> 
> >Environment:
> System  : OpenBSD 6.6
> Details : OpenBSD 6.6-current (GENERIC.MP) #676: Fri Feb 14
> 02:26:37 MST 2020
> dera...@macppc.openbsd.org:/usr/src/sys/arch/macppc/compile/GENERIC.MP
> 
> Architecture: OpenBSD.macppc
> Machine : macppc
> >Description:
> 
> Note that it's still reproducible with more recent snapshots.
> 
> Running GENERIC.MP causes kernel panics if it's under high
> load. Running GENERIC causes no such issues on the two dual
> core machines belonging to the macppc ports building cluster.
> 
> It's happening since early December 2019, but is occurring even
> more since the last few weeks, at a rate becoming harmful, hence my
> report.
> 
> >How-To-Repeat:
> 
> Start a bulk with dpb(1) with GENERIC.MP, it should panic anytime
> before 4 days. If you're lucky it will crash straight while listing
> ports.

Thanks for the report.  If you have the patience to keep gathering
such crashes, please do send the same report every time.  It is
interesting to see that CPU0 is in uvm_swap_io() here.

It would be nice to know if there's a common pattern between what seems
to be a memory corruption on CPU1 and what CPU0 is doing at that moment.

This might be a MD or MI bug, so the more information you get us the
better :o)

> 
> >Fix: 
> 
> None.
> 
> --
> 
> ddb{1}> machine ddbcpu 0   
> Stopped at  db_enter+0x10:  lwz r0,36(r1)
> db_enter() at db_enter+0xc   
> openpic_ipi_ddb() at openpic_ipi_ddb+0xc
> openpic_ext_intr() at openpic_ext_intr+0x254
> extint_call() at extint_call
> --- interrupt ---   
> at 0xe000dffc
> ttyinput(e0005a00,e0008100) at ttyinput+0x8c
> zstty_rxsoft(6428,e0019000) at zstty_rxsoft+0x150
> zstty_softint(5ab65d38) at zstty_softint+0xb0
> zsc_intr_soft(ecd8) at zsc_intr_soft+0x7c
> zssoft(ecd8) at zssoft+0x64  
> softintr_dispatch(ec00) at softintr_dispatch+0x80
> dosoftint(1) at dosoftint+0xa4   
> openpic_splx(100) at openpic_splx+0xa4
> splx(65727000) at splx+0x1c   
> end trace frame: 0xe629c780, count: 0
> 
> ddb{0}> trace
> db_enter() at db_enter+0xc
> openpic_ipi_ddb() at openpic_ipi_ddb+0xc
> openpic_ext_intr() at openpic_ext_intr+0x254
> extint_call() at extint_call
> --- interrupt ---   
> at 0xe000dffc
> ttyinput(e0005a00,e0008100) at ttyinput+0x8c
> zstty_rxsoft(6428,e0019000) at zstty_rxsoft+0x150
> zstty_softint(5ab65d38) at zstty_softint+0xb0
> zsc_intr_soft(ecd8) at zsc_intr_soft+0x7c
> zssoft(ecd8) at zssoft+0x64  
> softintr_dispatch(ec00) at softintr_dispatch+0x80
> dosoftint(1) at dosoftint+0xa4   
> openpic_splx(100) at openpic_splx+0xa4
> splx(65727000) at splx+0x1c   
> tsleep(6428,92,e629c7d0,0) at tsleep+0x98
> biowait(1) at biowait+0x5c   
> uvm_swap_io(,0,0,2000) at uvm_swap_io+0x5f4
> uvm_swap_get(3e60590,3e60590,e629c8e0) at uvm_swap_get+0x58
> uvmfault_anonget(400,5,e629c930) at uvmfault_anonget+0x1ac 
> uvm_fault(6ab1e668,40f8050,e629c970,20009034) at uvm_fault+0x554
> trap(6f3b63c8) at trap+0x68c
> trapagain() at trapagain+0x4
> --- trap (type 0x300) ---   
> at 0xe629cbf0
> ureadc(e0005a00,0) at ureadc+0x128
> ttread(6ab49338,300,e629cc90) at ttread+0x368
> zsread(f4f958,40004048,1a2454c0) at zsread+0x58
> spec_read(fe2f60) at spec_read+0x354   
> ufsspec_read(2001) at ufsspec_read+0x20
> VOP_READ(925e6c,f4f680,e629cdd0,0) at VOP_READ+0x50
> vn_read(1,1,e629ce20) at vn_read+0xc4  
> dofilereadv(6ab49338,e629ce48,e629cec0,6ab49374,2e) at dofilereadv+0xd0
> sys_read(d891b0a8,6ab49374,e629cea4) at sys_read+0x64  
> trap(6ab49338) at trap+0x9f0 
> trapagain() at trapagain+0x4
> --- syscall (number 3) ---  
> End of kernel: 0xfffcef70 
> end trace frame: 0xfffcef70, count: -34
> 
> ddb{0}> machine ddbcpu 1   
> Stopped at  db_enter+0x10:  lwz r0,36(r1)
> db_enter() at db_enter+0xc   
> panic(0) at panic+0xe0
> rw_assert_rdlock(e61f9e88) at rw_assert_rdlock+0x60
> rw_exit_read(9737f8) at rw_exit_read+0x1c  
> if_input_process(792280,e61f9f28) at if_input_process+0x68
> ifiq_process() at ifiq_process+0x78   
> taskq_thread(e0007040) at taskq_thread+0x58
> fork_trampoline() at fork_trampoline+0x14  
> end trace frame: 0x0, count: 7 
>   
> ddb{1}> trace 
> db_enter() at db_enter+0xc
> panic(0) at panic+0xe0
> rw_assert_rdlock(e61f9e88) at rw_assert_rdlock+0x60
> rw_exit_read(9737f8) at rw_exit_read+0x1c  
> if_input_process(792280,e61f9f28) at if_input_process+0x68
> ifiq_process() at ifiq_process+0x78   
> taskq_thread(e0007040) at taskq_thread+0x58
> fork_trampoline() at fork_t

Re: 6.6-current stutters after heavy disk loads

2020-03-28 Thread Martin Pieuchot

On 28/03/20(Sat) 09:33, Martin wrote:
> After about a week of tests on freshly installed system i can conclude that 
> two things affect on stutters 6.6 amd64 with all the patches included. To 
> exclude hardware related problems I've changed AMD SOC PC to a new different 
> one with the exactly the same configuration.

What do you mean by "stutters"?

Could you run "top -SH -s .3" and describe what you see when that
happens?

Are those "stutters", or whatever you mean by that, also present
with GENERIC (non-MP)?

Re: arpresolve: XX: route contains no arp information

2020-03-28 Thread Martin Pieuchot

On 28/03/20(Sat) 15:30, Stuart Henderson wrote:
> After updating my laptop from
> 
> OpenBSD 6.6-current (GENERIC.MP) #653: Thu Feb 20 21:40:37 MST 2020
> dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> 
> to
> 
> OpenBSD 6.6-current (GENERIC.MP) #84: Fri Mar 27 23:50:29 MDT 2020
> 
> I've started seeing a lot of these:
> 
> /bsd: arpresolve: 10.15.5.1: route contains no arp information
> last message repeated 436 times
> 
> Local subnet is working, traffic going via default route is not (ping
> reports "sendmsg: Invalid argument".
> 
> Network config is dhcp running on iwm0/em0 as separate interfaces (no trunk,
> just using default priorities to prefer wired) and the wlan is on the same
> subnet as ethernet. This previously worked just fine. ifconfig/route 
> tables/dmesg
> are below.
> 
> While I'm bisecting, does anyone have an idea what might have introduced it?

If that happens on the em(4) and not the iwm(4) that would indicate
that one of the em(4) changes might be the cause of the regression.

> 
> $ ifconfig | sed 's/IMEI [0-9]* /IMEI xxx /'
> lo0: flags=8049 mtu 32768
>   index 4 priority 0 llprio 3
>   groups: lo
>   inet6 ::1 prefixlen 128
>   inet6 fe80::1%lo0 prefixlen 64 scopeid 0x4
>   inet 127.0.0.1 netmask 0xff00
> iwm0: 
> flags=a08843 mtu 
> 1500
>   lladdr e4:a4:71:4f:84:36
>   index 1 priority 4 llprio 3
>   groups: wlan egress
>   media: IEEE802.11 autoselect (HT-MCS8 mode 11n)
>   status: active
>   ieee80211: join Y2 chan 6 bssid 04:4f:aa:0c:3a:e8 65% wpakey wpaprotos 
> wpa2 wpaakms psk wpaciphers ccmp wpagroupcipher ccmp
>   inet 10.15.5.125 netmask 0xff00 broadcast 10.15.5.255
>   inet6 fe80::e6a4:71ff:fe4f:8436%iwm0 prefixlen 64 scopeid 0x1
>   inet6 2a02:8011:7003:3:650f:596b:8366:9863 prefixlen 64 autoconf pltime 
> 604462 vltime 2591662
>   inet6 2a02:8011:7003:3:cd3b:7b95:ddd3:b52d prefixlen 64 autoconf 
> autoconfprivacy pltime 85922 vltime 604432
> em0: flags=a08843 
> mtu 1500
>   lladdr c8:5b:76:cf:a8:ca
>   index 2 priority 0 llprio 3
>   groups: egress
>   media: Ethernet autoselect (1000baseT full-duplex,rxpause,txpause)
>   status: active
>   inet 10.15.5.82 netmask 0xff00 broadcast 10.15.5.255
>   inet6 fe80::ca5b:76ff:fecf:a8ca%em0 prefixlen 64 scopeid 0x2
>   inet6 2a02:8011:7003:3:984f:bf36:3107:f0 prefixlen 64 autoconf pltime 
> 604462 vltime 2591662
>   inet6 2a02:8011:7003:3:77ba:944a:ab32:46c2 prefixlen 64 autoconf 
> autoconfprivacy pltime 85608 vltime 604427
> enc0: flags=0<>
>   index 3 priority 0 llprio 3
>   groups: enc
>   status: active
> umb0: flags=8810 mtu 1500
>   index 5 priority 6 llprio 3
>   roaming disabled registration not registered
>   state open cell-class none
>   SIM not initialized PIN valid (3 attempts left)
>   device EM7455 IMEI XXX firmware SWI9X30C_02.24.05.06
>   APN pp.vodafone.co.uk
>   status: down
> pflog0: flags=141 mtu 33136
>   index 6 priority 0 llprio 3
>   groups: pflog
> 
> 
> $ netstat -rnfinet
> Routing tables
> 
> Internet:
> DestinationGatewayFlags   Refs  Use   Mtu  Prio Iface
> default10.15.5.1  UGS2 2233 - 8 em0  
> default10.15.5.1  UGS00 -12 iwm0 
> 224/4  127.0.0.1  URS0  476 32768 8 lo0  
> 10.15.5/24 10.15.5.82 UCn3  128 - 4 em0  
> 10.15.5/24 10.15.5.125UCn10 - 8 iwm0 
> 10.15.5.1  00:00:5e:00:01:05  UHLch  18 - 7 iwm0 
> 10.15.5.2  f8:b1:56:ac:32:76  UHLc   0   64 - 3 em0  
> 10.15.5.5  link#2 UHLc   0   43 - 3 em0  
> 10.15.5.9  dc:a6:32:03:7a:01  UHLc   0   69 - 3 em0  
> 10.15.5.82 c8:5b:76:cf:a8:ca  UHLl   09 - 1 em0  
> 10.15.5.125e4:a4:71:4f:84:36  UHLl   07 - 1 iwm0 
> 10.15.5.25510.15.5.82 UHPb   0   36 - 1 em0  
> 10.15.5.25510.15.5.125UHPb   00 - 1 iwm0 
> 127/8  127.0.0.1  UGRS   00 32768 8 lo0  
> 127.0.0.1  127.0.0.1  UHhl   2  159 32768 1 lo0  
> 
> 
> OpenBSD 6.6-current (GENERIC.MP) #84: Fri Mar 27 23:50:29 MDT 2020
> dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> real mem = 8438898688 (8047MB)
> avail mem = 8170549248 (7792MB)
> mpath0 at root
> scsibus0 at mpath0: 256 targets
> mainbus0 at root
> bios0 at mainbus0: SMBIOS rev. 2.8 @ 0xd705d000 (63 entries)
> bios0: vendor LENOVO version "R02ET70W (1.43 )" date 01/28/2019
> bios0: LENOVO 20F6006YUK
> acpi0 at bios0: ACPI 5.0
> acpi0: sleep states S0 S3 S4 S5
> acpi0: tables DSDT FACP UEFI SSDT SSDT ECDT HPET APIC MCFG SSD

Re: arpresolve: XX: route contains no arp information

2020-03-30 Thread Martin Pieuchot

On 29/03/20(Sun) 17:17, Stuart Henderson wrote:
> [...] 
> I guess I'll just move it to a wifi network on a different vlan for now.

Well I wouldn't be surprised if the issue is exposed by the use of two
cloning routes.  One way to move forward would be to add the name of the
interface in the error message.

Re: arpresolve: XX: route contains no arp information

2020-03-31 Thread Martin Pieuchot

On 30/03/20(Mon) 22:11, Stuart Henderson wrote:
> On 2020/03/30 10:29, Martin Pieuchot wrote:
> > On 29/03/20(Sun) 17:17, Stuart Henderson wrote:
> > > [...] 
> > > I guess I'll just move it to a wifi network on a different vlan for now.
> > 
> > Well I wouldn't be surprise if the issue is exposed by the use of two
> > cloning routes.  One way to move forward would be to add the name of the
> > interface in the error message.
> > 
> 
> The arpresolve "route contains no arp information" is on em0.
> The only entry relating to the gateway (10.15.5.1) showing in
> arp -an after boot is
> 
> 10.15.5.1  (incomplete) iwm0 Expired
> 
> (no entry for the gateway on em0).
> 
> Oddities:
> 
> - if I connect to a WPA-PSK network then iwm0 comes up quickly in
> the "starting network" stage of /etc/rc and I have the problem. But
> if I switch to a WPA-Enterprise network which doesn't connect until
> wpa_supplicant starts ("starting package daemons" stage) then I don't
> see the problem.
> 
> - but it's not just timing related though: I can add "!ping -c1 1.1.1.1"
> or "!sleep 60" to hostname.em0 and still see the problem.

Are you saying that an incomplete ARP entry exists and has been cloned
via the route attached via iwm0?  However this entry is incomplete.

Now when trying to reach the address pointed to by this ARP entry via em0
no other cloned entry is created and you get the "no arp info" message.

That would imply the ARP cache isn't correctly flushed when em0 receives
an address on the same subnet as iwm0.

Re: 6.6-current stutters after heavy disk loads

2020-03-31 Thread Martin Pieuchot

On 31/03/20(Tue) 15:08, Martin wrote:
> 1. top -SH -s .3 points me that stutters arrive once process changing its 
> state from 'idle' to 'active' with related disk activity.

What about %spin and %intr?

> 2. Any machine with 6.6 GENERIC.MP affected.
> 2.1. 4-core AMD GX-420CA SOC - stutters more visible;
> 2.2. 2-core Intel i7-2640m - very rare stutters when process changing its 
> state from 'idle' to 'active'.
> 3. GENERIC (no MP) - stutters are minimal, after 48 hours I can see them 
> very, very rare and on AMD SOC only.

Valuable information.

Re: 6.6-current stutters after heavy disk loads

2020-04-02 Thread Martin Pieuchot

On 02/04/20(Thu) 12:58, Martin wrote:
> ‐‐‐ Original Message ‐‐‐
> On Tuesday, March 31, 2020 3:27 PM, Martin Pieuchot  wrote:
> 
> > On 31/03/20(Tue) 15:08, Martin wrote:
> >
> > > 1.  top -SH -s .3 points me that stutters arrive once process changing 
> > > its state from 'idle' to 'active' with related disk activity.
> >
> > What about %spin and %intr?
> 
> 1. AMD GX-420CA SOC 4-core 4-thread
> 
> CPU0 %spin from 2.0% to 17.0% %intr 30.0%-96.0%
> CPU1-3 %spin 0.0% (always) %intr 15.0%-99.0%
> 
> 2. i7-2640m 2-core 4-thread
> 
> CPU0 %spin from 0.0% to 3.0% %intr 0.0% (always)
> CPU2 %spin from 0.0% to 2.0% (rare) %intr 0.0% (always)

Interesting, so whatever it is, it seems related to or amplified by a lot
of time spent handling interrupts.

You can use "systat -s .3" and/or "vmstat -i" to figure out which
interrupt has a higher rate when you observe the symptoms.

If nobody has an idea of what that could be, another useful piece of
information would be a flamegraph produced while you observe the stutters.
For that you need to enable dt(4) in conf/GENERIC, build & install a new
kernel, build & install btrace(8) and set kern.allowdt=1 in /etc/sysctl.conf.
After rebooting into the new kernel run the following:

# btrace -e 'profile:hz:15 { printf("%s1\n", kstack); }' > kstack.txt 

and hit Ctrl+C to stop the profiling.

Then you can build the Flamegraph with the tools described below or
provide us the captured stack traces:
  https://github.com/brendangregg/FlameGraph
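
To illustrate what the flamegraph tooling does with the captured stacks,
here is a minimal userland sketch in C.  It assumes the samples have already
been converted to one "outer;inner;leaf" line per sample (the folded format
the FlameGraph scripts work with); the exact btrace(8) output layout is not
handled here, and the names fold_count(), hottest() and struct stack_count
are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_STACKS 64	/* arbitrary capacity for this sketch */

/* One distinct folded stack and how many samples hit it. */
struct stack_count {
	const char	*stack;
	unsigned int	 hits;
};

/*
 * Count identical folded stacks in samples[0..nsamples-1].
 * Distinct stacks are stored in out[] (capacity MAX_STACKS) in order
 * of first appearance; new stacks beyond the capacity are ignored.
 * Returns the number of distinct stacks stored.
 */
size_t
fold_count(const char *const *samples, size_t nsamples,
    struct stack_count *out)
{
	size_t i, j, nout = 0;

	assert(samples != NULL && out != NULL);
	for (i = 0; i < nsamples; i++) {
		for (j = 0; j < nout; j++) {
			if (strcmp(out[j].stack, samples[i]) == 0) {
				out[j].hits++;
				break;
			}
		}
		if (j == nout && nout < MAX_STACKS) {
			out[nout].stack = samples[i];
			out[nout].hits = 1;
			nout++;
		}
	}
	return nout;
}

/* Return the entry with the most hits, or NULL if there is none. */
const struct stack_count *
hottest(const struct stack_count *counts, size_t n)
{
	const struct stack_count *best = NULL;
	size_t i;

	for (i = 0; i < n; i++)
		if (best == NULL || counts[i].hits > best->hits)
			best = &counts[i];
	return best;
}
```

The width of a frame in the resulting graph is proportional to these hit
counts, so the stack returned by hottest() is where the sampled CPU time
is concentrated.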

Re: 6.6-current stutters after heavy disk loads

2020-04-02 Thread Martin Pieuchot

On 02/04/20(Thu) 13:59, Martin wrote:
> ‐‐‐ Original Message ‐‐‐
> On Thursday, April 2, 2020 1:21 PM, Martin Pieuchot  wrote:
> > On 02/04/20(Thu) 12:58, Martin wrote:
> >
> > > ‐‐‐ Original Message ‐‐‐
> > > On Tuesday, March 31, 2020 3:27 PM, Martin Pieuchot m...@openbsd.org 
> > > wrote:
> > >
> > > > On 31/03/20(Tue) 15:08, Martin wrote:
> > > >
> > > > > 1.  top -SH -s .3 points me that stutters arrive once process 
> > > > > changing its state from 'idle' to 'active' with related disk activity.
> > > >
> > > > What about %spin and %intr?
> > >
> > > 1.  AMD GX-420CA SOC 4-core 4-thread
> > >
> > > CPU0 %spin from 2.0% to 17.0% %intr 30.0%-96.0%
> > > CPU1-3 %spin 0.0% (always) %intr 15.0%-99.0%
> > >
> > > 2.  i7-2640m 2-core 4-thread
> > >
> > > CPU0 %spin from 0.0% to 3.0% %intr 0.0% (always)
> > > CPU2 %spin from 0.0% to 2.0% (rare) %intr 0.0% (always)
> >
> > Interesting so whatever that is it seems related or amplified by a lot
> > of time spent dealing with interrupt.
> >
> > You can use "systat -s .3" and/or "vmstat -i" to figure out which
> > interrupt has a higher rate when you observe the symptoms.
> 
> 1. AMD SOC
> systat -s .3 seems interrupts too (stutters) when system wide stutter appears.
> Interrupts
> 500-1200 total
> 96-98 clock
> 155-350,sometimes up to 1100 ipi

A lot of IPIs!  We're making progress.  This rings a bell, I'd suggest
you look at my slides/talk from EuroBSDCon2017 called: "Your scheduler
is not the problem".  This might not be the same problem but it gives
a lot of insight into how to debug further.

Which application are you running to trigger those?  What is the
"background process" that you're talking about?  Did you ktrace(1) it?
What is it doing when you see the stutters?

The picture now seems to be clearer: something is causing a high number
of IPIs.  That creates latency and all other tasks are somehow delayed,
resulting in some stuttering.

The question now becomes: why are so many IPIs being generated, and is it
possible to lower the insanely high rate?

Please make sure to do the ktracing first, that should give us the
userland view of the situation.  Then you could additionally do the
Flamegraph gathering which should give us the kernel view of the situation.

> > If nobody has a idea of what that could be, another useful information
> > would be to produce a flamegraph when you observe the stutters. For that
> > you need to enable dt(4) in conf/GENERIC build & install a new kernel,
> > build & install btrace(8) and set kern.allowdt=1 in /etc/sysctl.conf.
> > After rebooting in the new kernel run the following:
> >
> > btrace -e 'profile:hz:15 { printf("%s1\n", kstack); }' > kstack.txt
> >
> > 
> >
> > and it Ctrl+C to stop the profiling.
> >
> > Then you can build the Flamegraph with the tools described below or
> > provide us the captured stack traces:
> > https://github.com/brendangregg/FlameGraph
> 
>

Re: Fwd: Re: bird crashes kernel

2020-04-02 Thread Martin Pieuchot

On 02/04/20(Thu) 16:22, Bastien Durel wrote:
> Hello,
> 
> Here is the initial report I made on misc@ about a kernel panic
> triggered by route removal by bird (bird-2.0.6 from ports)

This should be fixed in -current by a commit krw@ did back in November,
could you test a snapshot and see?

Cheers,
Martin

Re: Fwd: Re: bird crashes kernel

2020-04-02 Thread Martin Pieuchot

On 02/04/20(Thu) 18:30, Bastien Durel wrote:
> Le jeudi 02 avril 2020 à 17:15 +0200, Martin Pieuchot a écrit :
> > On 02/04/20(Thu) 16:22, Bastien Durel wrote:
> > > Hello,
> > > 
> > > Here is the initial report I made on misc@ about a kernel panic
> > > triggered by route removal by bird (bird-2.0.6 from ports)
> > 
> > This should be fixed in -current by a commit krw@ did back in
> > November,
> > could you test a snapshot and see?
> > 
> I prefer not to run my main router on a snapshot, but a VM crashes too
> when stopping bird with 6.6-stable, and indeed does not when running
> the last snapshot.

Thanks for confirming.

> But after stopping bird, the network is down (I cannot even ping the
> gateway, although dhcp works) -- and now the VM does no boot anymore :/

If you don't mind, please send a separate mail with all the details since
that is not the same issue.

Re: 6.6-current stutters after heavy disk loads

2020-04-03 Thread Martin Pieuchot

On 02/04/20(Thu) 18:40, Martin wrote:
> Before starting the video 2017 bsdcon, disabled all the packages software on 
> both AMD and i7 and run mpv player and test both machines.

What do you mean?  Which software is running?  What do you see in
"top -SH -s .3"?

> Shutters on both platforms happened when APM change low CPU frequency to 
> high. Maybe it's an apmd issue?

No it is not, it is just a symptom.

Please let's stick to the original question: which piece of software are
you running when you see the stutters.  Is it mpv(1)?

When running mpv(1) do you see high IPIs?  If so did you ktrace(1) it?

Re: 6.6-current stutters after heavy disk loads

2020-04-03 Thread Martin Pieuchot

On 03/04/20(Fri) 09:40, Martin wrote:
> Hello, Martin.
> [...]
> When I run mpv and try to watch 720p video. In case of stutters after some 
> time of watching audio flow desyncronized with video flow and mpv show video 
> FPS/2 rate afterwards.
> 
> Each time of stutter mpv increase 'Dropped' like
> 
> A-V: 0.000 Dropped: 58++ Cahce: 1378s+154MB

Ok so the piece of software is mpv(1).

> I did ktrace for mpv process. I run and see by 'kdump -H ktrace.out' that it 
> has
> one process ID and / mostly one-three thread used.
> But sometimes (assuming in stutter times) it jumping against treads with 
> different numbers.

Could you upload the output of kdump -H somewhere such that I could look
at it or compress it and send it?

> Yes, IPI increased to 900-1000 when stutter appears.
> 
> I'm going to disable step-by step each 'out of the box' software to determine 
> the reason. Am I right doing this way?

I believe it isn't necessary.  From what you are saying it seems that
mpv(1) alone is the piece of software exposing the issue.

There are multiple possible reasons for IPIs, but if the high rate
you're seeing is exposed by mpv(1), it would suggest they are related
to scheduling.

By looking at the output of ktrace(1) we should have a better
understanding of what is happening in userland.  I'd suggest you also do
the btrace(8) profiling so we can see which code path in the kernel is
responsible for the IPIs.  This should allow us to work with facts and
not guesses.

Thanks for your efforts,
Martin

Re: [macppc] GENERIC.MP panics under high load

2020-04-06 Thread Martin Pieuchot

On 06/04/20(Mon) 16:54, Charlene Wendling wrote:
> On Wed, 1 Apr 2020 20:27:54 +0200
> Charlene Wendling wrote:
> 
> I've got another one, still with:
> 
> OpenBSD 6.6-current (GENERIC.MP) #692: Sat Mar 21 10:19:57 MDT 2020
> dera...@macppc.openbsd.org:/usr/src/sys/arch/macppc/compile/GENERIC.MP

Trying a WITNESS kernel might give us more information.  That would
require somebody implementing stacktrace_save() for powerpc.

Re: arpresolve: XX: route contains no arp information

2020-04-07 Thread Martin Pieuchot

Thanks for your report.

On 07/04/20(Tue) 16:04, Laurent Salle wrote:
> On 06/04/2020 14.36, Laurent Salle wrote:
> > If you wish, I may do some more test the next time the problem occurs.
> 
> I've done more tests.
> 
> This time, I've noticed the following message on the console "arpresolve:
> unresolved and rt_expire == 0"

It's the same bug as reported by sthen@.  Two interfaces in the same subnet
have two identical cloning routes:

> 192.168.1/24   192.168.1.4UCn1  887 - 4 em0
> 192.168.1/24   192.168.1.18   UCn1   47 - 8 iwm0

ARP entries are "cloned" from one of these two.  It should be only one
at a time, obviously the one with higher priority.

Then I believe dhclient(8) inserts default routes for both interfaces:

> default192.168.1.254  UGS526091 - 8 em0
> default192.168.1.254  UGS0 1671 -12 iwm0

One of these routes, the one with higher priority, is picked when 
sending packets to "8.8.8.8", now the kernel needs to find the ARP entry
corresponding to "192.168.1.254":

> 192.168.1.254  f4:ca:e5:55:0d:2d  UHLc   0 1108 - 3 em0
> 192.168.1.254  link#1 UHLch  1 1468 - 7 iwm0

First question is why one entry is cached ('h') when the other isn't?  Both
should be.

Second question is why the entry on iwm0 is returned when the query is
done on em0. 

Answering those questions should be enough to fix the bug :o)
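
The expected behaviour described above can be modelled in a few lines of
userland C.  This is only a sketch under stated assumptions: the structures
and the functions pick_route() and l2_lookup() are hypothetical stand-ins,
not the kernel's rtable code, and the interface indexes are invented.  It
relies on the fact that a numerically lower route priority is preferred,
and it looks up the L2 entry on the interface of the chosen route only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for kernel routing state. */
struct model_route {
	uint32_t	gateway;	/* next hop, host byte order */
	unsigned int	ifidx;		/* outgoing interface index */
	unsigned int	prio;		/* lower value is preferred */
};

struct model_l2 {
	uint32_t	addr;		/* IP address the entry resolves */
	unsigned int	ifidx;		/* interface the entry belongs to */
	int		resolved;	/* non-zero once a MAC is known */
};

/* Pick the preferred route among the candidates for one destination. */
const struct model_route *
pick_route(const struct model_route *rts, size_t n)
{
	const struct model_route *best = NULL;
	size_t i;

	for (i = 0; i < n; i++)
		if (best == NULL || rts[i].prio < best->prio)
			best = &rts[i];
	return best;
}

/* Find the L2 entry for addr on exactly the given interface. */
const struct model_l2 *
l2_lookup(const struct model_l2 *tab, size_t n, uint32_t addr,
    unsigned int ifidx)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (tab[i].addr == addr && tab[i].ifidx == ifidx)
			return &tab[i];
	return NULL;
}
```

With two default routes via 192.168.1.254 (priority 8 on one interface,
12 on the other), pick_route() returns the priority 8 route, and
l2_lookup() with that route's interface index can never hand back the
unresolved entry that belongs to the other interface.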

Re: Crash while using ospfd over vxlan

2020-04-10 Thread Martin Pieuchot

On 09/04/20(Thu) 16:10, Massimiliano Stucchi wrote:
> >Synopsis:Crash while using ospfd over vxlan
> >Category:bug
> >Environment:
>   System  : OpenBSD 6.6
>   Details : OpenBSD 6.6 (GENERIC.MP) #5: Sun Feb 16 01:56:11 MST 2020
>   
> r...@syspatch-66-amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> 
>   Architecture: OpenBSD.amd64
>   Machine : amd64
> >Description:
>   Setting up an OSPF session over VXLAN leads to a kernel crash
> >How-To-Repeat:
> 
> I have setup an ospf session over a vxlan interface.  When this is up,
> it takes about 2-3 minutes for the crash to consistently happen.
> 
> No other action is necessary.
> 
> At this address:
> 
> https://max.stucchi.ch/bugreport/
> 
> you can find screenshots from the ddb prompt, including a full trace.
> 
> If needed, I can also provide access to the console.

It's a recursion.  I don't know anything about vxlan(4) or how the
encapsulation works but the following happens at least 10 times:

...
vxlan_lookup()
udp_input()
ip_deliver()
ip_ours()
ip_input_if()
ipv4_input()
ether_input()
if_vinput()
vxlan_lookup()
...

Maybe you can share your setup (vxlan config, ospf config, etc) so
somebody can try to reproduce and fix it.
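
For reference, spotting this kind of recursion in a long trace can be done
mechanically.  The C sketch below is an illustration only; count_frame()
and cycle_length() are made-up helper names, and the trace is assumed to be
available as an array of function names, innermost frame first.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count how many frames in the trace carry the given function name. */
size_t
count_frame(const char *const *frames, size_t n, const char *name)
{
	size_t i, hits = 0;

	assert(frames != NULL && name != NULL);
	for (i = 0; i < n; i++)
		if (strcmp(frames[i], name) == 0)
			hits++;
	return hits;
}

/*
 * Return the number of frames between the first two occurrences of
 * name, i.e. the length of one recursion cycle, or 0 if the function
 * shows up fewer than two times.
 */
size_t
cycle_length(const char *const *frames, size_t n, const char *name)
{
	size_t i, first = n;

	assert(frames != NULL && name != NULL);
	for (i = 0; i < n; i++) {
		if (strcmp(frames[i], name) != 0)
			continue;
		if (first == n)
			first = i;
		else
			return i - first;
	}
	return 0;
}
```

Fed with the excerpt above, vxlan_lookup is found twice with eight frames
per cycle, which is what one would expect if a decapsulated packet is
handed back to the same vxlan(4) input path.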

Re: [pc engines apu1d4] kernel crash periodically

2020-04-10 Thread Martin Pieuchot

Hello Pascal,

On 09/04/20(Thu) 16:10, Pascal Cabaud wrote:
> > Synopsis:   Once a day, my APU1D4 used to crash
> > Category:   kernel
> > Environment:
>   System  : OpenBSD 6.6
>   Details : OpenBSD 6.6 (GENERIC.MP) #372: Sat Oct 12 10:56:27 MDT 
> 2019
>
> dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> 
>   Architecture: OpenBSD.amd64
>   Machine : amd64
> > Description:
>   Periodically, my AP1D4 (pcengines.ch) used to crash. Here, the last
>   crash :

Thanks for the report.  This looks to me like a memory corruption in
the kernel.  This is hard to understand because the panic(9) only shows
the symptom, not the actual bug.

I see in the ps output below that you're running quite a few different
pieces of software (relayd, collectd, etc).  Could you try disabling
one for some time and see if the crash disappears?  This would be an
indication that that particular piece of software exposes the bug.

If the crash doesn't disappear with that particular piece of software
disabled, you can enable it again and disable the next one :o)

Thanks!

> Stopped at  pool_gc_pages+0x67: movq0x10(%rax),%r11
> ddb{0}> show panic
> kernel page fault
> uvm_fault(0x81f508f8, 0x9cff81d46840, 0, 1) -> e
> pool_gc_pages(0) at pool_gc_pages+0x67
> end trace frame: 0x800022043c20, count: 0
> ddb{0}> trace
> pool_gc_pages(0) at pool_gc_pages+0x67
> taskq_thread(81f1f948) at taskq_thread+0x4d
> end trace frame: 0x0, count: -2
> ddb{0}> mach ddbcpu 1
> Stopped at  x86_ipi_db+0x12:leave
> ddb{1}> show panic
> kernel page fault
> uvm_fault(0x81f508f8, 0x9cff81d46840, 0, 1) -> e
> x86_ipi_db(800022000ff0) at x86_ipi_db+0x12
> end trace frame: 0x80002204f3e0, count: 0
> ddb{1}> trace
> x86_ipi_db(800022000ff0) at x86_ipi_db+0x12
> x86_ipi_handler() at x86_ipi_handler+0x80
> Xresume_lapic_ipi(0,0,1388,0,8007b4e0,8000220016f8) at 
> Xresume_lapi
> c_ipi+0x23
> acpicpu_idle() at acpicpu_idle+0x14d
> sched_idle(800022000ff0) at sched_idle+0x225
> end trace frame: 0x0, count: -5
> ddb{1}> ps
>PID TID   PPIDUID  S   FLAGS  WAIT  COMMAND
>  95276   13988  1  0  30x80  kqreadrelayd
>  26963   27946  1 89  30x100092  kqreadrelayd
>  55989   95900  1 89  30x100092  kqreadrelayd
>   9916   26805  1 89  30x100092  kqreadrelayd
>  95143  161715  1 89  30x100092  kqreadrelayd
>  40331  479376  1 89  30x100092  kqreadrelayd
>  30572  497829  1 89  30x100092  kqreadrelayd
>  66569  297122  1 89  30x100092  kqreadrelayd
>  581746892  1 89  30x100092  kqreadrelayd
>   9074  397819  1 89  30x100092  kqreadrelayd
>   3400  372150  1 89  30x100092  kqreadrelayd
>  20384  188200  1 89  30x100092  kqreadrelayd
>  35091  435040  1 89  30x100092  kqreadrelayd
>  61458   99434  70149   1000  30x100083  ttyin more
>  70149  217271  94483   1000  30x100083  wait  man
>  94483  387711   6290   1000  30x10008b  pause ksh
>  70627  315246  76360  0  30x100083  piperdgrep
>  34778  359396  76360  0  30x100083  kqreadtail
>  76360  498368  91100  0  30x10008b  pause ksh
>  91100  335621   6290   1000  30x10008b  pause ksh
>  36467  185961  75892  0  30x100083  ttyin ksh
>  75892  445733   6290   1000  30x10008b  pause ksh
>  44772   43154  53744  0  30x100083  ttyin ksh
>  53744   87181   6290   1000  30x10008b  pause ksh
>   6290  420923  1   1000  30x100080  kqreadtmux
>  32341  184709  6   1000  30x100083  kqreadtmux
>  6  387915  87108   1000  30x10008b  pause ksh
>  87108  302555  62336   1000  30x90  selectsshd
>  62336  287615  11348  0  30x92  poll  sshd
>   7752  285805  99541 83  30x100092  poll  ntpd
>  99541  162749  99451 83  30x100092  poll  ntpd
>  99451  435657  1  0  30x100080  poll  ntpd
>  93391   41680  60584 53  30x90  kqreadunbound
>  60584  394085  1 53  30x90  kqreadunbound
>  44674  492072  13228   1000  30x1000b3  poll  ping
>  759728907  1750  30x81  nanosleep perl
>  85218   71844  0  0  3 0x14200  bored sosplice
>  13228  249507  1   1000  30x10008b  pause ksh
>   6956  109690  1  0  30x100098  poll  cron
>  11885  194741  1  0  30x80  nanosleep collectd
>  11885   75930  1  0  3   0x480  fsleepcollectd
>  11885  472585  1  0  3   0x40

Re: arpresolve: XX: route contains no arp information

2020-04-10 Thread Martin Pieuchot

On 09/04/20(Thu) 20:22, Laurent Salle wrote:
> On 08/04/2020 06.52, Martin Pieuchot wrote:
> 
> > It's the same bug as reported by sthen@.  Two interfaces in the same subnet
> > have two identical cloning routes:
> 
> I've been able to reproduce systematically the problem with an OpenBSD
> virtual machine running the latest snapshot and two vio interface with
> different priority connected to the same lan with dhcp.

Thanks for the report!  Diff below seems to fix the issue here, could
you try it?

Index: netinet/if_ether.c
===
RCS file: /cvs/src/sys/netinet/if_ether.c,v
retrieving revision 1.242
diff -u -p -r1.242 if_ether.c
--- netinet/if_ether.c  7 Nov 2019 11:23:23 -   1.242
+++ netinet/if_ether.c  10 Apr 2020 08:45:42 -
@@ -559,6 +559,23 @@ in_arpinput(struct ifnet *ifp, struct mb
 
KERNEL_LOCK();
error = arpcache(ifp, ea, rt);
+   if (error == 0 && ISSET(rt->rt_flags, RTF_CACHED)) {
+   /*
+* RTF_CACHED entry are not deleted as long as
+* their parent gateway route is alive, so make
+* sure to update its sibling which might be on
+* a different interface to not leave them as
+* unresolved.
+*/
+   while ((rt = rtable_iterate(rt)) != NULL) {
+   struct ifnet *ifp0;
+
+   ifp0 = if_get(rt->rt_ifidx);
+   if (ifp0 != NULL)
+   error = arpcache(ifp0, ea, rt);
+   if_put(ifp0);
+   }
+   }
KERNEL_UNLOCK();
if (error)
goto out;

Re: 6.6-current stutters after heavy disk loads

2020-04-10 Thread Martin Pieuchot

On 10/04/20(Fri) 09:42, Martin wrote:
> Have you found anything regarding the issue?

No I haven't.

> Now I have time to add dt(4) in conf/GENERIC build & install a new kernel, 
> build & install btrace(8) and set kern.allowdt=1 in /etc/sysctl.conf.
> 
> Looks like dt(4) is a part of -current, but I can't move to -current right 
> now. I'm going to do it once 6.7 is released.

Thanks,
Martin

Re: arpresolve: XX: route contains no arp information

2020-04-10 Thread Martin Pieuchot

On 10/04/20(Fri) 11:18, Claudio Jeker wrote:
> On Fri, Apr 10, 2020 at 10:47:53AM +0200, Martin Pieuchot wrote:
> > On 09/04/20(Thu) 20:22, Laurent Salle wrote:
> > > On 08/04/2020 06.52, Martin Pieuchot wrote:
> > > 
> > > > It's the same bug as reported by sthen@.  Two interfaces in the same 
> > > > subnet
> > > > have two identical cloning routes:
> > > 
> > > I've been able to reproduce systematically the problem with an OpenBSD
> > > virtual machine running the latest snapshot and two vio interface with
> > > different priority connected to the same lan with dhcp.
> > 
> > Thanks for the report!  Diff below seems to fix the issue here, could
> > you try it?
> 
> I'm not convinced that this is the right solution. In your diff you insert
> the MAC received on one interface into the arp node of another interface.
> This feels wrong, arp entries should never cross over interfaces.
> For example if for some reasons the two interfaces have the same gateway
> IP but use different MACs for that IP then this breaks.

Makes sense.

Well it looks like when the default route on if0 tries to use the L2
route underneath it, the ARP layer resolves the entry on if1 instead of
on if0.

The route on if0 is being used because it has higher priority, however
the L2 entry on if1 has been inserted first.  I haven't debugged
further.

Re: arpresolve: XX: route contains no arp information

2020-04-10 Thread Martin Pieuchot

On 10/04/20(Fri) 13:19, Claudio Jeker wrote:
> On Fri, Apr 10, 2020 at 12:14:17PM +0200, Martin Pieuchot wrote:
> > On 10/04/20(Fri) 11:18, Claudio Jeker wrote:
> > > On Fri, Apr 10, 2020 at 10:47:53AM +0200, Martin Pieuchot wrote:
> > > > On 09/04/20(Thu) 20:22, Laurent Salle wrote:
> > > > > On 08/04/2020 06.52, Martin Pieuchot wrote:
> > > > > 
> > > > > > It's the same bug as reported by sthen@.  Two interfaces in the 
> > > > > > same subnet
> > > > > > have two identical cloning routes:
> > > > > 
> > > > > I've been able to reproduce systematically the problem with an OpenBSD
> > > > > virtual machine running the latest snapshot and two vio interface with
> > > > > different priority connected to the same lan with dhcp.
> > > > 
> > > > Thanks for the report!  Diff below seems to fix the issue here, could
> > > > you try it?
> > > 
> > > I'm not convinced that this is the right solution. In your diff you insert
> > > the MAC received on one interface into the arp node of another interface.
> > > This feels wrong, arp entries should never cross over interfaces.
> > > For example if for some reasons the two interfaces have the same gateway
> > > IP but use different MACs for that IP then this breaks.
> > 
> > Makes sense.
> > 
> > Well it looks like when the default route on if0 tries to use the L2
> > route underneath it, the ARP layer resolve the entry on if1 instead of
> > on if0.
> > 
> > The route on if0 is being used because it has higher priority, however
> > the L2 entry on if1 has been inserted first.  I haven't debugged
> > further.
> 
> Yes, this comes from the fact that rtalloc() will find the gw route of the
> wrong interface and not clone a new entry from the other interface and so
> the rt_gwroute cache is all messed up.

Do you know which particular rtalloc(9) we're talking about?

Re: [pc engines apu1d4] kernel crash periodically

2020-04-10 Thread Martin Pieuchot

On 10/04/20(Fri) 12:18, Pascal Cabaud wrote:
> Hello Martin,
> 
> Thanks for searching bits in this bug report. Are the APU really popular
> to run OpenBSD or is there a problem with them: AFAICS, i've found many
> reports in archives...

The problem is unlikely to be related to the hardware.

> Le 2020-04-10 09:58, Martin Pieuchot disait :
> > Thanks for the report.  This looks to me like a memory corruption in
> > the kernel.  This is hard to understand because the panic(9) only show
> > the symptom, not the actual bug.
> 
> Ok, I'm recording console with GNU Screen. Let's wait...
> 
> To play with daemons, it'll be more difficult, i've to find backup
> hardware first.

I understand, however I wouldn't be surprised if the issue is exposed by
one of the daemons doing a lot of stuff and dealing with network, relayd
or collectd maybe.

Re: splassert w/ add/del vlan on bridge

2020-04-11 Thread Martin Pieuchot

On 11/04/20(Sat) 23:09, David Gwynne wrote:
> On Sat, Apr 11, 2020 at 03:21:49AM +, Visa Hankala wrote:
> > On Fri, Apr 10, 2020 at 01:30:47PM -0600, Theo de Raadt wrote:
> > > Why did it take almost a year to find this?
> > > 
> > > Or is this bug due to ioctl(2) becoming UNLOCKED on 2020/02/22?
> > 
> > This is not related to ioctl(2) becoming UNLOCKED. Lower-layer ioctl
> > code, soo_ioctl() included, lock the kernel when needed. However, most
> > .if_ioctl backends need NET_LOCK() in addition to KERNEL_LOCK(). In
> > most cases, that is satisfied by ifioctl() which acquires the lock
> > before invoking .if_ioctl(). bridge_ioctl() nullifies this by
> > releasing NET_LOCK().
> 
> yes.
> 
> i came up with the following diff before i read the thread here. it's
> largely identical to what you (visa) already came up with, but it adds
> some extra checks to ifpromisc based on the doco in around struct ifnet
> members in src/sys/net/if_var.h. i audited the rest of the ifpromisc
> calls and found another one in if_aggr that i was able to trigger.

The documentation says `if_pcount' is protected by the KERNEL_LOCK() but
in fact it is only read & modified in ifpromisc().

So I'd suggest fixing the documentation and not adding another assert there.
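
To make the point concrete, here is a simplified userland model of the
counting done in ifpromisc().  It is a sketch, not the kernel function:
struct model_ifnet, MODEL_IFF_PROMISC and the return convention (1 when the
flag actually changed) are assumptions made for the example, and the driver
call and error handling are left out.  The counter and the flag are only
touched together, in this one function, so whichever lock callers hold
around it covers both.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_IFF_PROMISC	0x100	/* illustrative flag bit */

/* Hypothetical stand-in for the two struct ifnet members involved. */
struct model_ifnet {
	unsigned short	if_flags;
	int		if_pcount;	/* number of promiscuous requests */
};

/*
 * Reference-counted promiscuous mode switch.  The flag is set by the
 * first "on" request and cleared by the last "off" request.
 * Returns 1 if the flag changed, 0 otherwise.
 */
int
model_ifpromisc(struct model_ifnet *ifp, int pswitch)
{
	assert(ifp != NULL);
	if (pswitch) {
		if (ifp->if_pcount++ != 0)
			return 0;	/* already promiscuous */
		ifp->if_flags |= MODEL_IFF_PROMISC;
	} else {
		assert(ifp->if_pcount > 0);
		if (--ifp->if_pcount > 0)
			return 0;	/* other users remain */
		ifp->if_flags = (unsigned short)
		    (ifp->if_flags & ~MODEL_IFF_PROMISC);
	}
	return 1;
}
```

Two "on" requests followed by two "off" requests toggle the flag exactly
twice, once on the first request and once on the last.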

> i think the only other call to ifpromisc outside src/sys/net is in carp,
> and i managed to convinced myself that all those calls hold NET_LOCK
> already.
> 
> Index: if.c
> ===
> RCS file: /cvs/src/sys/net/if.c,v
> retrieving revision 1.601
> diff -u -p -r1.601 if.c
> --- if.c  10 Mar 2020 09:11:55 -  1.601
> +++ if.c  11 Apr 2020 13:08:46 -
> @@ -3031,7 +3031,9 @@ ifpromisc(struct ifnet *ifp, int pswitch
>   unsigned short oif_flags;
>   int oif_pcount, error;
>  
> + NET_ASSERT_LOCKED(); /* modifying if_flags */
>   oif_flags = ifp->if_flags;
> + KERNEL_ASSERT_LOCKED(); /* modifying if_pcount */
>   oif_pcount = ifp->if_pcount;
>   if (pswitch) {
>   if (ifp->if_pcount++ != 0)
> Index: if_aggr.c
> ===
> RCS file: /cvs/src/sys/net/if_aggr.c,v
> retrieving revision 1.28
> diff -u -p -r1.28 if_aggr.c
> --- if_aggr.c 11 Mar 2020 07:01:42 -  1.28
> +++ if_aggr.c 11 Apr 2020 13:08:46 -
> @@ -589,8 +589,10 @@ aggr_clone_destroy(struct ifnet *ifp)
>   if_detach(ifp);
>  
>   /* last ref, no need to lock. aggr_p_dtor locks anyway */
> + NET_LOCK();
>   while ((p = TAILQ_FIRST(&sc->sc_ports)) != NULL)
>   aggr_p_dtor(sc, p, "destroy");
> + NET_UNLOCK();
>  
>   free(sc, M_DEVBUF, sizeof(*sc));
>  
> Index: if_bridge.c
> ===
> RCS file: /cvs/src/sys/net/if_bridge.c,v
> retrieving revision 1.338
> diff -u -p -r1.338 if_bridge.c
> --- if_bridge.c   6 Nov 2019 03:51:26 -   1.338
> +++ if_bridge.c   11 Apr 2020 13:08:46 -
> @@ -313,7 +313,9 @@ bridge_ioctl(struct ifnet *ifp, u_long c
>   break;
>   }
>  
> + NET_LOCK();
>   error = ifpromisc(ifs, 1);
> + NET_UNLOCK();
>   if (error != 0) {
>   free(bif, M_DEVBUF, sizeof(*bif));
>   break;
> @@ -558,7 +560,9 @@ bridge_ifremove(struct bridge_iflist *bi
>   }
>  
>   bif->ifp->if_bridgeidx = 0;
> + NET_LOCK();
>   error = ifpromisc(bif->ifp, 0);
> + NET_UNLOCK();
>  
>   bridge_rtdelete(sc, bif->ifp, 0);
>   bridge_flushrule(bif);
> Index: if_tpmr.c
> ===
> RCS file: /cvs/src/sys/net/if_tpmr.c,v
> retrieving revision 1.9
> diff -u -p -r1.9 if_tpmr.c
> --- if_tpmr.c 11 Apr 2020 11:01:03 -  1.9
> +++ if_tpmr.c 11 Apr 2020 13:08:46 -
> @@ -201,12 +201,14 @@ tpmr_clone_destroy(struct ifnet *ifp)
>  
>   if_detach(ifp);
>  
> + NET_LOCK();
>   for (i = 0; i < nitems(sc->sc_ports); i++) {
>   struct tpmr_port *p = SMR_PTR_GET_LOCKED(&sc->sc_ports[i]);
>   if (p == NULL)
>   continue;
>   tpmr_p_dtor(sc, p, "destroy");
>   }
> + NET_UNLOCK();
>  
>   free(sc, M_DEVBUF, sizeof(*sc));
>  
>

Re: i915/drm vs WITNESS

2020-04-14 Thread Martin Pieuchot

On 26/02/20(Wed) 17:39, Mark Kettenis wrote:
> > Date: Wed, 12 Feb 2020 15:24:46 +0100
> > From: Martin Pieuchot 
> 
> Haven't forgotten about these.

The following are still present on 6.7-beta built from today's sources:

witness: lock order reversal:
 1st 0x8136a880 &rq->lock (&rq->lock)
 2nd 0x806a9050 rcs0 (&timeline->lock)
lock order "&timeline->lock"(mutex) -> "&rq->lock"(mutex) first seen at:
#0  witness_checkorder+0x449
#1  mtx_enter+0x34
#2  __i915_request_submit+0x5b
#3  __execlists_submission_tasklet+0x1b9
#4  execlists_submit_request+0x1d1
#5  submit_notify+0x37
#6  __i915_sw_fence_complete+0x40
#7  i915_request_add+0x2d3
#8  i915_gem_init+0x2b9
#9  i915_driver_load+0x815
#10 inteldrm_attachhook+0x2c
#11 config_process_deferred_mountroot+0x6b
#12 main+0x75a
lock order "&rq->lock"(mutex) -> "&timeline->lock"(mutex) first seen at:
#0  witness_checkorder+0x449
#1  mtx_enter+0x34
#2  execlists_submit_request+0x2a
#3  submit_notify+0x37
#4  __i915_sw_fence_complete+0x40
#5  dma_i915_sw_fence_wake+0x1d
#6  notify_ring+0x1a8
#7  gen8_gt_irq_handler+0xba
#8  gen8_irq_handler+0x114
#9  intr_handler+0x6e
#10 Xintr_ioapic_edge16_untramp+0x19f
#11 acpicpu_idle+0x1d2
#12 sched_idle+0x225
#13 proc_trampoline+0x1c
witness: lock order reversal:
 1st 0x8136b150 &wqh->lock (&wqh->lock)
 2nd 0x806a9050 rcs0 (&timeline->lock)
lock order "&wqh->lock"(mutex) -> "&timeline->lock"(mutex) first seen at:
#0  witness_checkorder+0x449
#1  mtx_enter+0x34
#2  execlists_submit_request+0x2a
#3  submit_notify+0x37
#4  __i915_sw_fence_complete+0x40
#5  i915_sw_fence_wake+0x39
#6  __i915_sw_fence_complete+0x131
#7  dma_i915_sw_fence_wake+0x1d
#8  notify_ring+0x1a8
#9  gen8_gt_irq_handler+0xba
#10 gen8_irq_handler+0x114
#11 intr_handler+0x6e
#12 Xintr_ioapic_edge16_untramp+0x19f
witness: acquiring duplicate lock of same type: "&wqh->lock"
 1st &wqh->lock
 2nd &wqh->lock
Starting stack trace...
witness_checkorder(8136bc30,9,0) at witness_checkorder+0x6ba
mtx_enter(8136bc20) at mtx_enter+0x34
__i915_sw_fence_complete(8136bc20,800033a6fc70) at __i915_sw_fence_complete+0x58
i915_sw_fence_wake(8136bc78,1,0,800033a6fc70) at i915_sw_fence_wake+0x39
__i915_sw_fence_complete(8136b140,0) at __i915_sw_fence_complete+0x131
dma_i915_sw_fence_wake(8136a008,8137f420) at dma_i915_sw_fence_wake+0x1d
notify_ring(80a84000) at notify_ring+0x1a8
gen8_gt_irq_handler(80154000,2,800033a6fdb0) at gen8_gt_irq_handler+0xba
gen8_irq_handler(0,80154078) at gen8_irq_handler+0x114
intr_handler(800033a6fe50,80144d80) at intr_handler+0x6e
Xintr_ioapic_edge16_untramp() at Xintr_ioapic_edge16_untramp+0x19f
end of kernel
end trace frame: 0x7f7def90, count: 246
End of stack trace.
witness: lock order reversal:
 1st 0x8136bb88 &rq->lock (&rq->lock)
 2nd 0x80a84050 bcs0 (&timeline->lock)
lock order "&timeline->lock"(mutex) -> "&rq->lock"(mutex) first seen at:
#0  witness_checkorder+0x449
#1  mtx_enter+0x34
#2  __i915_request_submit+0x5b
#3  __execlists_submission_tasklet+0x1b9
#4  execlists_submit_request+0x1d1
#5  submit_notify+0x37
#6  __i915_sw_fence_complete+0x40
#7  i915_request_add+0x2d3
#8  i915_gem_init+0x2cb
#9  i915_driver_load+0x815
#10 inteldrm_attachhook+0x2c
#11 config_process_deferred_mountroot+0x6b
#12 main+0x75a
lock order "&rq->lock"(mutex) -> "&timeline->lock"(mutex) first seen at:
#0  witness_checkorder+0x449
#1  mtx_enter+0x34
#2  execlists_submit_request+0x2a
#3  submit_notify+0x37
#4  __i915_sw_fence_complete+0x40
#5  dma_i915_sw_fence_wake+0x1d
#6  notify_ring+0x1a8
#7  gen8_gt_irq_handler+0x55
#8  gen8_irq_handler+0x114
#9  intr_handler+0x6e
#10 Xintr_ioapic_edge16_untramp+0x19f
#11 Xspllower+0x19
#12 mtx_enter_try+0x98
#13 mtx_enter+0x4a
#14 i915_vma_move_to_active+0x427
#15 i915_gem_do_execbuffer+0xb09
#16 i915_gem_execbuffer2_ioctl+0x144
#17 drmioctl+0xdc
#18 VOP_IOCTL+0x55

Re: Intermittent crashes on 6.5-stable with PC Engines APU2D4

2020-04-14 Thread Martin Pieuchot

On 14/10/19(Mon) 16:17, Alexander Bluhm wrote:
> On Fri, Oct 11, 2019 at 01:19:02PM +, Lévai, Dániel wrote:
> > uvm_fault(0xfd8124d90960, 0x7f884cecdcf8, 0, 2) -> e
  ^^
Do I understand correctly that the faulting page is 0x7f884cecd000?
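
For reference, the page that contains a faulting address is found by clearing the in-page offset bits.  A minimal sketch, assuming 4 KiB pages as on amd64; `page_of()' and the SKETCH_* macros are illustrative names, not kernel symbols:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, so the low 12 bits are the in-page offset. */
#define SKETCH_PAGE_SIZE	0x1000ULL
#define SKETCH_PAGE_MASK	(SKETCH_PAGE_SIZE - 1)

/* Hypothetical helper: round a virtual address down to its page base. */
static uint64_t
page_of(uint64_t va)
{
	return va & ~SKETCH_PAGE_MASK;
}
```

Applied to the second argument of the uvm_fault() line quoted above, this gives the page address asked about.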

PTE_BASE corresponds to 0x7f80, the VA in the fault above should
be 0x84cecdcf8000, in bluhm@'s report 0x27ea48908000.

Both reports involve multi-threaded programs.

Alexander what is the CPU of the machine where you can reproduce the
bug?

Are we trying to understand how a page storing PTEs can generate a
fault?  Is it what the traces say or am I completely on a wrong track?

> > kernel: page fault trap, code=0
> > Stopped at  pmap_page_remove+0x210: xchgq   %rax,0(%rcx,%rdx,1)
> 
> > ddb{3}> trace
> > pmap_page_remove(fd800975d480) at pmap_page_remove+0x210
> > uvm_anfree(fd8125d62b10) at uvm_anfree+0x36
> > amap_wipeout(fd8123d95170) at amap_wipeout+0xe5
> > uvm_unmap_detach(800022420fe8,0) at uvm_unmap_detach+0x90
> > sys_munmap(800022233cb8,800022421060,8000224210d0) at sys_munmap+0x11d
> > syscall(800022421140) at syscall+0x305
> > Xsyscall(6,49,109a8d931e10,49,109a58e72150,1099d9b9f000) at Xsyscall+0x128
> > end of kernel
> > end trace frame: 0x109a82dffa50, count: -7
> 
> I see this bug for a while now.
> 
> https://marc.info/?l=openbsd-bugs&m=156399483018833&w=2
> 
> I can trigger it by running /usr/src/regress/lib/libpthread/malloc_duel
> for some hours.  Moritz Buhl has tried to bisect the problem and
> it appears to exists since January 2019.  But it is hard to be sure
> as reproducing takes a while.  It is also unclear whether the change
> in behavior is caused by compiler, kernel, libc, libpthread or
> malloc_duel.  We could not trigger it with OpenBSD 6.4.
> 
> bluhm
>

Re: panic: kernel diagnostic assertion "!ISSET(rt->rt_flags, RTF_LOCAL)" failed

2020-04-20 Thread Martin Pieuchot

Thanks for the report.

On 18/04/20(Sat) 18:17, Julian Brost wrote:
> I encountered a reproducible kernel panic during an accidental IPv6
> misconfiguration. In order to reproduce, the OpenBSD machine must be in
> the same subnet as a router that has fe80::1/64 configured and sends
> IPv6 route advertisements, for example with radvd using this config:
> 
>   interface eth0 {
> AdvSendAdvert on;
> MinRtrAdvInterval 10;
> MaxRtrAdvInterval 30;
> prefix 2001:db8::/64 {
>   AdvOnLink on;
>   AdvAutonomous on;
>   AdvRouterAddr on;
> };
>   };
> 
> With this setup, I was able to to reliably trigger the assertion using
> the following steps:
> 
> - Install Openbsd using 6.6/amd64 install66.iso
>   - IPv4: none
>   - IPv6: autoconf
> - Reboot into system, log in
> - echo inet6 alias fe80::1 64 >> /etc/hostname.vio0
>   # The file now contains the following:
>   #   inet6 autoconf
>   #   inet6 alias fe80::1 64
> - Reboot and log in again
> - ping6 2001::
>   # The exact address doesn't seem to matter, it also doesn't have to
>   # respond or anything. Sometimes this step isn't even necessary as the
>   # panic occurs by itself after the login prompt.
> - Wait a bit (less than a minute in my case) and observe the panic
>
[...]
> vio0: DAD detected duplicate IPv6 address fe80:1::1: NS in/out=0/1, NA in=1
> vio0: DAD complete for fe80:1::1 - duplicate found
> vio0: manual intervention required

Interesting :)

> login: panic: kernel diagnostic assertion "!ISSET(rt->rt_flags,
> RTF_LOCAL)" failed: file "/usr/src/sys/netinet6/nd6.c", line 727

That means some part of the ND code is incorrectly setting an `expire'
value on an entry that is local, and therefore should never expire.

Could you try to reproduce the issue with the diff below?  It should
also panic but points us to the place where the bug is.

Index: netinet6/nd6.c
===
RCS file: /cvs/src/sys/netinet6/nd6.c,v
retrieving revision 1.229
diff -u -p -r1.229 nd6.c
--- netinet6/nd6.c  29 Nov 2019 16:41:01 -  1.229
+++ netinet6/nd6.c  20 Apr 2020 10:07:15 -
@@ -306,6 +306,7 @@ nd6_llinfo_settimer(struct llinfo_nd6 *l
time_t expire = time_uptime + secs;
 
NET_ASSERT_LOCKED();
+   KASSERT(!ISSET(ln->ln_rt->rt_flags, RTF_LOCAL));
 
ln->ln_rt->rt_expire = expire;
if (!timeout_pending(&nd6_timer_to) || expire < nd6_timer_next) {

Re: panic: kernel diagnostic assertion "!ISSET(rt->rt_flags, RTF_LOCAL)" failed

2020-04-21 Thread Martin Pieuchot

On 20/04/20(Mon) 14:27, Julian Brost wrote:
> On 2020-04-20 12:14, Martin Pieuchot wrote:
> >> login: panic: kernel diagnostic assertion "!ISSET(rt->rt_flags,
> >> RTF_LOCAL)" failed: file "/usr/src/sys/netinet6/nd6.c", line 727
> > 
> > That means some part of the ND code is incorrectly setting an `expire'
> > value to an entry that is local, and therefor should never expire.
> > 
> > Could you try to reproduce the issue with the diff below?  It should
> > also panic but points us to the place where the bug is.
> > 
> > [...]
> With the diff applied, this is the panic message:
> 
> starting network
> vio0: DAD detected duplicate IPv6 address fe80:1::1: NS in/out=0/1, NA in=1
> vio0: DAD complete for fe80:1::1 - duplicate found
> vio0: manual intervention required
> reordering libraries:ndp info overwritten for fe80:1::1 by
> 76:fa:d3:57:ec:56 on vio0
> panic: kernel diagnostic assertion "!ISSET(ln->ln_rt->rt_flags,
> RTF_LOCAL)" failed: file "/usr/src/sys/netinet6/nd6.c", line 309
> Stopped at  db_enter+0x10:  popq%rbp
> 
> TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
> *457148  43436  0 0x14000  0x2000  softnet
> db_enter() at db_enter+0x10
> panic() at panic+0x128
> __assert(81c8d6ea,81c94c17,135,81c9fb69) at __assert+0x2b
> nd6_llinfo_settimer(fd803ec6ff00,15180) at nd6_llinfo_settimer+0xdf
> nd6_cache_lladdr(800972a8,800014a496a0,fd803714f874,8,86,0) at nd6_cache_lladdr+0x2be
> nd6_rtr_cache(fd8036ee3000,28,38,86) at nd6_rtr_cache+0x31e
> icmp6_input(800014a499d8,800014a499e4,3a,18) at icmp6_input+0x33d
> ip_deliver(800014a499d8,800014a499e4,3a,18) at ip_deliver+0x1b3
> ip6_input_if(800014a499d8,800014a499e4,29,0,800972a8) at

Thanks, the diff below fixes nd6_rtr_cache().  It was already skipping
static entries; the same should be done for local entries.

I left the panic in there in case there's another place where the bug
can be triggered.

Index: netinet6/nd6.c
===
RCS file: /cvs/src/sys/netinet6/nd6.c,v
retrieving revision 1.229
diff -u -p -r1.229 nd6.c
--- netinet6/nd6.c  29 Nov 2019 16:41:01 -  1.229
+++ netinet6/nd6.c  21 Apr 2020 08:36:01 -
@@ -306,6 +306,7 @@ nd6_llinfo_settimer(struct llinfo_nd6 *l
time_t expire = time_uptime + secs;
 
NET_ASSERT_LOCKED();
+   KASSERT(!ISSET(ln->ln_rt->rt_flags, RTF_LOCAL));
 
ln->ln_rt->rt_expire = expire;
if (!timeout_pending(&nd6_timer_to) || expire < nd6_timer_next) {
@@ -,17 +1112,11 @@ nd6_cache_lladdr(struct ifnet *ifp, stru
 
rt = nd6_lookup(from, 0, ifp, ifp->if_rdomain);
if (rt == NULL) {
-#if 0
-   /* nothing must be done if there's no lladdr */
-   if (!lladdr || !lladdrlen)
-   return NULL;
-#endif
-
rt = nd6_lookup(from, 1, ifp, ifp->if_rdomain);
is_newentry = 1;
} else {
-   /* do nothing if static ndp is set */
-   if (rt->rt_flags & RTF_STATIC) {
+   /* do not overwrite local or static entry */
+   if (ISSET(rt->rt_flags, RTF_STATIC|RTF_LOCAL)) {
rtfree(rt);
return;
}
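
As a side note on the new check: ISSET() with an OR-ed mask is non-zero when either flag is set, which is what lets one test cover both static and local entries.  A small userland sketch; the SK_* flag values and `entry_is_overwritable()' are assumptions for illustration, the real definitions live in <net/route.h>:

```c
#include <assert.h>

/* Same shape as the kernel's ISSET() macro. */
#define ISSET(t, f)	((t) & (f))

/* Flag values assumed for illustration only. */
#define SK_RTF_STATIC	0x800
#define SK_RTF_LOCAL	0x200000

/* Hypothetical helper: may an entry with these flags be overwritten? */
static int
entry_is_overwritable(unsigned int rt_flags)
{
	/* Non-zero if either flag is set, so both kinds are skipped. */
	return !ISSET(rt_flags, SK_RTF_STATIC | SK_RTF_LOCAL);
}
```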

Re: panic: kernel diagnostic assertion "!ISSET(rt->rt_flags, RTF_LOCAL)" failed

2020-04-21 Thread Martin Pieuchot

On 20/04/20(Mon) 15:44, Anton Lindqvist wrote:
> > Index: netinet6/nd6.c
> > ===
> > RCS file: /cvs/src/sys/netinet6/nd6.c,v
> > retrieving revision 1.229
> > diff -u -p -r1.229 nd6.c
> > --- netinet6/nd6.c  29 Nov 2019 16:41:01 -  1.229
> > +++ netinet6/nd6.c  20 Apr 2020 10:07:15 -
> > @@ -306,6 +306,7 @@ nd6_llinfo_settimer(struct llinfo_nd6 *l
> > time_t expire = time_uptime + secs;
> >  
> > NET_ASSERT_LOCKED();
> > +   KASSERT(!ISSET(ln->ln_rt->rt_flags, RTF_LOCAL));
> >  
> > ln->ln_rt->rt_expire = expire;
> > if (!timeout_pending(&nd6_timer_to) || expire < nd6_timer_next) {
> > 
> 
> Also found by syzkaller.
> 
> https://syzkaller.appspot.com/bug?extid=0eb994ff432ae75e3369

Maybe, maybe not.  Since the KASSERT() is in a timer we cannot be sure
the entry has been inserted in the global list by the same code path.

So it's hard to say if this is the same bug.

pty leak or corruption w/ openpty + dup2?

2020-04-29 Thread Martin Pieuchot

The program below is a smaller version of a syzkaller report [0].  After
running it one is left without a usable console.  A second execution will
make openpty(3) pick a different "/dev/tty*" node:

  50361 crashCALL  ioctl(3,PTMGET,0x7f7eda80)
  50361 crashNAMI  "/dev/ptypd"
  50361 crashNAMI  "/dev/ttypd"
  50361 crashNAMI  "/dev/ttypd"
  50361 crashRET   ioctl 0

After some more tries:

  65559 crashCALL  ioctl(3,PTMGET,0x7f7c36a0)
  65559 crashNAMI  "/dev/ptypm"
  65559 crashNAMI  "/dev/ttypm"
  65559 crashNAMI  "/dev/ttypm"
  65559 crashRET   ioctl 0

[0] 
https://syzkaller.appspot.com/bug?id=a74718ca902617e6aa7327aa008b25844eccf2d3

- crash.c -

#include <unistd.h>	/* dup2(2), close(2), write(2) */
#include <util.h>	/* openpty(3) */

int 
main(void)
{
char garbage[100];
int master, slave;

if (openpty(&master, &slave, NULL, NULL, NULL) == -1)
return -1;
if (dup2(master, master + 100) != -1)
close(master);

write(slave, garbage, 99);

return 0;
}

Re: Xorg hangs on recent snapshots

2020-05-01 Thread Martin Pieuchot

Hello Mark,

Thanks for the report.

On 01/05/20(Fri) 16:51, Mark Patruck wrote:
> Problem:
> 
> With amdgpu(4) enabled, everything runs fine and smooth for minutes,
> sometimes hours (especially if you don't start lots of programs), but
> all of a sudden X freezes. That means, you can move your mouse, ssh in,
> also top and other programs are still running, but you have to kill -9
> X, to get back to business. This only applies for Polaris 11-see Results
> below.

Such a 'freeze' is a symptom.  If you can ssh into the machine when that
happens a useful piece of information would be the output of:

# ps -Sx -Owchan

similarly the output of "ps -S" would show where current threads are
blocking.

Another interesting piece of information would be the output of 'dmesg'
at that given moment.  The kernel might have printed some valuable
information when something went wrong.

Maybe /var/log/Xorg.0.log would also contain valuable information.

These pieces of information might help us pinpoint the underlying
problem.

> [...] 
> I know about this thread on freedesktop.org [1], but again...
> before buying sth new, i'd like to know about your findings.
> 
> [1] https://bugs.freedesktop.org/show_bug.cgi?id=105733#c75

Do you know if it's the issue you're experiencing?

Re: pty leak or corruption w/ openpty + dup2?

2020-05-01 Thread Martin Pieuchot

On 01/05/20(Fri) 12:13, Anton Lindqvist wrote:
> The order in which the pty master/slave is closed seems to be the
> trigger here. While not duping the master, it's closed before the slave.
> In the opposite scenario, the slave is closed before the master. While
> closing the slave, it ends up here expressed as a simplified backtrace:
> 
>   tsleep()
>   ttysleep()
>   ttywait()
>   ttywflush()
>   ttylclose()
>   ptsclose()
>   fdfree()
>   exit1()
> 
> In order words, it ends up doing a tsleep(INFSLP) causing the thread to
> hang. Note that this is not the case when the master is closed before
> the slave since `tp->t_oproc == NULL' causing ttywait() to bail early.

Why is the sleeper never awakened?  Does that mean a ttwakeup() is missing?

> NetBSD does a sleep with a timeout in ttywflush(). I've applied the same
> approach in the diff below which does fix the hang.

This seems like a racy workaround for a bug that we do not fully
understand.  If this is a proper solution I'd be happy to understand
why.  If we go with such a fix we should be using a value in "nsecs"
instead of ticks and INFSLP should be used instead of 0.  We should
refrain from introducing new usages of `hz' ;)
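
To make the two conventions concrete, here is a userland sketch of what the conversion amounts to: a tick count of 0 used to mean "no timeout", while the nanosecond interface has an explicit sentinel for that.  SK_INFSLP, SK_NSEC_PER_SEC and `ticks_to_nsecs()' are stand-ins made up for this sketch, not the kernel's definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins modeled on the kernel's names; values are assumptions. */
#define SK_INFSLP	UINT64_MAX		/* "sleep without timeout" */
#define SK_NSEC_PER_SEC	1000000000ULL

/* Hypothetical conversion from a legacy tick count to nanoseconds. */
static uint64_t
ticks_to_nsecs(int timo, int hz)
{
	assert(hz > 0);
	if (timo <= 0)
		return SK_INFSLP;	/* 0 ticks used to mean "no timeout" */
	/* Truncates slightly when hz does not divide one second evenly. */
	return (uint64_t)timo * (SK_NSEC_PER_SEC / (uint64_t)hz);
}
```

New code would pass a nanosecond value directly instead of going through a helper like this, which is the point of avoiding `hz'.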

Re: pty leak or corruption w/ openpty + dup2?

2020-05-02 Thread Martin Pieuchot

On 02/05/20(Sat) 10:40, Anton Lindqvist wrote:
> On Fri, May 01, 2020 at 05:17:36PM +0200, Martin Pieuchot wrote:
> > On 01/05/20(Fri) 12:13, Anton Lindqvist wrote:
> > > The order in which the pty master/slave is closed seems to be the
> > > trigger here. While not duping the master, it's closed before the slave.
> > > In the opposite scenario, the slave is closed before the master. While
> > > closing the slave, it ends up here expressed as a simplified backtrace:
> > > 
> > >   tsleep()
> > >   ttysleep()
> > >   ttywait()
> > >   ttywflush()
> > >   ttylclose()
> > >   ptsclose()
> > >   fdfree()
> > >   exit1()
> > > 
> > > In order words, it ends up doing a tsleep(INFSLP) causing the thread to
> > > hang. Note that this is not the case when the master is closed before
> > > the slave since `tp->t_oproc == NULL' causing ttywait() to bail early.
> > 
> > Why is the sleeper never awaken?  Does that mean a ttwakeup() is missing?
> 
> In this case, the process is single threaded, about to exit and the only
> consumer of the pty. I don't see how it could be any other process
> responsibility to perform the wakeup.

Do we see that the issue is caused by the order in which descriptors are
closed in fdfree()?  The current deadlock occurs because the duped master
has a higher fd number than the slave which means it is still open when the
slave is closed.

But why would that be a problem?  By default *close() functions,
including ttylclose() are blocking.  So any exiting process might end up
hanging in fdfree().  The diff below illustrates that by forcing all *close()
calls during exit1() to be non-blocking; it also fixes the issue.

Does it make sense to close fds as non-blocking when exiting?  What
should a dying thread wait for?  What can be the cons of such an approach?

Now regarding your fix, why does it make sense to wait 5sec instead of
indefinitely?  Did you look at r1.263 of NetBSD's kern/tty.c?  If we go
with this change could you please change the 'timo' suffix and variables
to 'nsec' and use uint64_t instead of int?

Index: kern/vfs_vnops.c
===
RCS file: /cvs/src/sys/kern/vfs_vnops.c,v
retrieving revision 1.114
diff -u -p -r1.114 vfs_vnops.c
--- kern/vfs_vnops.c8 Apr 2020 08:07:51 -   1.114
+++ kern/vfs_vnops.c2 May 2020 09:18:28 -
@@ -601,6 +601,7 @@ vn_closefile(struct file *fp, struct pro
 {
struct vnode *vp = fp->f_data;
struct flock lf;
+   unsigned int flag;
int error;

KERNEL_LOCK();
@@ -611,7 +612,10 @@ vn_closefile(struct file *fp, struct pro
lf.l_type = F_UNLCK;
(void) VOP_ADVLOCK(vp, (caddr_t)fp, F_UNLCK, &lf, F_FLOCK);
}
-   error = vn_close(vp, fp->f_flag, fp->f_cred, p);
+   flag = fp->f_flag;
+   if (p != NULL && p->p_flag & P_WEXIT)
+   flag |= O_NONBLOCK;
+   error = vn_close(vp, flag, fp->f_cred, p);
KERNEL_UNLOCK();
return (error);
 }

Re: pty leak or corruption w/ openpty + dup2?

2020-05-06 Thread Martin Pieuchot

On 02/05/20(Sat) 16:02, Mark Kettenis wrote:
> > Date: Sat, 2 May 2020 11:33:17 +0200
> > From: Martin Pieuchot 
> > [...]
> > Do we see that the issue is caused by the order in which descriptors are
> > closed in fdfree()?  The current deadlock occurs because the duped master
> > has a higher fd number than the slave which means it is still open when the
> > slave is closed.
> 
> I'm sure we could construct an example where the file descriptors are
> in a different oder.  So changing the order is not going to help.

Obviously :)

> > But why would that be a problem?  By default *close() functions,
> > including ttylclose() are blocking.  So any exiting process might end up
> > hanging in fdfree().  Diff below illustrates that by forcing all *close()
> > during exit1() to be non-blocking, it also fix the issue.
> 
> I very much fear that is going to have unintended side-effects with
> output not being flushed properly.  And the process could still
> deadlock itself by using close(2) directly isn't it?

Indeed.

> > Does it make sense to close fds as non-blocking when existing?  What
> > should a dying thread wait for?  What can be the cons of such approach? 
> > 
> > Now regarding your fix, why does it make sense to wait 5sec instead of
> > indefinitely?  Did you look at r1.263 of NetBSD's kern/tty.c?  If we go
> > with this change could you please change the 'timo' suffix and variables
> > to 'nsec' and use uint64_t instead of int?
> 
> r1.263 was reverted in r1.264.  Then r1.265 is the commit quoted by anton@.
> There is also r2.267 which adds an additional fix to r1.265.
> 
> In ttywait(), NetBSD only calls ttyflush() if there is a timeout. That
> makes sense, because we have ttywflush() to combine the wait and flush
> so ttywait() shouldn't flush when there is no error.

Updated diff below reflecting those changes.  I'm still questioning the
5sec timeout, but it is without doubt an improvement over the current
behavior.

The previously mentioned test as well as a modified version closing the
slave before exit(2) now hang for 5 seconds instead of deadlocking
indefinitely.

I believe we want that for release, ok?

Index: kern/tty.c
===
RCS file: /cvs/src/sys/kern/tty.c,v
retrieving revision 1.154
diff -u -p -r1.154 tty.c
--- kern/tty.c  7 Apr 2020 13:27:51 -   1.154
+++ kern/tty.c  6 May 2020 07:44:53 -
@@ -80,6 +80,8 @@ void  filt_ttyrdetach(struct knote *kn);
 intfilt_ttywrite(struct knote *kn, long hint);
 void   filt_ttywdetach(struct knote *kn);
 void   ttystats_init(struct itty **, size_t *);
+intttywait_nsec(struct tty *tp, uint64_t nsecs);
+intttysleep_nsec(struct tty *, void *, int, char *, uint64_t);
 
 /* Symbolic sleep message strings. */
 char ttclos[]  = "ttycls";
@@ -1202,10 +1204,10 @@ ttnread(struct tty *tp)
 }
 
 /*
- * Wait for output to drain.
+ * Wait for output to drain, or if this times out, flush it.
  */
 int
-ttywait(struct tty *tp)
+ttywait_nsec(struct tty *tp, uint64_t nsecs)
 {
int error, s;
 
@@ -1219,7 +1221,10 @@ ttywait(struct tty *tp)
(ISSET(tp->t_state, TS_CARR_ON) || ISSET(tp->t_cflag, CLOCAL))
&& tp->t_oproc) {
SET(tp->t_state, TS_ASLEEP);
-   error = ttysleep(tp, &tp->t_outq, TTOPRI | PCATCH, ttyout);
+   error = ttysleep_nsec(tp, &tp->t_outq, TTOPRI | PCATCH,
+   ttyout, nsecs);
+   if (error == EWOULDBLOCK)
+   ttyflush(tp, FWRITE);
if (error)
break;
} else
@@ -1229,6 +1234,12 @@ ttywait(struct tty *tp)
return (error);
 }
 
+int
+ttywait(struct tty *tp)
+{
+   return (ttywait_nsec(tp, INFSLP));
+}
+
 /*
  * Flush if successfully wait.
  */
@@ -1237,7 +1248,8 @@ ttywflush(struct tty *tp)
 {
int error;
 
-   if ((error = ttywait(tp)) == 0)
+   error = ttywait_nsec(tp, SEC_TO_NSEC(5));
+   if (error == 0 || error == EWOULDBLOCK)
ttyflush(tp, FREAD);
return (error);
 }
@@ -2281,11 +2293,18 @@ tputchar(int c, struct tty *tp)
 int
 ttysleep(struct tty *tp, void *chan, int pri, char *wmesg)
 {
+
+   return (ttysleep_nsec(tp, chan, pri, wmesg, INFSLP));
+}
+
+int
+ttysleep_nsec(struct tty *tp, void *chan, int pri, char *wmesg, uint64_t nsecs)
+{
int error;
short gen;
 
gen = tp->t_gen;
-   if ((error = tsleep_nsec(chan, pri, wmesg, INFSLP)) != 0)
+   if ((error = tsleep_nsec(chan, pri, wmesg, nsecs)) != 0)
return (error);
return (tp->t_gen == gen ? 0 : ERESTART);
 }
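
The control flow of the diff can be followed in a small userland model.  This is only a sketch: `struct sk_tty', `sk_sleep()' and the other sk_* names are invented here, and the sleep is simulated rather than real; it returns EWOULDBLOCK when a finite timeout would expire because nothing drains the output queue.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SK_INFSLP	UINT64_MAX	/* stand-in for "no timeout" */

struct sk_tty {
	int outq;	/* bytes still queued for output */
	int inq;	/* bytes queued for input */
	int drains;	/* non-zero if a reader will consume the output */
};

/* Simulated sleep: 0 on wakeup, EWOULDBLOCK when a finite timeout expires. */
static int
sk_sleep(struct sk_tty *tp, uint64_t nsecs)
{
	if (tp->drains) {
		tp->outq = 0;
		return 0;
	}
	/* Nobody will wake us up; an infinite sleep would never return. */
	assert(nsecs != SK_INFSLP);
	return EWOULDBLOCK;
}

/* Wait for output to drain, or discard it if the wait times out. */
static int
sk_ttywait_nsec(struct sk_tty *tp, uint64_t nsecs)
{
	int error = 0;

	while (tp->outq > 0) {
		error = sk_sleep(tp, nsecs);
		if (error == EWOULDBLOCK)
			tp->outq = 0;	/* models ttyflush(tp, FWRITE) */
		if (error)
			break;
	}
	return error;
}

/* Close-time path: bounded wait, then drop pending input as well. */
static int
sk_ttywflush(struct sk_tty *tp)
{
	int error = sk_ttywait_nsec(tp, 5ULL * 1000000000ULL);

	if (error == 0 || error == EWOULDBLOCK)
		tp->inq = 0;		/* models ttyflush(tp, FREAD) */
	return error;
}
```

With `drains' set to 0 the model returns EWOULDBLOCK with both queues emptied, which corresponds to the test program hanging for 5 seconds and then exiting.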

wsemul_vt100 & wsmux's ioctl rwlock taken in interrupt context

2020-05-06 Thread Martin Pieuchot

The following backtrace, found by robert@'s syzkaller, exposes a
context/locking issue related to wsmux's ioctl rwlock:

  panic: acquiring blockable sleep lock with spinlock or critical section held (rwlock) wsmuxlk

trace:

panic+0x15c
witness_checkorder+0x10e0
rw_enter_read+0x66
wsmux_do_displayioctl+0x7e
wsdisplay_emulbell+0x68
wsemul_vt100_output_c0c1+0x2f5
wsemul_vt100_output+0x34e
wsdisplaystart+0x396
ttrstrt+0x4b
timeout_run+0xc4
softclock+0x175
softintr_dispatch+0x107
Xsoftclock+0x1f


Grabbing `sc_lock' should obviously not be possible from softclock context.
I'm not sure what's the best way to fix this issue.  timeout_set_proc(9)
will make the warning disappear but is it the right thing to do?

Are there other interrupt-context paths that can enter this code?

The lock has been introduced to prevent access to `sc_cld' in case a
thread was sleeping in the middle of an operation.  Are we sure those
sleeping points cannot be reached by entry points from interrupt
context?

Did we consider fixes other than a lock?

Re: OpenBSD 6.7 crashes on APU2C4 with LTE modem Huawei E3372s-153 HiLink

2020-05-25 Thread Martin Pieuchot

On 25/05/20(Mon) 12:56, Gerhard Roth wrote:
> On 5/22/20 9:05 PM, Mark Kettenis wrote:
> > > From: Łukasz Lejtkowski 
> > > Date: Fri, 22 May 2020 20:51:57 +0200
> > > 
> > > Probably power supply 12 V is broken. Showing 16,87 V(Fluke 179) -
> > > too high. Should be 12,25-12,50 V. I replaced to the new one.
> > 
> > That might be why the device stops responding.  The fact that cleaning
> > up from a failed USB transaction leads to this panic is a bug though.
> > 
> > And somebody just posted a very similar panic with ure(4).  Something
> > in the network stack is holding a mutex when it shouldn't.
> 
> I think that holding the mutex is ok. The bug is calling the stop
> routine in case of errors.
> 
> This is what common foo_start() does:
> 
>   m_head = ifq_deq_begin(&ifp->if_snd);
>   if (foo_encap(sc, m_head, 0)) {
>   ifq_deq_rollback(&ifp->if_snd, m_head);
>   ...
>   return;
>   }
>   ifq_deq_commit(&ifp->if_snd, m_head);
> 
> Here, ifq_deq_begin() grabs a mutex and it is held while
> calling foo_encap().
> 
> For USB network interfaces foo_encap() mostly does this:
> 
>   err = usbd_transfer(sc->sc_xfer);
>   if (err != USBD_IN_PROGRESS) {
>   foo_stop(sc);
>   return EIO;
>   }
> 
> And foo_stop() calls usbd_abort_pipe() -> xhci_command_submit(),
> which might sleep.
> 
> How to fix? We could do the foo_encap() after the ifq_deq_commit(),
> possibly dropping the current mbuf if encap fails (who cares
> for the packets after foo_stop() anyway).

That's the approach taken by drivers using ifq_dequeue(9) instead of
ifq_deq_begin/commit().

> Or change all the drivers to follow the path that if_aue.c takes:
> 
>   err = usbd_transfer(c->aue_xfer);
>   if (err != USBD_IN_PROGRESS) {
>   ...
>   /* Stop the interface from process context. */
>   usb_add_task(sc->aue_udev, &sc->aue_stop_task);
>   return (EIO);
>   }

That's just trading the current problem for another one with higher
complexity.

> Any ideas, what's better? Or alternative proposals?

Using ifq_dequeue(9) would have the advantage of unifying the code base.
However, it introduces a behavior change.  A simpler fix would be to call
foo_stop() in the error path after ifq_deq_rollback().
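
A userland sketch of that simpler ordering, using the same placeholder style as the foo_*() pseudo-code quoted above.  All sk_* names are invented for this sketch; `mtx_held' only models the mutex that ifq_deq_begin() takes, and the assert in sk_stop() stands for "this routine may sleep":

```c
#include <assert.h>
#include <errno.h>

struct sk_ifq {
	int mtx_held;	/* models the mutex taken by ifq_deq_begin() */
	int npkts;	/* packets waiting in the send queue */
};

struct sk_softc {
	struct sk_ifq snd;
	int xfer_fails;	/* force the simulated USB transfer to fail */
	int stopped;
};

static int
sk_deq_begin(struct sk_ifq *q)
{
	if (q->npkts == 0)
		return 0;
	q->mtx_held = 1;
	return 1;
}

static void
sk_deq_rollback(struct sk_ifq *q)
{
	q->mtx_held = 0;
}

static void
sk_deq_commit(struct sk_ifq *q)
{
	q->npkts--;
	q->mtx_held = 0;
}

/* May sleep (aborting pipes), so it must never run with the mutex held. */
static void
sk_stop(struct sk_softc *sc)
{
	assert(!sc->snd.mtx_held);
	sc->stopped = 1;
}

/* Only reports the failure; it does not call the stop routine itself. */
static int
sk_encap(struct sk_softc *sc)
{
	return sc->xfer_fails ? EIO : 0;
}

static void
sk_start(struct sk_softc *sc)
{
	if (!sk_deq_begin(&sc->snd))
		return;
	if (sk_encap(sc) != 0) {
		sk_deq_rollback(&sc->snd);	/* drops the mutex first */
		sk_stop(sc);			/* now safe to sleep */
		return;
	}
	sk_deq_commit(&sc->snd);
}
```

The packet stays on the queue after the rollback, so unlike the ifq_dequeue(9) variant nothing is dropped on error.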

Re: X hangs

2020-06-09 Thread Martin Pieuchot

On 29/05/20(Fri) 15:57, Visa Hankala wrote:
> On Fri, May 29, 2020 at 04:27:46PM +0200, Alexandre Ratchov wrote:
> > On Thu, May 28, 2020 at 01:41:43PM +0100, Stuart Henderson wrote:
> > > uaudio0 at uhub7 port 2 configuration 1 interface 1 "GN Netcom GN 9350" 
> > > rev 2.00/1.00 addr 7
> > > uaudio0: class v1, full-speed, sync, channels: 1 play, 1 rec, 4 ctls
> > > audio1 at uaudio0
> > > uhidev0 at uhub7 port 2 configuration 1 interface 3 "GN Netcom GN 9350" 
> > > rev 2.00/1.00 addr 7
> > > uhidev0: iclass 3/0
> > > uhid0 at uhidev0: input=2, output=2, feature=0
> > > uaudio0: can't reset interface
> > > uaudio0: can't reset interface
> > > audio1 detached
> > > uaudio0 detached
> > > uhid0 detached
> > > uhidev0 detached
> > > RA\xaf\xdeRA\xaf\xdeRA\xaf\xdeRA\xaf\xdeRA\xaf\xdeRA\xaf\xdeRA\xaf\xde: 
> > > can't set interface
> > > kernel: protection fault trap, code=0
> > > Stopped at  uaudio_stream_close+0x8a:   movzbl  0x8(%r12),%esi
> > > ddb{3}> [-- sthen@localhost attached -- Thu May 28 11:58:19 2020]
> > > 
> > > ddb{3}> 
> > > ddb{3}> tr
> > > uaudio_stream_close(81dfb000,1) at uaudio_stream_close+0x8a
> > > uaudio_stream_open(81dfb000,1,801e8000,801eaa80,2a8,816f7630) at uaudio_stream_open+0x761
> > > uaudio_trigger_output(81dfb000,801e8000,801eaa80,2a8,816f7630,81e95c00) at uaudio_trigger_output+0x47
> > > audio_start_do(81e95c00) at audio_start_do+0xb5
> > > audioioctl(2a01,20004126,800035a74470,7,800034fe6750) at audioioctl+0x71
> > > VOP_IOCTL(fd867a72e9e0,20004126,800035a74470,7,fd84fea6f9c0,800034fe6750) at VOP_IOCTL+0x55
> > > vn_ioctl(fd867d490f10,20004126,800035a74470,800034fe6750) at vn_ioctl+0x75
> > > sys_ioctl(800034fe6750,800035a74580,800035a745e0) at sys_ioctl+0x2df
> > > syscall(800035a74650) at syscall+0x389
> > > Xsyscall() at Xsyscall+0x128
> > > end of kernel
> > 
> > According to dmesg, audio1 was detached, so we shouldn't enter
> > audio_start_do().
> > 
> > At this point the DVF_ACTIVE flag is clear; audioioctl() calls
> > device_lookup() which is supposed to return NULL in this case, so
> > ioctl() is supposed to return ENXIO, not attempt to start playback.
> 
> Lets assume that audio_start_do() started when the device was still
> attached to the system. In that case device_lookup() returned a pointer
> to a good softc. This is supported by the fact that audio_start_do() did
> not crash earlier.
> 
> Did usbd_set_interface() block for a moment, letting the detachment
> happen? The trace suggests that usbd_set_interface() failed, and when
> audio_start_do() resumed, sc pointed to freed memory.

The audio(4) driver has an unaccounted reference to uaudio(4)'s softc.
So when the USB thread responsible for detaching devices kicks in to
clean up the software state of a uaudio(4), it first spins on the
KERNEL_LOCK().  If any of the threads playing/recording audio sleeps
while holding an unaccounted reference to uaudio(4)'s softc, the above
issue can happen.

A way to fix this is to use usbd_ref_incr(9) and its counterpart
usbd_ref_wait(9) in uaudio_detach().

I'm not sure if it's possible for audio(4) to increment the reference
only once.  Is there a place where such increment/decrement can be put?

Otherwise every operation should do the dance.
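
The "dance" can be pictured with a single-threaded userland model.  This is a sketch only: the sk_* names are invented, refusing new users once detach has started is an assumption of the model, and where the model returns 0 from sk_detach_try() the real detach routine would sleep until the count drops:

```c
#include <assert.h>

struct sk_dev {
	int refcnt;	/* operations currently using the softc */
	int dying;	/* set once detach has started */
	int freed;	/* softc memory released */
};

/* Start of an operation: take a reference unless detach has begun. */
static int
sk_op_enter(struct sk_dev *d)
{
	if (d->dying)
		return -1;
	d->refcnt++;
	return 0;
}

/* End of an operation: drop the reference taken by sk_op_enter(). */
static void
sk_op_leave(struct sk_dev *d)
{
	assert(d->refcnt > 0);
	d->refcnt--;
}

/* Detach: returns 1 if the softc was freed, 0 if it has to keep waiting. */
static int
sk_detach_try(struct sk_dev *d)
{
	d->dying = 1;
	if (d->refcnt > 0)
		return 0;	/* the real code would sleep here */
	d->freed = 1;
	return 1;
}
```

An operation that sleeps between sk_op_enter() and sk_op_leave() keeps the softc alive, which is the property missing in the crash above.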

Re: ipmi problem introduced with sys/conf.h 1.150 enodev->selfalse

2020-06-29 Thread Martin Pieuchot

On 28/06/20(Sun) 22:17, Stuart Henderson wrote:
> Thanks to Jens A. Griepentrog for reporting and bisecting, we discovered
> that sys/conf.h r1.150 broke /dev/ipmi. I found a machine to test on and
> reverting the commit fixes things, but given the commit message I guess
> the diff below (which also fixes it) might be better?

Thanks for the finding.  Your diff is indeed better and is ok mpi@.

Could you please commit the version below that adds a matching kqfilter
filter for `seltrue' as well?  That will allow us to keep the behavior
when switching poll(2) to use kqueue filters.

Index: sys/conf.h
===
RCS file: /cvs/src/sys/sys/conf.h,v
retrieving revision 1.152
diff -u -p -r1.152 conf.h
--- sys/conf.h  26 May 2020 07:53:00 -  1.152
+++ sys/conf.h  29 Jun 2020 07:22:40 -
@@ -473,8 +473,8 @@ extern struct cdevsw cdevsw[];
 #define cdev_ipmi_init(c,n) { \
dev_init(c,n,open), dev_init(c,n,close), (dev_type_read((*))) enodev, \
(dev_type_write((*))) enodev, dev_init(c,n,ioctl), \
-   (dev_type_stop((*))) enodev, 0, selfalse, \
-   (dev_type_mmap((*))) enodev, 0 }
+   (dev_type_stop((*))) enodev, 0, seltrue, (dev_type_mmap((*))) enodev, \
+   0, 0, seltrue_kqfilter }
 
 /* open, close, ioctl, mmap */
 #define cdev_kcov_init(c,n) { \

Re: Supermicro X10SDV-TP8F with USB3 won't boot

2016-05-31 Thread Martin Pieuchot

On 06/05/16(Fri) 01:13, Hrvoje Popovski wrote:
> Hi,
> 
> I've got
> http://www.supermicro.com/products/motherboard/Xeon/D/X10SDV-TP8F.cfm
> for my openbsd lab. Default BIOS settings for usb is USB3 and with that
> settings i can't install openbsd on it, or boot installed openbsd.
> I have installed openbsd with disabled USB3 ie. USB2, complie kernel
> with USB_DEBUG, EHCI_DEBUG, XHCI_DEBUG, UHCI_DEBUG, enable USB3 in BIOS
> and boot... this is screenshot..
> http://kosjenka.srce.hr/~hrvoje/openbsd/usb.jpg

Could you please tell me if the diff below solves your problem?

Index: xhci_pci.c
===
RCS file: /cvs/src/sys/dev/pci/xhci_pci.c,v
retrieving revision 1.7
diff -u -p -r1.7 xhci_pci.c
--- xhci_pci.c  2 Nov 2015 14:53:10 -   1.7
+++ xhci_pci.c  31 May 2016 16:36:14 -
@@ -258,8 +258,9 @@ xhci_pci_takecontroller(struct xhci_pci_
 	eec = -1;
 
 	/* Synchronise with the BIOS if it owns the controller. */
-	for (xecp = XHCI_HCC_XECP(cparams) << 2; xecp != 0;
-	    xecp = XHCI_XECP_NEXT(eec) << 2) {
+	for (xecp = XHCI_HCC_XECP(cparams) << 2;
+	    xecp != 0 && XHCI_XECP_NEXT(eec);
+	    xecp += XHCI_XECP_NEXT(eec) << 2) {
 		eec = XREAD4(&psc->sc, xecp);
 		if (XHCI_XECP_ID(eec) != XHCI_ID_USB_LEGACY)
 			continue;
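
The loop change can be illustrated with a userland model of the extended
capability list.  It assumes, as the diff suggests, that the "next" field
is an offset in 32-bit words relative to the current capability and that
a value of zero ends the list.  find_cap(), XECP_ID() and XECP_NEXT() are
hypothetical stand-ins, and the register file is a plain array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver's field accessors. */
#define XECP_ID(x)	((x) & 0xff)		/* capability ID, bits 7:0 */
#define XECP_NEXT(x)	(((x) >> 8) & 0xff)	/* relative next pointer, in dwords */

/*
 * Walk a capability list stored in regs[], starting at byte offset `first'.
 * Returns the byte offset of the capability with the given ID, or 0 if the
 * list ends (next pointer of zero) before one is found.
 */
static uint32_t
find_cap(const uint32_t *regs, size_t nregs, uint32_t first, uint8_t id)
{
	uint32_t off = first;

	assert(regs != NULL);
	while (off != 0 && off / 4 < nregs) {
		uint32_t eec = regs[off / 4];

		if (XECP_ID(eec) == id)
			return off;
		if (XECP_NEXT(eec) == 0)
			break;			/* last capability */
		off += XECP_NEXT(eec) << 2;	/* relative, not absolute */
	}
	return 0;
}
```

The removed lines assigned the shifted "next" value to the offset instead
of adding it, which is the difference this sketch is meant to highlight.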

Re: Supermicro X10SDV-TP8F with USB3 won't boot

2016-05-31 Thread Martin Pieuchot

On 31/05/16(Tue) 21:11, Evgeniy Sudyr wrote:
> Hrvoje,
> 
> looks my last comment was wrong. I apologise for detracting from this
> important discussion.
> 
> We all need support / fix for xhci(4) driver instead of disabling USB
> 3.0 support.
> 
> I have same issue on my desktop with Asus z170-k mainboard which is
> based on Intel z170 chipset http://ark.intel.com/products/90591
> 
> Also my friend have Intel C236 chipset
> http://ark.intel.com/products/90594 and he also have same issue on
> board with both acpi and xhci
> http://www.supermicro.com/products/motherboard/Xeon/C236_C232/X11SSH-TF.cfm
> 
> I will be glad to help test patches on hardware above.

I just committed the fix, please wait for the next snapshot or build
from sources.

Martin

Re: suspend resumes immediately on Toshiba Portege R30-A-1CD

2016-06-07 Thread Martin Pieuchot

On 06/06/16(Mon) 23:20, Giovanni Bechis wrote:
> On Sun, Jun 05, 2016 at 09:39:23PM +0200, giova...@paclan.it wrote:
> > >Synopsis:  if I suspend my Toshiba laptop it resumes immediately
> > >Category:  kernel/acpi
> > >Environment:
> > System  : OpenBSD 6.0
> > Details : OpenBSD 6.0-beta (GENERIC.MP) #2150: Mon May 30 20:21:47 
> > MDT 2016
> >  
> > dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> > 
> > Architecture: OpenBSD.amd64
> > Machine : amd64
> > >Description:
> > If I suspend my Toshiba Portege R30-A-1CD laptop with lid close or with
> > zzz(8) it resumes immediately.
> > Hibernation works as expected.
> > >How-To-Repeat:
> > Suspend a Toshiba R30-A-1CD laptop.
> > >Fix:
> > Unknown.
> > 
> by disabling xhci(4) via config(8) I can suspend and resume, a strange
> thing is that to resume I have to press the power button, any other key does
> not do the job.

What does your dmesg look like after trying to suspend?

Re: Touchscreen device is not calibrated after wake from suspend

2016-06-07 Thread Martin Pieuchot

On 04/06/16(Sat) 14:49, Edd Barrett wrote:
> Hi,
> 
> My x240t has a touch screen and a stylus. It works well upon first boot
> (aside from the pointer co-ordinates not yet being translated when the
> screen is rotated). However, after suspend and wake, placing the pen on
> or near the screen will cause the pointer to jump to the bottom right
> hand corner of the screen.
> 
> I can fix this with:
> 
>  $ xinput --disable /dev/wsmouse3
>  $ xinput --enable /dev/wsmouse3
> 
> Looking in my mail archives, I see that I spoke to mpi and matthieu
> about this a while back (adding to CC). We did not find a suitable fix
> and the problem persists.
> 
> Matthieu's working theory was (and still is Matthieu?) as follows:
> 
> ---8<---
> * Machine is resuming
> * X comes back and at the same time USB devices reattaches
> * the above is racy, so sometimes X comes back before its previous input
>   devices are back
> * When that happens, X cannot reopen the input device, so it disables
>   it (but not cleanly - that's another issue I want to look at)
> * When the USB device is reattached later, it gets back to the mux
> * xf86-input-ws only gets events through the mux and thus can't apply
>   proper calibration
> --->8---

The problem starts when you suspend.  Your device is detached, so the
next time X tries to read from the corresponding /dev/wsmouse1 node, the
read will fail.  In practice, X performs that read after resuming.

Maybe Ulf has an idea of how to move the calibration logic to the kernel
such that as soon as a new device attaches, it gets calibrated to match
the corresponding screen.

Re: panic in upd

2016-06-07 Thread Martin Pieuchot

On 01/06/16(Wed) 16:39, Martijn van Duren wrote:
> [...] 
> upd0 detached
> uhidev0 detached
> kernel: protection fault trap, code=0
> Stopped at      upd_sensor_invalidate+0xe:      movq    0xc8(%rsi),%rbx
> ddb{0}> trace
> upd_sensor_invalidate() at upd_sensor_invalidate+0xe
> upd_update_report_cb() at upd_update_report_cb+0x5b
> uhidev_get_report_async_cb() at uhidev_get_report_async_cb+0x39
> usb_transfer_complete() at usb_transfer_complete+0x26c
> xhci_event_command() at xhci_event_command+0x1c8
> xhci_event_dequeue() at xhci_event_dequeue+0x8a
> xhci_softintr() at xhci_softintr+0x21
> softintr_dispatch() at softintr_dispatch+0x8b
> end of kernel
> end trace frame: 0x72defae4a00, count: -8

This looks like a race between the asynchronous callback and the device
being detached.

The problem is that the driver has already freed its memory by the time
the transfer completes.  By checking whether the device is dying before
calling the callback, we should prevent such a crash.

Could you at least confirm that the diff below does not introduce any
regression?

Index: uhidev.c
===
RCS file: /cvs/src/sys/dev/usb/uhidev.c,v
retrieving revision 1.73
diff -u -p -r1.73 uhidev.c
--- uhidev.c	9 Jan 2016 04:14:42 -   1.73
+++ uhidev.c	7 Jun 2016 15:21:15 -
@@ -96,8 +96,7 @@ void uhidev_attach(struct device *, stru
 int uhidev_detach(struct device *, int);
 int uhidev_activate(struct device *, int);
 
-void uhidev_get_report_async_cb(struct usbd_xfer *xfer, void *priv,
-usbd_status status);
+void uhidev_get_report_async_cb(struct usbd_xfer *, void *, usbd_status);
 
 struct cfdriver uhidev_cd = {
 	NULL, "uhidev", DV_DULL
@@ -754,17 +753,19 @@ uhidev_get_report_async_cb(struct usbd_x
 	char *buf;
 	int len = -1;
 
-	if (err == USBD_NORMAL_COMPLETION || err == USBD_SHORT_XFER) {
-		len = xfer->actlen;
-		buf = KERNADDR(&xfer->dmabuf, 0);
-		if (info->id > 0) {
-			len--;
-			memcpy(info->data, buf + 1, len);
-		} else {
-			memcpy(info->data, buf, len);
+	if (!usbd_is_dying(xfer->pipe->device)) {
+		if (err == USBD_NORMAL_COMPLETION || err == USBD_SHORT_XFER) {
+			len = xfer->actlen;
+			buf = KERNADDR(&xfer->dmabuf, 0);
+			if (info->id > 0) {
+				len--;
+				memcpy(info->data, buf + 1, len);
+			} else {
+				memcpy(info->data, buf, len);
+			}
 		}
+		info->callback(info->priv, info->id, info->data, len);
 	}
-	info->callback(info->priv, info->id, info->data, len);
 	free(info, M_TEMP, sizeof(*info));
 	usbd_free_xfer(xfer);
 }
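
The pattern in the diff (skip the client callback when the device is
going away, but still release the request) can be sketched in userland
as follows.  This is a simplified model under stated assumptions, not
the driver code: every model_* name is hypothetical, and a plain flag
stands in for the usbd_is_dying() check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct model_device {
	bool dying;		/* set once detach has started */
};

struct model_request {
	struct model_device *dev;
	void (*callback)(void *priv, int len);
	void *priv;
};

/* Example client callback: counts how often it was invoked. */
static void
model_count_cb(void *priv, int len)
{
	(void)len;
	(*(int *)priv)++;
}

/* Allocate a request; returns NULL on allocation failure. */
static struct model_request *
model_request_new(struct model_device *dev,
    void (*callback)(void *, int), void *priv)
{
	struct model_request *req = malloc(sizeof(*req));

	if (req != NULL) {
		req->dev = dev;
		req->callback = callback;
		req->priv = priv;
	}
	return req;
}

/*
 * Completion handler: skip the client callback when the device is dying,
 * but always release the request.  Returns true if the callback ran.
 */
static bool
model_complete(struct model_request *req, int len)
{
	bool called = false;

	assert(req != NULL && req->dev != NULL);
	if (!req->dev->dying) {
		req->callback(req->priv, len);
		called = true;
	}
	free(req);
	return called;
}
```

The point of the sketch is that the client's private data is never
touched once the dying flag is set, while the request itself is freed
on both paths.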

Re: uvm_fault in ip6_output_ipsec_lookup() / ip6_output()

2016-06-14 Thread Martin Pieuchot

On 14/06/16(Tue) 15:18, Florian Obser wrote:
> Hi,
> I'm seeing this panic on my v6 gateway running in a vm (don't ask):
> It has a v6 tunnel via HE on gif0.
> 
> I hope I copied all relevant information, if not, my apologies, I'm
> in a hurry currently, please just ask for more.
> 
> I will probably investigate more when I'm home :)
> 
> panic: trap type 6, code=0, pc=812fe70f
> Starting stack trace...
> panic() at panic+0x10b
> trap() at trap+0x7b8
> --- trap (number 6) ---
> ip6_output_ipsec_lookup() at ip6_output_ipsec_lookup+0x6f
> ip6_output() at ip6_output+0x21c
> esp_output_cb() at esp_output_cb+0x135
> taskq_thread() at taskq_thread+0x6c
> end trace frame: 0x0, count: 251
> End of stack trace.
> syncing disks... done

This seems to be an invalid `tdbi' dereference in ip6_output_ipsec_lookup():

2890:   tdbi = (struct tdb_ident *)(mtag + 1);
2891: HERE ->   if (tdbi->spi == tdb->tdb_spi &&
2892:   tdbi->proto == tdb->tdb_sproto &&
...

Markus, Mike any idea how this could happen?

> on:
> 
> OpenBSD 6.0-beta (GENERIC.MP) #2165: Thu Jun  2 08:37:59 MDT 2016
> dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> 
> 
> it has ddb.panic=0 but I can change that when I'm home.
> 
> [florian@openbsd:~]$ doas cat /etc/ipsec.conf
> ike esp from 2001:470:7afd::1 \
> to 2a02:d40:3:1:4c7:b9ff:fede:705f \
> psk XXX
> 
> ike esp from 2001:470:7afd:1::1 \
> to 2a02:d40:3:1:4c7:b9ff:fede:705f \
> psk XXX
> 
> ike esp from 2001:470:1f14:47e::2 \
> to 2a02:d40:3:1:4c7:b9ff:fede:705f \
> psk XXX
> 
> I can trigger the panic when the flows are up and I do this on the
> remote system:
> 
> [florian@tlakh:~]$ ping6 -I 2a02:d40:3:1:4c7:b9ff:fede:705f 2001:470:7afd::1
> 
> 
> [florian@openbsd:~]$ doas ipsecctl -sa
> FLOWS:
> flow esp in from 2a02:d40:3:1:4c7:b9ff:fede:705f to 2001:470:1f14:47e::2 peer 
> 2a02:d40:3:1:4c7:b9ff:fede:705f srcid 2001:470:1f14:47e::2/128 dstid 
> 2a02:d40:3:1:4c7:b9ff:fede:705f/128 type use
> flow esp out from 2001:470:1f14:47e::2 to 2a02:d40:3:1:4c7:b9ff:fede:705f 
> peer 2a02:d40:3:1:4c7:b9ff:fede:705f srcid 2001:470:1f14:47e::2/128 dstid 
> 2a02:d40:3:1:4c7:b9ff:fede:705f/128 type require
> flow esp in from 2a02:d40:3:1:4c7:b9ff:fede:705f to 2001:470:7afd::1 peer 
> 2a02:d40:3:1:4c7:b9ff:fede:705f srcid 2001:470:1f14:47e::2/128 dstid 
> 2a02:d40:3:1:4c7:b9ff:fede:705f/128 type use
> flow esp out from 2001:470:7afd::1 to 2a02:d40:3:1:4c7:b9ff:fede:705f peer 
> 2a02:d40:3:1:4c7:b9ff:fede:705f srcid 2001:470:1f14:47e::2/128 dstid 
> 2a02:d40:3:1:4c7:b9ff:fede:705f/128 type require
> flow esp in from 2a02:d40:3:1:4c7:b9ff:fede:705f to 2001:470:7afd:1::1 peer 
> 2a02:d40:3:1:4c7:b9ff:fede:705f srcid 2001:470:1f14:47e::2/128 dstid 
> 2a02:d40:3:1:4c7:b9ff:fede:705f/128 type use
> flow esp out from 2001:470:7afd:1::1 to 2a02:d40:3:1:4c7:b9ff:fede:705f peer 
> 2a02:d40:3:1:4c7:b9ff:fede:705f srcid 2001:470:1f14:47e::2/128 dstid 
> 2a02:d40:3:1:4c7:b9ff:fede:705f/128 type require
> 
> SAD:
> esp tunnel from 2001:470:1f14:47e::2 to 2a02:d40:3:1:4c7:b9ff:fede:705f spi 
> 0x07b097ae auth hmac-sha2-256 enc aes
> esp tunnel from 2001:470:1f14:47e::2 to 2a02:d40:3:1:4c7:b9ff:fede:705f spi 
> 0x471d9a35 auth hmac-sha2-256 enc aes
> esp tunnel from 2001:470:1f14:47e::2 to 2a02:d40:3:1:4c7:b9ff:fede:705f spi 
> 0x4d6962f0 auth hmac-sha2-256 enc aes
> esp tunnel from 2a02:d40:3:1:4c7:b9ff:fede:705f to 2001:470:1f14:47e::2 spi 
> 0x546e354d auth hmac-sha2-256 enc aes
> esp tunnel from 2a02:d40:3:1:4c7:b9ff:fede:705f to 2001:470:1f14:47e::2 spi 
> 0x9d83602b auth hmac-sha2-256 enc aes
> esp tunnel from 2a02:d40:3:1:4c7:b9ff:fede:705f to 2001:470:1f14:47e::2 spi 
> 0xe2d99e91 auth hmac-sha2-256 enc aes
> 
> 
> [florian@openbsd:~]$ netstat -rn
> Routing tables
> 
> Internet:
> DestinationGatewayFlags   Refs  Use   Mtu  Prio Iface
> default192.168.2.254  UGS   17 4831 - 8 vio0
> 224/4  127.0.0.1  URS00 32768 8 lo0
> 10.11.12/2410.11.12.1 UC 10 - 4 vio1
> 10.11.12.1 52:54:00:15:bb:62  UHLl   01 - 1 vio1
> 10.11.12.3252:54:00:dc:6f:cd  UHLc   0  144 - 4 vio1
> 10.11.12.255   10.11.12.1 UHb00 - 1 vio1
> 127/8  127.0.0.1  UGRS   00 32768 8 lo0
> 127.0.0.1  127.0.0.1  UHl   12 1129 32768 1 lo0
> 192.168.2/24   192.168.2.253  UC 22 - 4 vio0
> 192.168.2.180:ee:73:67:d1:9c  UHLc   18 - 4 vio0
> 192.168.2.253  52:54:00:1a:59:59  UHLl   1 9560 - 1 vio0
> 192.168.2.254  4c:09:d4:ca:0c:b2  UHLc   15 - 4 vio0
> 192.168.2.255  192.168.2.253  UHb0   14 - 1 vio0
> 
> Internet6:
> Destination

Re: uvm_fault in ip6_output_ipsec_lookup() / ip6_output()

2016-06-15 Thread Martin Pieuchot

On 14/06/16(Tue) 20:10, Florian Obser wrote:
> On Tue, Jun 14, 2016 at 06:26:00PM +0200, Martin Pieuchot wrote:
> > On 14/06/16(Tue) 15:18, Florian Obser wrote:
> > > Hi,
> > > I'm seeing this panic on my v6 gateway running in a vm (don't ask):
> > > It has a v6 tunnel via HE on gif0.
> > > 
> > > I hope I copied all relevant information, if not, my apologies, I'm
> > > in a hurry currently, please just ask for more.
> > > 
> > > I will probably investigate more when I'm home :)
> > > 
> > > panic: trap type 6, code=0, pc=812fe70f
> > > Starting stack trace...
> > > panic() at panic+0x10b
> > > trap() at trap+0x7b8
> > > --- trap (number 6) ---
> > > ip6_output_ipsec_lookup() at ip6_output_ipsec_lookup+0x6f
> > > ip6_output() at ip6_output+0x21c
> > > esp_output_cb() at esp_output_cb+0x135
> > > taskq_thread() at taskq_thread+0x6c
> > > end trace frame: 0x0, count: 251
> > > End of stack trace.
> > > syncing disks... done
> > 
> > This seems to be an invalid `tdbi' dereference in
> > ip6_output_ipsec_lookup():
> > 
> > 2890:   tdbi = (struct tdb_ident *)(mtag + 1);
> > 2891: HERE ->   if (tdbi->spi == tdb->tdb_spi &&
> > 2892:   tdbi->proto == tdb->tdb_sproto &&
> > ...
> > 
> > Markus, Mike any idea how this could happen?
> 
> I tracked it down to ref 1.89 of ip6_forward.c / ref 1.205 ip6_output.c:
> "factor out ipsec into ip6_output_ipsec_{lookup,send}(); ok mpi@, naddy@"
> 
> The problem is that we are not exiting the "loop detection" for loop
> when tdb is set to NULL. We enter again and dereference tdb -> boom.
> 
> The following diff makes ip6_output_ipsec_lookup() similar to
> ip_output_ipsec_lookup().
> It's easier to see what the diff is doing by applying and doing diff -b.
> OK?

ok mpi@

> p.s. I also note that the v4 and v6 version are really similar, we
> can probably merge them. Wonder if it's worth it or if it's best to
> keep v4 and v6 separate...

For the moment we're trying to reduce the size of "#ifdef IPSEC" chunks
inside the IP paths.  But reducing differences between v4 and v6 by
reusing code is a good thing.

> diff --git ip6_output.c ip6_output.c
> index 64eea86..3adaa7d 100644
> --- ip6_output.c
> +++ ip6_output.c
> @@ -2882,21 +2882,21 @@ ip6_output_ipsec_lookup(struct mbuf *m, int *error, struct inpcb *inp)
>  	tdb = ipsp_spd_lookup(m, AF_INET6, sizeof(struct ip6_hdr),
>  	    error, IPSP_DIRECTION_OUT, NULL, inp, 0);
>  
> -	if (tdb != NULL) {
> -		/* Loop detection */
> -		for (mtag = m_tag_first(m); mtag != NULL;
> -		    mtag = m_tag_next(m, mtag)) {
> -			if (mtag->m_tag_id != PACKET_TAG_IPSEC_OUT_DONE)
> -				continue;
> -			tdbi = (struct tdb_ident *)(mtag + 1);
> -			if (tdbi->spi == tdb->tdb_spi &&
> -			    tdbi->proto == tdb->tdb_sproto &&
> -			    tdbi->rdomain == tdb->tdb_rdomain &&
> -			    !bcmp(&tdbi->dst, &tdb->tdb_dst,
> -			    sizeof(union sockaddr_union)))
> -				tdb = NULL;
> +	if (tdb == NULL)
> +		return NULL;
> +	/* Loop detection */
> +	for (mtag = m_tag_first(m); mtag != NULL; mtag = m_tag_next(m, mtag)) {
> +		if (mtag->m_tag_id != PACKET_TAG_IPSEC_OUT_DONE)
> +			continue;
> +		tdbi = (struct tdb_ident *)(mtag + 1);
> +		if (tdbi->spi == tdb->tdb_spi &&
> +		    tdbi->proto == tdb->tdb_sproto &&
> +		    tdbi->rdomain == tdb->tdb_rdomain &&
> +		    !memcmp(&tdbi->dst, &tdb->tdb_dst,
> +		    sizeof(union sockaddr_union))) {
> +			/* no IPsec needed */
> +			return NULL;
>  		}
> -		/* We need to do IPsec */
>  	}
>  	return tdb;
>  }
> 
> 
> -- 
> I'm not entirely sure you are real.
>
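
The failure mode described above (the old loop clears tdb on a match,
keeps iterating, and dereferences it again on the next matching tag) and
the early-return fix can be sketched with a reduced userland model.  The
model_* types are hypothetical and compare only an SPI, where the real
code also compares protocol, rdomain and destination.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_TAG_IPSEC_OUT_DONE 1

struct model_tag {
	int id;
	uint32_t spi;
	const struct model_tag *next;
};

struct model_tdb {
	uint32_t spi;
};

/*
 * Returns tdb if IPsec processing is still needed, or NULL if there is
 * no policy or a tag shows the packet was already processed with it.
 * Returning from inside the loop means tdb is never read after a match.
 */
static const struct model_tdb *
model_lookup(const struct model_tdb *tdb, const struct model_tag *tags)
{
	const struct model_tag *t;

	if (tdb == NULL)
		return NULL;
	/* Loop detection */
	for (t = tags; t != NULL; t = t->next) {
		assert(t->next != t);	/* model sanity: no self-linked tag */
		if (t->id != MODEL_TAG_IPSEC_OUT_DONE)
			continue;
		if (t->spi == tdb->spi)
			return NULL;	/* no IPsec needed */
	}
	return tdb;
}
```

A chain holding two matching tags is the interesting input: a loop that
merely set tdb to NULL on the first match would read through that NULL
pointer when it reached the second one.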

Re: snapshot bsd.rd delay after umass at uhub (Lenovo x220 F5521gw WWAN)

2016-06-19 Thread Martin Pieuchot

On 19/06/16(Sun) 11:26, Marcus MERIGHI wrote:
> When booting bsd.rd, after the line
> 
> umass0 at uhub4 port 4 configuration 3 interface 0 "Lenovo F5521gw" rev
>   2.00/0.00 addr 3
> 
> there is a long (minutes) delay.
> 
> To me it seems bsd.rd these days finds a umass device the so-called WWAN
> interface (GPRS/UMTS/LTE+GPS) provides.

Let me guess, this device doesn't attach as umass(4) with bsd.  I bet
the umass(4) driver is generating timeouts.  If you don't use the device,
you can disable it in your BIOS.

> 
> Possibly related:
> http://marc.info/?l=openbsd-tech&m=146619500807823
> "It revamps the way we look up interface descriptors quite a bit. I
> removed the unused code for matching devices based on vendor and product
> ids." (kettenis@)
> 
> dmesg of bsd.rd and unmodified self-compiled kernel below. lsusb -v
> output at the very end.
> 
> OpenBSD 6.0-beta (RAMDISK_CD) #1982: Sat Jun 18 11:42:13 MDT 2016
> r...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/RAMDISK_CD
> RTC BIOS diagnostic error 80
> real mem = 8451125248 (8059MB)
> avail mem = 8193282048 (7813MB)
> mainbus0 at root
> bios0 at mainbus0: SMBIOS rev. 2.6 @ 0xdae9c000 (64 entries)
> bios0: vendor LENOVO version "8DET72WW (1.42 )" date 02/18/2016
> bios0: LENOVO 4291QQ1
> acpi0 at bios0: rev 2
> acpi0: tables DSDT FACP SLIC SSDT SSDT SSDT HPET APIC MCFG ECDT ASF! TCPA 
> SSDT SSDT DMAR UEFI UEFI UEFI
> acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
> cpu0 at mainbus0: apid 0 (boot processor)
> cpu0: Intel(R) Core(TM) i7-2620M CPU @ 2.70GHz, 797.54 MHz
> cpu0: 
> FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,NXE,LONG,LAHF,PERF,ITSC,SENSOR,ARAT
> cpu0: 256KB 64b/line 8-way L2 cache
> cpu0: apic clock running at 99MHz
> cpu0: mwait min=64, max=64, C-substates=0.2.1.1.2, IBE
> cpu at mainbus0: not configured
> cpu at mainbus0: not configured
> cpu at mainbus0: not configured
> ioapic0 at mainbus0: apid 2 pa 0xfec0, version 20, 24 pins
> acpiec0 at acpi0
> acpiprt0 at acpi0: bus 0 (PCI0)
> acpiprt1 at acpi0: bus -1 (PEG_)
> acpiprt2 at acpi0: bus 2 (EXP1)
> acpiprt3 at acpi0: bus 3 (EXP2)
> acpiprt4 at acpi0: bus 5 (EXP4)
> acpiprt5 at acpi0: bus 13 (EXP5)
> acpiprt6 at acpi0: bus 14 (EXP7)
> acpicpu at acpi0 not configured
> acpipwrres at acpi0 not configured
> acpitz at acpi0 not configured
> "PNP0C0D" at acpi0 not configured
> "PNP0C0E" at acpi0 not configured
> "PNP0303" at acpi0 not configured
> "LEN0020" at acpi0 not configured
> "PNP0C0A" at acpi0 not configured
> "ACPI0003" at acpi0 not configured
> "LEN0068" at acpi0 not configured
> "PNP0C14" at acpi0 not configured
> "PNP0C14" at acpi0 not configured
> pci0 at mainbus0 bus 0
> pchb0 at pci0 dev 0 function 0 "Intel Core 2G Host" rev 0x09
> vga1 at pci0 dev 2 function 0 "Intel HD Graphics 3000" rev 0x09
> wsdisplay1 at vga1 mux 1: console (80x25, vt100 emulation)
> "Intel 6 Series MEI" rev 0x04 at pci0 dev 22 function 0 not configured
> em0 at pci0 dev 25 function 0 "Intel 82579LM" rev 0x04: msi, address 
> f0:de:f1:8f:84:ac
> ehci0 at pci0 dev 26 function 0 "Intel 6 Series USB" rev 0x04: apic 2 int 16
> usb0 at ehci0: USB revision 2.0
> uhub0 at usb0 "Intel EHCI root hub" rev 2.00/1.00 addr 1
> "Intel 6 Series HD Audio" rev 0x04 at pci0 dev 27 function 0 not configured
> ppb0 at pci0 dev 28 function 0 "Intel 6 Series PCIE" rev 0xb4: msi
> pci1 at ppb0 bus 2
> ppb1 at pci0 dev 28 function 1 "Intel 6 Series PCIE" rev 0xb4: msi
> pci2 at ppb1 bus 3
> iwn0 at pci2 dev 0 function 0 "Intel Centrino Ultimate-N 6300" rev 0x35: msi, 
> MIMO 3T3R, MoW, address 00:24:d7:f0:ea:90
> ppb2 at pci0 dev 28 function 3 "Intel 6 Series PCIE" rev 0xb4: msi
> pci3 at ppb2 bus 5
> ppb3 at pci0 dev 28 function 4 "Intel 6 Series PCIE" rev 0xb4: msi
> pci4 at ppb3 bus 13
> sdhc0 at pci4 dev 0 function 0 "Ricoh 5U822 SD/MMC" rev 0x07: apic 2 int 16
> sdhc0: SDHC 3.0, 50 MHz base clock
> sdmmc0 at sdhc0: 4-bit, sd high-speed, mmc high-speed, dma
> ppb4 at pci0 dev 28 function 6 "Intel 6 Series PCIE" rev 0xb4: msi
> pci5 at ppb4 bus 14
> xhci0 at pci5 dev 0 function 0 "NEC xHCI" rev 0x04: msi
> usb1 at xhci0: USB revision 3.0
> uhub1 at usb1 "NEC xHCI root hub" rev 3.00/1.00 addr 1
> ehci1 at pci0 dev 29 function 0 "Intel 6 Series USB" rev 0x04: apic 2 int 23
> usb2 at ehci1: USB revision 2.0
> uhub2 at usb2 "Intel EHCI root hub" rev 2.00/1.00 addr 1
> "Intel QM67 LPC" rev 0x04 at pci0 dev 31 function 0 not configured
> ahci0 at pci0 dev 31 function 2 "Intel 6 Series AHCI" rev 0x04: msi, AHCI 1.3
> ahci0: port 0: 6.0Gb/s
> scsibus0 at ahci0: 32 targets
> sd0 at scsibus0 targ 0 lun 0:  SCSI3 0/direct 
> fixed naa.5001b44e1d7ef244
> sd0: 228936MB, 512 bytes/sector, 468862128 sectors, thin
> "Intel 6 Series SMBus" rev 0x04 at pci0 dev 31 function 3 not configured
> isa0 at mainbus0
> pckbc0 at isa0 port

Re: snapshot bsd.rd delay after umass at uhub (Lenovo x220 F5521gw WWAN)

2016-06-19 Thread Martin Pieuchot

On 19/06/16(Sun) 14:23, Marcus MERIGHI wrote:
> m...@openbsd.org (Martin Pieuchot), 2016.06.19 (Sun) 13:28 (CEST):
> > On 19/06/16(Sun) 11:26, Marcus MERIGHI wrote:
> > > When booting bsd.rd, after the line
> > > 
> > > umass0 at uhub4 port 4 configuration 3 interface 0 "Lenovo F5521gw" rev
> > >   2.00/0.00 addr 3
> > > 
> > > there is a long (minutes) delay.
> > > 
> > > To me it seems bsd.rd these days finds a umass device the so-called WWAN
> > > interface (GPRS/UMTS/LTE+GPS) provides.
> > 
> > Let me guess, this device doesn't attach as umass(4) with bsd. 
> 
> True.
> But as opposed to Jun 2 snapshot I now get a ugen0 device (apart from
> three ucom(4)s). 
> 
> 
> > I bet the umass(4) driver is generating timeouts, If you don't use the
> > device you can disable it in your BIOS.
> 
> True. 
> Disabling is bad for finding out whether umb(4) would support the
> device...

Well, if you want to debug it, build your own bsd.rd with UMASS_DEBUG
and/or SCSI_DEBUG to see which command is timing out.
