Re: incorrect result from getppid for ptraced processes

2020-09-07 Thread Mateusz Guzik
On 9/5/20, Philip Guenther  wrote:
> On Fri, Sep 4, 2020 at 2:59 PM Mateusz Guzik  wrote:
>
>> On 9/5/20, Philip Guenther  wrote:
>> > On Fri, Sep 4, 2020 at 1:06 PM Mateusz Guzik  wrote:
>> >
>> >> On 9/4/20, Vitaliy Makkoveev  wrote:
>> >> > On Fri, Sep 04, 2020 at 05:24:42PM +0200, Mateusz Guzik wrote:
>> >> >> getppid blindly follows the parent pointer and reads the pid.
>> >> >>
>> >> >> The problem is that ptrace reparents the traced process, so in
>> >> >> particular if you gdb -p $something, the target proc will start
>> >> >> seeing
>> >> >> gdb instead of its actual parent.
>> >> >>
>> >> >> There is a lot to say about the entire reparenting business or
>> >> >> storing
>> >> >> the original pid in ps_oppid (instead of some form of a reference
>> >> >> to
>> >> >> the process).
>> >> >>
>> >> >> However, I think the most feasible fix for now is the same thing
>> >> >> FreeBSD did: *always* store the actual parent pid in ps_oppid. This
>> >> >> means all reparenting will keep updating it (most notably when
>> >> >> abandoning children on exit), while ptrace will skip that part.
>> >> >>
>> >> >> A side effect of such a change would be that getppid will stop
>> >> >> requiring the
>> >> >> kernel lock.
>> >> >>
>> >> >
>> >> > Thanks for the report. But we are in the beta stage now, so such a
>> >> > modification is impossible until the next iteration.
>> >> >
>> >> > Since the original parent identifier is stored as `ps_oppid' while
>> >> > the process is traced, we can just return it to userland in this
>> >> > case. This is the way I propose to fix this bug for now.
>> >> >
>> >> > Comments? OKs?
>> >> >
>> >> > Index: sys/kern/kern_prot.c
>> >> > ===
>> >> > RCS file: /cvs/src/sys/kern/kern_prot.c,v
>> >> > retrieving revision 1.76
>> >> > diff -u -p -r1.76 kern_prot.c
>> >> > --- sys/kern/kern_prot.c  9 Jul 2019 12:23:25 -   1.76
>> >> > +++ sys/kern/kern_prot.c  4 Sep 2020 21:12:15 -
>> >> > @@ -84,7 +84,11 @@ int
>> >> >  sys_getppid(struct proc *p, void *v, register_t *retval)
>> >> >  {
>> >> >
>> >> > - *retval = p->p_p->ps_pptr->ps_pid;
>> >> > + if (p->p_p->ps_flags & PS_TRACED)
>> >> > + *retval = p->p_p->ps_oppid;
>> >> > + else
>> >> > + *retval = p->p_p->ps_pptr->ps_pid;
>> >> > +
>> >> >   return (0);
>> >> >  }
>> >>
>> >> This is definitely a bare minimum fix, but it does the job.
>> >>
>> >
>> > ptrace() has behaved like this for the life of OpenBSD and an
>> > indefinite
>> > number of years previous in the BSD releases.  What has happened that a
>> > definitely incomplete fix is needed Right Now?
>>
>> I don't see how this reads as a demand that this be fixed Right Now.
>>
>
> I didn't call it a demand, but the point stands: what has changed?
>

Nothing has changed. I don't use the system apart from sometimes
testing stuff in a VM. There is 0 pressure on my end to fix this.

>
>> I don't see how the fix is incomplete either. It can be done better
>> with more effort, but AFAICS the above results in correct behavior.
>>
>
> There are at least 2 other uses of ps_pptr->ps_pid that should also change,
> unless you like coredumps and ps disagreeing with getppid(), and someone
> needs to think how it affects doas.
>

I see, I did not grep for these.

-- 
Mateusz Guzik 



Re: incorrect result from getppid for ptraced processes

2020-09-04 Thread Mateusz Guzik
On 9/5/20, Philip Guenther  wrote:
> On Fri, Sep 4, 2020 at 1:06 PM Mateusz Guzik  wrote:
>
>> On 9/4/20, Vitaliy Makkoveev  wrote:
>> > On Fri, Sep 04, 2020 at 05:24:42PM +0200, Mateusz Guzik wrote:
>> >> getppid blindly follows the parent pointer and reads the pid.
>> >>
>> >> The problem is that ptrace reparents the traced process, so in
>> >> particular if you gdb -p $something, the target proc will start seeing
>> >> gdb instead of its actual parent.
>> >>
>> >> There is a lot to say about the entire reparenting business or storing
>> >> the original pid in ps_oppid (instead of some form of a reference to
>> >> the process).
>> >>
>> >> However, I think the most feasible fix for now is the same thing
>> >> FreeBSD did: *always* store the actual parent pid in ps_oppid. This
>> >> means all reparenting will keep updating it (most notably when
>> >> abandoning children on exit), while ptrace will skip that part.
>> >>
>> >> A side effect of such a change would be that getppid will stop requiring the
>> >> kernel lock.
>> >>
>> >
>> > Thanks for the report. But we are in the beta stage now, so such a
>> > modification is impossible until the next iteration.
>> >
>> > Since the original parent identifier is stored as `ps_oppid' while the
>> > process is traced, we can just return it to userland in this case. This
>> > is the way I propose to fix this bug for now.
>> >
>> > Comments? OKs?
>> >
>> > Index: sys/kern/kern_prot.c
>> > ===
>> > RCS file: /cvs/src/sys/kern/kern_prot.c,v
>> > retrieving revision 1.76
>> > diff -u -p -r1.76 kern_prot.c
>> > --- sys/kern/kern_prot.c  9 Jul 2019 12:23:25 -   1.76
>> > +++ sys/kern/kern_prot.c  4 Sep 2020 21:12:15 -
>> > @@ -84,7 +84,11 @@ int
>> >  sys_getppid(struct proc *p, void *v, register_t *retval)
>> >  {
>> >
>> > - *retval = p->p_p->ps_pptr->ps_pid;
>> > + if (p->p_p->ps_flags & PS_TRACED)
>> > + *retval = p->p_p->ps_oppid;
>> > + else
>> > + *retval = p->p_p->ps_pptr->ps_pid;
>> > +
>> >   return (0);
>> >  }
>>
>> This is definitely a bare minimum fix, but it does the job.
>>
>
> ptrace() has behaved like this for the life of OpenBSD and an indefinite
> number of years previous in the BSD releases.  What has happened that a
> definitely incomplete fix is needed Right Now?
>

I don't see how this reads as a demand that this be fixed Right Now.

I don't see how the fix is incomplete either. It can be done better
with more effort, but AFAICS the above results in correct behavior.

-- 
Mateusz Guzik 



Re: incorrect result from getppid for ptraced processes

2020-09-04 Thread Mateusz Guzik
On 9/4/20, Vitaliy Makkoveev  wrote:
> On Fri, Sep 04, 2020 at 05:24:42PM +0200, Mateusz Guzik wrote:
>> getppid blindly follows the parent pointer and reads the pid.
>>
>> The problem is that ptrace reparents the traced process, so in
>> particular if you gdb -p $something, the target proc will start seeing
>> gdb instead of its actual parent.
>>
>> There is a lot to say about the entire reparenting business or storing
>> the original pid in ps_oppid (instead of some form of a reference to
>> the process).
>>
>> However, I think the most feasible fix for now is the same thing
>> FreeBSD did: *always* store the actual parent pid in ps_oppid. This
>> means all reparenting will keep updating it (most notably when
>> abandoning children on exit), while ptrace will skip that part.
>>
>> A side effect of such a change would be that getppid will stop requiring the
>> kernel lock.
>>
>
> Thanks for the report. But we are in the beta stage now, so such a
> modification is impossible until the next iteration.
>
> Since the original parent identifier is stored as `ps_oppid' while the
> process is traced, we can just return it to userland in this case. This
> is the way I propose to fix this bug for now.
>
> Comments? OKs?
>
> Index: sys/kern/kern_prot.c
> ===
> RCS file: /cvs/src/sys/kern/kern_prot.c,v
> retrieving revision 1.76
> diff -u -p -r1.76 kern_prot.c
> --- sys/kern/kern_prot.c  9 Jul 2019 12:23:25 -   1.76
> +++ sys/kern/kern_prot.c  4 Sep 2020 21:12:15 -
> @@ -84,7 +84,11 @@ int
>  sys_getppid(struct proc *p, void *v, register_t *retval)
>  {
>
> - *retval = p->p_p->ps_pptr->ps_pid;
> + if (p->p_p->ps_flags & PS_TRACED)
> +         *retval = p->p_p->ps_oppid;
> + else
> + *retval = p->p_p->ps_pptr->ps_pid;
> +
>   return (0);
>  }
>
>

This is definitely a bare minimum fix, but it does the job.

-- 
Mateusz Guzik 



incorrect result from getppid for ptraced processes

2020-09-04 Thread Mateusz Guzik
getppid blindly follows the parent pointer and reads the pid.

The problem is that ptrace reparents the traced process, so in
particular if you gdb -p $something, the target proc will start seeing
gdb instead of its actual parent.

There is a lot to say about the entire reparenting business or storing
the original pid in ps_oppid (instead of some form of a reference to
the process).

However, I think the most feasible fix for now is the same thing
FreeBSD did: *always* store the actual parent pid in ps_oppid. This
means all reparenting will keep updating it (most notably when
abandoning children on exit), while ptrace will skip that part.

A side effect of such a change would be that getppid will stop requiring the
kernel lock.
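
For illustration, a minimal sketch of what the syscall could reduce to
under that scheme (assuming ps_oppid is kept current on every regular
reparent and merely left untouched by ptrace; not a tested diff):

int
sys_getppid(struct proc *p, void *v, register_t *retval)
{
	/*
	 * In this scheme ps_oppid always holds the real parent's pid,
	 * so no parent pointer is chased and no kernel lock is needed.
	 */
	*retval = p->p_p->ps_oppid;
	return (0);
}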

-- 
Mateusz Guzik 



Re: Fix use of WITNESS_UNLOCK() in rw_exit_{read,write}()

2020-03-02 Thread Mateusz Guzik
Oops, sorry for the mixup below. The e-mail got bounced to me by someone
and it used their 'From' instead of the original. Regardless, the
technical content stands. :)
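
For completeness, the write side of the split sketched below would look
roughly like this (same caveats as the quoted pseudo-code: a sketch only,
rw_exit_write_impl is a made-up name mirroring the read side, and rw_cas
is assumed to return non-zero on failure as elsewhere in the file):

static void __always_inline
rw_exit_write_impl(struct rwlock *rwl, unsigned long v)
{
	unsigned long owner = v, set;

	rw_assert_wrlock(rwl);
	WITNESS_UNLOCK(&rwl->rwl_lock_obj, LOP_EXCLUSIVE);

	/*
	 * The caller already posted membar_exit_before_atomic() and
	 * passed a fresh read of the lock word in 'v', so start the
	 * cas with it instead of re-reading.
	 */
	for (;;) {
		set = 0;
		if (__predict_true(rw_cas(&rwl->rwl_owner, owner, set) == 0))
			break;
		owner = rwl->rwl_owner;
	}

	if (owner & RWLOCK_WAIT)
		wakeup(rwl);
}

void
rw_exit_write(struct rwlock *rwl)
{
	membar_exit_before_atomic();
	rw_exit_write_impl(rwl, rwl->rwl_owner);
}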

On 3/2/20, Mateusz Guzik  wrote:
> On 2/29/20, Visa Hankala  wrote:
>> There is a bug in how rw_exit_read() and rw_exit_write() use
>> WITNESS_UNLOCK(). The fast paths call WITNESS_UNLOCK() after releasing
>> the actual lock. This leads to a use-after-free in the checker if the
>> lock is dynamically allocated and another thread happens to free it
>> too early.
>>
>> The following diff splits rw_exit() into two separate functions, one
>> which is the rw_exit(9) frontend, and the other, rw_do_exit(), which is
>> common between rw_exit(9), rw_exit_read(9) and rw_exit_write(9). This
>> makes it possible to call WITNESS_UNLOCK() before the actual lock
>> release without turning the code into an #ifdef maze.
>>
>> The diff additionally does the following tweaks:
>>
>> * Invoke membar_exit_before_atomic() only once even if the slow path is
>>   taken. The initial barrier is enough for the release semantics.
>>
>> * Read rwl_owner for rw_cas() as late as possible so that the CAS
>>   operation would be less likely to fail. Note that in rw_exit() the
>>   purpose of the read is different; it determines the type of the lock
>>   (read / write).
>>
>> * Put the rw_cas() in rw_exit() (now rw_do_exit()) inside
>>   __predict_false(). This compacts the resulting machine code a bit.
>>
>> The code could be simplified by removing the fast path from
>> rw_exit_read() and rw_exit_write(), and always calling rw_do_exit().
>> However, that would reduce performance a tiny amount. I saw about 0.5%
>> increase in kernel build time in a test setup. The patch below does
>> not seem to affect performance negatively.
>>
>> OK?
>
> If the slowdown is repeatable it most likely happens because of the
> immediate re-read in the loop in the slow path.
>
> Also note performance bugs in unlock:
> - *any* read unlock wakes up everyone, so if the lock has many blocked
> waiters they will all wake up only to most likely go back to sleep. On
> the other hand this may be an anomaly which sometimes improves
> performance because in corner cases it can mitigate the lack of
> adaptive spinning
> - the loop keeps re-reading the word instead of using the value
> returned by cmpxchg. As it happens, the rw_cas macro explicitly throws
> it away.
>
> Splitting unlock paths between reader and writer cases is unavoidable
> in the long term (to allow handling different wake up policies) and I
> think this is a great opportunity to do it all the way.
>
> That said, I propose you do roughly this:
>
> static void __always_inline
> rw_exit_read_impl(struct rwlock *rwl, unsigned long v)
> {
>
> rw_assert_rdlock(rwl);
> WITNESS_UNLOCK(&rwl->rwl_lock_obj, 0);
>
> 	/* cas loop + wake up come here */
> }
>
> void
> rw_exit_read(struct rwlock *rwl)
> {
>
> membar_exit_before_atomic();
> rw_exit_read_impl(rwl, rwl->rwl_owner);
> }
>
> void
> rw_exit(struct rwlock *rwl)
> {
> unsigned long v;
>
> membar_exit_before_atomic();
> v = rwl->rwl_owner;
> if (v & RWLOCK_WRLOCK)
> rw_exit_write_impl(rwl, v);
> else
> rw_exit_read_impl(rwl, v);
> }
>
> And similarly for write.
>
> That is, the bottom routines assume the fence was posted and they got
> a "fresh" value.
>
>>
>> Index: kern/kern_rwlock.c
>> ===
>> RCS file: src/sys/kern/kern_rwlock.c,v
>> retrieving revision 1.44
>> diff -u -p -r1.44 kern_rwlock.c
>> --- kern/kern_rwlock.c   30 Nov 2019 11:19:17 -  1.44
>> +++ kern/kern_rwlock.c   29 Feb 2020 16:24:59 -
>> @@ -25,6 +25,8 @@
>>  #include 
>>  #include 
>>
>> +void	rw_do_exit(struct rwlock *, unsigned long);
>> +
>>  /* XXX - temporary measure until proc0 is properly aligned */
>>  #define RW_PROC(p) (((long)p) & ~RWLOCK_MASK)
>>
>> @@ -129,31 +131,31 @@ rw_enter_write(struct rwlock *rwl)
>>  void
>>  rw_exit_read(struct rwlock *rwl)
>>  {
>> -unsigned long owner = rwl->rwl_owner;
>> +unsigned long owner;
>>
>>  rw_assert_rdlock(rwl);
>> +WITNESS_UNLOCK(&rwl->rwl_lock_obj, 0);
>>
>>  membar_exit_before_atomic();
>> +owner = rwl->rwl_owner;
>>  if (__predict_false((owner & RWLOCK_WAIT) ||
>> rw_cas(&rwl->rwl_ow

Re: Fix use of WITNESS_UNLOCK() in rw_exit_{read,write}()

2020-03-02 Thread Mateusz Guzik
;rwl_lock_obj, LOP_EXCLUSIVE);
> + rw_do_exit(rwl, RWLOCK_WRLOCK);
>  }
>
>  #ifdef DIAGNOSTIC
> @@ -314,22 +316,29 @@ retry:
>  void
>  rw_exit(struct rwlock *rwl)
>  {
> - unsigned long owner = rwl->rwl_owner;
> - int wrlock = owner & RWLOCK_WRLOCK;
> - unsigned long set;
> + unsigned long wrlock;
>
>   /* Avoid deadlocks after panic or in DDB */
>   if (panicstr || db_active)
>   return;
>
> + wrlock = rwl->rwl_owner & RWLOCK_WRLOCK;
>   if (wrlock)
>   rw_assert_wrlock(rwl);
>   else
>   rw_assert_rdlock(rwl);
> -
>   WITNESS_UNLOCK(&rwl->rwl_lock_obj, wrlock ? LOP_EXCLUSIVE : 0);
>
>   membar_exit_before_atomic();
> + rw_do_exit(rwl, wrlock);
> +}
> +
> +/* membar_exit_before_atomic() has to precede call of this function. */
> +void
> +rw_do_exit(struct rwlock *rwl, unsigned long wrlock)
> +{
> + unsigned long owner, set;
> +
>   do {
>       owner = rwl->rwl_owner;
>   if (wrlock)
> @@ -337,7 +346,7 @@ rw_exit(struct rwlock *rwl)
>   else
>   set = (owner - RWLOCK_READ_INCR) &
>   ~(RWLOCK_WAIT|RWLOCK_WRWANT);
> - } while (rw_cas(&rwl->rwl_owner, owner, set));
> + } while (__predict_false(rw_cas(&rwl->rwl_owner, owner, set)));
>
>   if (owner & RWLOCK_WAIT)
>   wakeup(rwl);
>
>


-- 
Mateusz Guzik 



Re: Removing PF

2019-04-01 Thread Mateusz Guzik
On 4/1/19, Claudio Jeker  wrote:
> There have been internal discussions about OpenBSD also removing the pf
> packet filter after the upcoming 6.5 release. Instead a switch to
> using David Gwynne's new bpf filter will happen.
> The benefits outweigh the drawbacks and the missing features will be
> readily implemented in time for the 6.6 release.
>
> --
> :wq Claudio
>

While I support pf removal, I don't think bpf is the way to go.

FreeBSD just removed their pf [1] so the code is up for grabs and you
can import it with one weird trick.

[1] https://lists.freebsd.org/pipermail/svn-src-projects/2019-April/013336.html

-- 
Mateusz Guzik 



Re: rasops(9): revert changes in rasops32_putchar()?

2019-03-23 Thread Mateusz Guzik
e zeroing was removed
quite some time ago. I see OpenBSD still has it, but it is most likely
detrimental by now. If it is to be kept, a pagezero variant utilizing
non-temporal stores still makes sense but a different function should be
created for on-demand zeroing.

That said, it should be easy to lift a lot of this code. Could you please
benchmark with memcpy as implemented above? Not saying this particular
patch is wrong, but that the bigger problem should be addressed first.

Cheers,
-- 
Mateusz Guzik 



Re: getentropy does not explicit_bzero if copyout fails

2018-02-21 Thread Mateusz Guzik
On Wed, Feb 21, 2018 at 10:28 PM, Ted Unangst <t...@tedunangst.com> wrote:

> Mateusz Guzik wrote:
> > As the subject states. By the time the code gets to copyout, buf is
> > already populated. Clearing it only if copyout succeeds looks like a
> > braino, thus the following trivial proposal:
>
> If the secret random data is not copied out, it will not be used, and
> there's
> nothing to leak. I think this just complicates the code.
>
>
It is a nit. However, leaving the buf like this gives me an uneasy
feeling and the patch is most definitely not a complication.

That said, your OS - your call.

>
> > diff --git a/sys/dev/rnd.c b/sys/dev/rnd.c
> > index e33cb5fd7c0..fa876a950b9 100644
> > --- a/sys/dev/rnd.c
> > +++ b/sys/dev/rnd.c
> > @@ -932,9 +932,9 @@ sys_getentropy(struct proc *p, void *v, register_t
> > *retval)
> > if (SCARG(uap, nbyte) > sizeof(buf))
> > return (EIO);
> > arc4random_buf(buf, SCARG(uap, nbyte));
> > -   if ((error = copyout(buf, SCARG(uap, buf), SCARG(uap, nbyte)))
> != 0)
> > -   return (error);
> > +   error = copyout(buf, SCARG(uap, buf), SCARG(uap, nbyte));
> > explicit_bzero(buf, sizeof(buf));
> > -   retval[0] = 0;
> > -   return (0);
> > +   if (error == 0)
> > +   retval[0] = 0;
> > +   return (error);
> >  }
>

-- 
Mateusz Guzik 


getentropy does not explicit_bzero if copyout fails

2018-02-21 Thread Mateusz Guzik
As the subject states. By the time the code gets to copyout, buf is
already populated. Clearing it only if copyout succeeds looks like a
braino, thus the following trivial proposal:

diff --git a/sys/dev/rnd.c b/sys/dev/rnd.c
index e33cb5fd7c0..fa876a950b9 100644
--- a/sys/dev/rnd.c
+++ b/sys/dev/rnd.c
@@ -932,9 +932,9 @@ sys_getentropy(struct proc *p, void *v, register_t
*retval)
if (SCARG(uap, nbyte) > sizeof(buf))
return (EIO);
arc4random_buf(buf, SCARG(uap, nbyte));
-   if ((error = copyout(buf, SCARG(uap, buf), SCARG(uap, nbyte))) != 0)
-   return (error);
+   error = copyout(buf, SCARG(uap, buf), SCARG(uap, nbyte));
explicit_bzero(buf, sizeof(buf));
-   retval[0] = 0;
-   return (0);
+   if (error == 0)
+   retval[0] = 0;
+   return (error);
 }


-- 
Mateusz Guzik 


Re: C mutex impl. for x86

2018-02-15 Thread Mateusz Guzik
On Wed, Dec 20, 2017 at 12:17:27PM +0100, Martin Pieuchot wrote:
> On 15/12/17(Fri) 22:03, Mateusz Guzik wrote:
> > > +void
> > > +__mtx_enter(struct mutex *mtx)
> > > +{
> > > +#ifdef MP_LOCKDEBUG
> > > + int nticks = __mp_lock_spinout;
> > > +#endif
> > > +
> > > + while (__mtx_enter_try(mtx) == 0) {
> > > + CPU_BUSY_CYCLE();
> >
> > So this is effectively __mtx_enter_try with single pause in-between.
> >
> > > +}
> > > +
> > > +int
> > > +__mtx_enter_try(struct mutex *mtx)
> > > +{
> > > + struct cpu_info *owner, *ci = curcpu();
> > > + int s;
> > > +
> > > + if (mtx->mtx_wantipl != IPL_NONE)
> > > + s = splraise(mtx->mtx_wantipl);
> > > +
> >
> > This is at least one read.
> >
>> >> > + owner = atomic_cas_ptr(&mtx->mtx_owner, NULL, ci);
> > > +#ifdef DIAGNOSTIC
> > > + if (__predict_false(owner == ci))
> > > + panic("mtx %p: locking against myself", mtx);
> > > +#endif
> > > + if (owner == NULL) {
> > > + membar_enter_after_atomic();
> > > + if (mtx->mtx_wantipl != IPL_NONE)
> > > + mtx->mtx_oldipl = s;
> >
> > This repeats the read done earlier.
> >
> > Since the caller loops, this is effectively a very inefficient lock of
> > the form:
> > while (!atomic_cas_ptr(...))
> > CPU_BUSY_CYCLE();
> >
> > + some reads in-between
> >

So how about the patch below. It booted fine and I did a make -j 8 of
the kernel build with it. Unfortunately the vm in question is running in
too volatile an environment for any kind of performance testing atm.
Impact should be minor anyway.

Note this is still a super basic lock with giant room for improvement.
I'm not committing to any kind of work in the area, but I may contribute
other code in the future if stuff of this sort is fine with you.

diff --git a/sys/kern/kern_lock.c b/sys/kern/kern_lock.c
index 64c5a1f6319..5c14f2b6720 100644
--- a/sys/kern/kern_lock.c
+++ b/sys/kern/kern_lock.c
@@ -253,34 +253,14 @@ __mtx_init(struct mutex *mtx, int wantipl)
 }

 #ifdef MULTIPROCESSOR
-void
-__mtx_enter(struct mutex *mtx)
-{
-#ifdef MP_LOCKDEBUG
-   int nticks = __mp_lock_spinout;
-#endif
-
-   while (__mtx_enter_try(mtx) == 0) {
-   CPU_BUSY_CYCLE();
-
-#ifdef MP_LOCKDEBUG
-   if (--nticks == 0) {
-   db_printf("%s: %p lock spun out", __func__, mtx);
-   db_enter();
-   nticks = __mp_lock_spinout;
-   }
-#endif
-   }
-}
-
 int
-__mtx_enter_try(struct mutex *mtx)
+__mtx_enter_common(struct mutex *mtx, int wantipl, struct cpu_info *ci)
 {
-   struct cpu_info *owner, *ci = curcpu();
+   struct cpu_info *owner;
int s;

-   if (mtx->mtx_wantipl != IPL_NONE)
-   s = splraise(mtx->mtx_wantipl);
+   if (wantipl != IPL_NONE)
+   s = splraise(wantipl);

owner = atomic_cas_ptr(&mtx->mtx_owner, NULL, ci);
 #ifdef DIAGNOSTIC
@@ -289,7 +269,7 @@ __mtx_enter_try(struct mutex *mtx)
 #endif
if (owner == NULL) {
membar_enter_after_atomic();
-   if (mtx->mtx_wantipl != IPL_NONE)
+   if (wantipl != IPL_NONE)
mtx->mtx_oldipl = s;
 #ifdef DIAGNOSTIC
ci->ci_mutex_level++;
@@ -297,11 +277,49 @@ __mtx_enter_try(struct mutex *mtx)
return (1);
}

-   if (mtx->mtx_wantipl != IPL_NONE)
+   if (wantipl != IPL_NONE)
splx(s);

return (0);
 }
+
+void
+__mtx_enter(struct mutex *mtx)
+{
+   struct cpu_info *ci = curcpu();
+   int wantipl = mtx->mtx_wantipl;
+#ifdef MP_LOCKDEBUG
+   int nticks = __mp_lock_spinout;
+#endif
+
+   for (;;) {
+   if (__mtx_enter_common(mtx, wantipl, ci))
+   break;
+
+   for (;;) {
+#ifdef MP_LOCKDEBUG
+   if (--nticks == 0) {
+   db_printf("%s: %p lock spun out", __func__,
mtx);
+   db_enter();
+   nticks = __mp_lock_spinout;
+       }
+#endif
+   CPU_BUSY_CYCLE();
+   if (mtx->mtx_owner == NULL)
+   break;
+   }
+   }
+}
+
+int
+__mtx_enter_try(struct mutex *mtx)
+{
+   struct cpu_info *ci = curcpu();
+   int wantipl = mtx->mtx_wantipl;
+
+   return (__mtx_enter_common(mtx, wantipl, ci));
+}
+
 #else
 void
 __mtx_enter(struct mutex *mtx)

-- 
Mateusz Guzik 


Re: C mutex impl. for x86

2017-12-15 Thread Mateusz Guzik
On Thu, Dec 14, 2017 at 04:06:41PM +0100, Martin Pieuchot wrote:
> Diff below moves amd64 and i386 mutex to the common C implementation.
>
> The differences are:
>   - membar_enter_after_atomic(9) instead of membar_enter(9), and
>   - membar_exit_before_atomic(9) instead of membar_exit(9)
>
> I'd appreciate any performance test to know if the performance
> degradation is acceptable with these barriers.
>

These are only compiler barriers and there should be effectively no
difference.

However, contended behaviour is a regression compared to the asm
variant.

From what I gather this is a step towards unifying all mutex
implementations, hence the membar_* use.

> +void
> +__mtx_enter(struct mutex *mtx)
> +{
> +#ifdef MP_LOCKDEBUG
> + int nticks = __mp_lock_spinout;
> +#endif
> +
> + while (__mtx_enter_try(mtx) == 0) {
> + CPU_BUSY_CYCLE();

So this is effectively __mtx_enter_try with single pause in-between.

> +}
> +
> +int
> +__mtx_enter_try(struct mutex *mtx)
> +{
> + struct cpu_info *owner, *ci = curcpu();
> + int s;
> +
> + if (mtx->mtx_wantipl != IPL_NONE)
> + s = splraise(mtx->mtx_wantipl);
> +

This is at least one read.

> + owner = atomic_cas_ptr(&mtx->mtx_owner, NULL, ci);
> +#ifdef DIAGNOSTIC
> + if (__predict_false(owner == ci))
> + panic("mtx %p: locking against myself", mtx);
> +#endif
> + if (owner == NULL) {
> + membar_enter_after_atomic();
> + if (mtx->mtx_wantipl != IPL_NONE)
> + mtx->mtx_oldipl = s;

This repeats the read done earlier.

Since the caller loops, this is effectively a very inefficient lock of
the form:
while (!atomic_cas_ptr(...))
CPU_BUSY_CYCLE();

+ some reads in-between

Assembly code would spin waiting for the lock to become free before
playing with spl level and attempting to lock. I don't know how
contended your mutexes are right now, but this will have to be very
quickly changed at least to the current asm behaviour as more of the kernel
gets unlocked. Going for ticket locks or backoff later will probably
work fine enough for the foreseeable future and will postpone the need
to invest into anything fancy.
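
For reference, even the simplest backoff on top of such a spin loop is
just a few lines (illustrative only, constants picked arbitrarily):

	int spins = 1, i;

	while (mtx->mtx_owner != NULL) {
		for (i = 0; i < spins; i++)
			CPU_BUSY_CYCLE();
		/* exponential backoff with an arbitrary cap */
		if (spins < 1024)
			spins <<= 1;
	}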

That said, the proposed restructure is as follows (pseudo-code, no debug):

void
__mtx_enter(struct mutex *mtx)
{
	struct cpu_info *owner, *ci = curcpu();
	int want_ipl, s;

	/* mark mtx_wantipl as volatile or add a read cast through a
	 * volatile pointer to force a read *here* and no re-reads later */
	want_ipl = mtx->mtx_wantipl;

	for (;;) {
		if (want_ipl != IPL_NONE)
			s = splraise(want_ipl);
		owner = atomic_cas_ptr(&mtx->mtx_owner, NULL, ci);
		if (owner == NULL) {
			membar_enter_after_atomic();
			if (want_ipl != IPL_NONE)
				mtx->mtx_oldipl = s;
			break;
		}

		if (want_ipl != IPL_NONE)
			splx(s);

		do {
			CPU_BUSY_CYCLE();
		} while (mtx->mtx_owner != NULL);
	}
}

I don't think partial duplication of try_enter can be avoided without
serious ugliness.

> +void
> +__mtx_leave(struct mutex *mtx)
> +{
> + int s;
> +
> + MUTEX_ASSERT_LOCKED(mtx);
> +
> +#ifdef DIAGNOSTIC
> + curcpu()->ci_mutex_level--;
> +#endif
> +
> + s = mtx->mtx_oldipl;
> +#ifdef MULTIPROCESSOR
> + membar_exit_before_atomic();
> +#endif
> + mtx->mtx_owner = NULL;
> + if (mtx->mtx_wantipl != IPL_NONE)
> + splx(s);
> +}

It should be somewhat faster to read mtx_wantipl while the lock is still
held - it is less likely that someone else dirtied the cacheline. Then
mtx_oldipl can be only conditionally read.

This function does not do atomic ops, so membar_exit_before_atomic does
not really fit, even though it happens to do the exact same thing the
"correctly" named primitive would - just provide a compiler barrier on
amd64/i386.

From what I gather you are trying to mimic illumos nomenclature, but
they don't have an equivalent afaics. (Perhaps Solaris grew one in the
meantime?)

In FreeBSD an appropriate routine is named atomic_thread_fence_rel (see
amd64/include/atomic.h) and I suggest just borrowing the api.
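
To make the intent concrete, with such a primitive the leave path would
read something like this (a sketch only; the fence name is borrowed from
FreeBSD and assumed to be added under whatever spelling is preferred):

void
__mtx_leave(struct mutex *mtx)
{
	int s;

	MUTEX_ASSERT_LOCKED(mtx);

#ifdef DIAGNOSTIC
	curcpu()->ci_mutex_level--;
#endif

	s = mtx->mtx_oldipl;
#ifdef MULTIPROCESSOR
	/* release fence for the plain store below, not tied to an atomic op */
	atomic_thread_fence_rel();
#endif
	mtx->mtx_owner = NULL;
	if (mtx->mtx_wantipl != IPL_NONE)
		splx(s);
}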

A side note is that you probably can shrink the ipl vars to make the lock
smaller. It may be doable to fit everything into the lock word, but I
don't know how much sense playing with it would make.

-- 
Mateusz Guzik 


Re: rw locks vs memory barriers and adaptive spinning

2017-10-11 Thread Mateusz Guzik
On Wed, Oct 11, 2017 at 03:15:42PM +0200, Martin Pieuchot wrote:
> On 10/10/17(Tue) 20:19, Mateusz Guzik wrote:
> > On Tue, Oct 10, 2017 at 10:15:48AM +0200, Martin Pieuchot wrote:
> > > Hello Mateusz,
> > >
> > > On 09/10/17(Mon) 21:43, Mateusz Guzik wrote:
> > > > I was looking at rw lock code out of curiosity and noticed you always do
> > > > membar_enter which on MP-enabled amd64 kernel translates to mfence.
> > > > This makes the entire business a little bit slower.
> > > >
> > > > Interestingly you already have relevant macros for amd64:
> > > > #define membar_enter_after_atomic() __membar("")
> > > > #define membar_exit_before_atomic() __membar("")
> > >
> > > If you look at the history, you'll see that these macros didn't exist
> > > when membar_* where added to rw locks.
> > >
> > 
> > I know. Addition of such a macro sounds like an excellent opportunity to
> > examine existing users. I figured rw locks were omitted by accident
> > as the original posting explicitly mentions mutexes.
> > 
> > https://marc.info/?l=openbsd-tech&m=149555411827749&w=2
> > 
> > > > And there is even a default variant for archs which don't provide one.
> > > > I guess the switch should be easy.
> > >
> > > Then please do it :)  A lot of stuff is easy on OpenBSD but we are not
> > > enough.
> > >
> > > Do it, test it, explain it, get oks and commit it 8)
> > >
> > 
> > As noted earlier the patch is trivial. I have no easy means to even
> > compile-test it though as I'm not using OpenBSD. Only had a look out of
> > curiosity.
> 
> Why don't you start using it?  Testing is what makes the difference.
> Many of us have ideas of how to improve the kernel but we're lacking
> the time to test & debug them all.
> 

I'm playing with concurrency as a hobby and SMP-ifying single-threaded
code is above the pain threshold I'm willing to accept in my spare time.

> Sometimes a "correct" change exposes a bug and we try not to break
> -current, so we cannot afford to commit a non-tested diff.
> 

I definitely don't advocate for just committing stuff. I do state though
that there should be no binary change for archs other than i386 and
amd64 (== nothing to test there). Given the nature of the change for
these two it really seemed it just fell through the cracks earlier and I
just wanted to point it out as perhaps someone will find cycles to pick
it up and test on relevant platforms.

> > Nonetheless, here is the diff which can be split in 2. One part deals
> > with barriers, another one cleans up rw_status.
> >
> > [...]
> >
> > Read from the lock only once in rw_status. The lock word is marked as
> > volatile which forces re-read on each access. There is no correctness
> > issue here, but the assembly is worse than it needs to be and in case
> > of contention this adds avoidable cache traffic.
> 
> Here's the diff to cleans rw_status.  The difference on amd64 with
> clang 4.0  is the following:
> 
>   48 8b 0f                mov    (%rdi),%rcx
>   f6 c1 04                test   $0x4,%cl
>   75 0c                   jne    738 <rrw_status+0x18>
>   31 c0                   xor    %eax,%eax
> - 48 83 3f 00             cmpq   $0x0,(%rdi)
> + 48 85 c9                test   %rcx,%rcx
> 
> ok?


There should be another change past rrw_status+0x18:
812557e8:   65 48 8b 04 25 08 00    mov    %gs:0x8,%rax
812557ef:   00 00
812557f1:   48 8b 80 90 02 00 00    mov    0x290(%rax),%rax
812557f8:   48 33 07                xor    (%rdi),%rax

> 
> Index: kern/kern_rwlock.c
> ===
> RCS file: /cvs/src/sys/kern/kern_rwlock.c,v
> retrieving revision 1.30
> diff -u -p -r1.30 kern_rwlock.c
> --- kern/kern_rwlock.c12 Aug 2017 23:27:44 -  1.30
> +++ kern/kern_rwlock.c11 Oct 2017 12:59:19 -
> @@ -312,13 +312,15 @@ _rw_exit(struct rwlock *rwl LOCK_FL_VARS
>  int
>  rw_status(struct rwlock *rwl)
>  {
> - if (rwl->rwl_owner & RWLOCK_WRLOCK) {
> - if (RW_PROC(curproc) == RW_PROC(rwl->rwl_owner))
> + unsigned long owner = rwl->rwl_owner;
> +
> + if (owner & RWLOCK_WRLOCK) {
> + if (RW_PROC(curproc) == RW_PROC(owner))
>   return RW_WRITE;
>   else
>   return RW_WRITE_OTHER;
>   }
> - if (rwl->rwl_owner)
> + if (owner)
>   return RW_READ;
>   return (0);
>  }

-- 
Mateusz Guzik



Re: rw locks vs memory barriers and adaptive spinning

2017-10-10 Thread Mateusz Guzik
On Tue, Oct 10, 2017 at 10:15:48AM +0200, Martin Pieuchot wrote:
> Hello Mateusz,
>
> On 09/10/17(Mon) 21:43, Mateusz Guzik wrote:
> > I was looking at rw lock code out of curiosity and noticed you always do
> > membar_enter which on MP-enabled amd64 kernel translates to mfence.
> > This makes the entire business a little bit slower.
> >
> > Interestingly you already have relevant macros for amd64:
> > #define membar_enter_after_atomic() __membar("")
> > #define membar_exit_before_atomic() __membar("")
>
> If you look at the history, you'll see that these macros didn't exist
> when membar_* where added to rw locks.
>

I know. Addition of such a macro sounds like an excellent opportunity to
examine existing users. I figured rw locks were omitted by accident
as the original posting explicitly mentions mutexes.

https://marc.info/?l=openbsd-tech&m=149555411827749&w=2

> > And there is even a default variant for archs which don't provide one.
> > I guess the switch should be easy.
>
> Then please do it :)  A lot of stuff is easy on OpenBSD but we are not
> enough.
>
> Do it, test it, explain it, get oks and commit it 8)
>

As noted earlier the patch is trivial. I have no easy means to even
compile-test it though as I'm not using OpenBSD. Only had a look out of
curiosity.

Nonetheless, here is the diff which can be split in 2. One part deals
with barriers, another one cleans up rw_status.

This should not result in a binary change on any arch apart from i386 and
amd64. On these two, if there is breakage, it's a braino of some sort.
The change in principle is definitely fine.

So:

Utilize membar_*{before,after}_atomic in rw locks.

On architectures whose atomic operations provide relevant barriers there
is no need for additional membar. In particular this is the case on
amd64 and i386.

Read from the lock only once in rw_status. The lock word is marked as
volatile which forces re-read on each access. There is no correctness
issue here, but the assembly is worse than it needs to be and in case
of contention this adds avoidable cache traffic.

diff --git a/sys/kern/kern_rwlock.c b/sys/kern/kern_rwlock.c
index e2c3cae..5551835 100644
--- a/sys/kern/kern_rwlock.c
+++ b/sys/kern/kern_rwlock.c
@@ -96,7 +96,7 @@ _rw_enter_read(struct rwlock *rwl LOCK_FL_VARS)
rw_cas(&rwl->rwl_owner, owner, owner + RWLOCK_READ_INCR)))
_rw_enter(rwl, RW_READ LOCK_FL_ARGS);
else {
-   membar_enter();
+   membar_enter_after_atomic();
WITNESS_CHECKORDER(&rwl->rwl_lock_obj, LOP_NEWORDER, file,
line,
NULL);
WITNESS_LOCK(&rwl->rwl_lock_obj, 0, file, line);
@@ -112,7 +112,7 @@ _rw_enter_write(struct rwlock *rwl LOCK_FL_VARS)
RW_PROC(p) | RWLOCK_WRLOCK)))
_rw_enter(rwl, RW_WRITE LOCK_FL_ARGS);
else {
-   membar_enter();
+   membar_enter_after_atomic();
WITNESS_CHECKORDER(&rwl->rwl_lock_obj,
LOP_EXCLUSIVE | LOP_NEWORDER, file, line, NULL);
WITNESS_LOCK(&rwl->rwl_lock_obj, LOP_EXCLUSIVE, file, line);
@@ -126,7 +126,7 @@ _rw_exit_read(struct rwlock *rwl LOCK_FL_VARS)

rw_assert_rdlock(rwl);

-   membar_exit();
+   membar_exit_before_atomic();
if (__predict_false((owner & RWLOCK_WAIT) ||
rw_cas(&rwl->rwl_owner, owner, owner - RWLOCK_READ_INCR)))
_rw_exit(rwl LOCK_FL_ARGS);
@@ -141,7 +141,7 @@ _rw_exit_write(struct rwlock *rwl LOCK_FL_VARS)

rw_assert_wrlock(rwl);

-   membar_exit();
+   membar_exit_before_atomic();
if (__predict_false((owner & RWLOCK_WAIT) ||
rw_cas(&rwl->rwl_owner, owner, 0)))
_rw_exit(rwl LOCK_FL_ARGS);
@@ -261,7 +261,7 @@ retry:

if (__predict_false(rw_cas(&rwl->rwl_owner, o, o + inc)))
goto retry;
-   membar_enter();
+   membar_enter_after_atomic();

/*
 * If old lock had RWLOCK_WAIT and RWLOCK_WRLOCK set, it means we
@@ -295,7 +295,7 @@ _rw_exit(struct rwlock *rwl LOCK_FL_VARS)
WITNESS_UNLOCK(&rwl->rwl_lock_obj, wrlock ? LOP_EXCLUSIVE : 0,
file, line);

-   membar_exit();
+   membar_exit_before_atomic();
do {
owner = rwl->rwl_owner;
if (wrlock)
@@ -312,13 +312,15 @@ _rw_exit(struct rwlock *rwl LOCK_FL_VARS)
 int
 rw_status(struct rwlock *rwl)
 {
-   if (rwl->rwl_owner & RWLOCK_WRLOCK) {
-   if (RW_PROC(curproc) == RW_PROC(rwl->rwl_owner))
+   unsigned long owner = rwl->rwl_owner;
+
+   if (owner & RWLOCK_WRLOCK) {
+   if (RW_PROC(curproc) == RW_PROC(owner))
return RW_WRITE;
else
        return RW_WRITE_OTHER;
}
-   if (rwl->rwl_owner)
+   if (owner)
return RW_READ;
return (0);
 }

-- 
Mateusz Guzik 


rw locks vs memory barriers and adaptive spinning

2017-10-09 Thread Mateusz Guzik
Hello,

I was looking at rw lock code out of curiosity and noticed you always do
membar_enter which on MP-enabled amd64 kernel translates to mfence.
This makes the entire business a little bit slower.

Interestingly you already have relevant macros for amd64:
#define membar_enter_after_atomic() __membar("")
#define membar_exit_before_atomic() __membar("")

And there is even a default variant for archs which don't provide one.
I guess the switch should be easy.

Grabbing for reading is an rw_cas to a higher value. On failure you
explicitly re-read from the lock. This is slower than necessary in the
presence of concurrent read lock/unlock since cas returns the found
value and you can use that instead.
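
Concretely, the fast path could retry with the value the cas itself saw,
along these lines (a sketch; rw_cas_ret() is a made-up variant that
returns the observed word instead of throwing it away):

	unsigned long owner, seen;

	owner = rwl->rwl_owner;
	while ((owner & (RWLOCK_WRLOCK | RWLOCK_WRWANT)) == 0) {
		seen = rw_cas_ret(&rwl->rwl_owner, owner,
		    owner + RWLOCK_READ_INCR);
		if (seen == owner)
			return;			/* got the read lock */
		owner = seen;			/* no extra re-read needed */
	}
	/* writer active or wanted: take the regular slow path */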

Also the read lock fast path does not have to descend to the slow path
after a single cas failure, but this probably does not matter right now.

The actual question I have here is if you played with adaptive spinning
instead of instantly putting threads to sleep at least for cases when
the kernel lock is not held. This can be as simplistic as just spinning
as long as the lock is owned by a running thread.
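
A sketch of that minimal policy (illustrative; proc_on_cpu() is a made-up
helper standing in for "is the owner currently running"):

	unsigned long owner;

	while ((owner = rwl->rwl_owner) != 0) {
		struct proc *op = (struct proc *)RW_PROC(owner);

		/* only spin on a write-owned lock whose owner is on a cpu */
		if ((owner & RWLOCK_WRLOCK) == 0 || !proc_on_cpu(op))
			break;			/* go to sleep instead */
		CPU_BUSY_CYCLE();
	}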

For cases where curproc holds the kernel lock you can perhaps drop it
for spinning purposes and reacquire later. Although I have no idea if
this one is going to help anything. Definitely worth testing imo.

A side note is that the global locks I found are not annotated in any
manner with respect to exclusivity of cacheline placement. In particular
netlock in a 6.2 kernel shares its cacheline with if_input_task_locked:


81aca608 D if_input_task_locked
81aca630 D netlock


The standard boilerplate to deal with it is to annotate with aligned()
and place the variable in a dedicated section. FreeBSD and NetBSD
contain the necessary crappery to copy-paste including linker script
support.
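
For illustration, the usual shape is something like the following (the
macro spelling and the 64-byte line size are assumptions, not what any
particular tree uses):

	/* keep the lock alone on its cache line */
	struct rwlock netlock __aligned(64);

plus a linker script rule placing such variables in their own section so
nothing else gets packed next to them.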

Cheers,
-- 
Mateusz Guzik 


Re: sendsyslog file race

2017-03-27 Thread Mateusz Guzik
On Mon, Mar 27, 2017 at 6:09 PM, Alexander Bluhm <alexander.bl...@gmx.net>
wrote:

> On Mon, Mar 27, 2017 at 05:39:27PM +0200, Mateusz Guzik wrote:
> > The previous patch replaced multiple reads of the global var with just
> > one read and had the result stored in a local variable, which then is
> > read multiple times. Even though the compiler ended up emitting one read
> > of the global, it is perfectly legal to emit several, thus the bug can still
> > reappear. This can be prevented as described above.
>
> It is not a problem when the compiler creates multiple read
> instructions for syslogf as long no function call is between them.
> Nothing else can change the content of syslogf in parallel, we are
> running with the big kernel lock.
>
> It would be a problem if the compiler would do a read from the
> global syslogf variable before and after a sleep.  But that would
> be illegal in C.


Agreed. I somehow brainfarted myself into not realising it really can't
reload the var across a func call, which would be needed to give the
cpu to someone else.

Sorry for the noise in that bit.

-- 
Mateusz Guzik 


Re: sendsyslog file race

2017-03-27 Thread Mateusz Guzik
On Sun, Mar 26, 2017 at 10:04 PM, Alexander Bluhm <alexander.bl...@gmx.net>
wrote:

> On Sun, Mar 26, 2017 at 05:00:12PM +0200, Mateusz Guzik wrote:
> > The patch is somewhat incorrect, although from what I checked it happens
> > to generate the expected outcome in terms of assembly (the global pointer
> > read only once and then a local copy accessed several times). You either
> > need a linux-like READ_ONCE macros which casts through a volatile pointer
> > or mark the global as volatile to prevent the compiler from issuing
> > multiple reads in the future.
>
> Note that our OpenBSD kernel is still implemented with a big kernel
> lock in most places.  So multi processor thinking and locking is
> not necessary.  The execution order can only be interrupted by
> hardware interrups or explicitly rescheduling with tsleep(9).  Here
> the sleep is hidden in copyin(), mallocarray(), sosend(), malloc(),
> ktrgenio().
>
> Interrupts are not relevant, they never change syslogf.  As tsleep()
> is a function call, it automatically acts as compiler barrier.  So
> volatile is not needed.
>
>
The previous patch replaced multiple reads of the global var with just
one read and had the result stored in a local variable, which then is
read multiple times. Even though the compiler ended up emitting one read
of the global, it is perfectly legal to emit several, thus the bug can still
reappear. This can be prevented as described above.


> > Remaining uses of syslogf are also super fishy:
> > 1. logclose
> > 2. logioctl -> LIOCSFD
> > FRELE(syslogf, p);
>
> A few months ago I would have said FRELE() does not sleep.  I think
> this is currently still the case as we do not grab the netlock for
> socketpairs.  But we did for a while.
>
> As we are heading towards multi processor in kernel, I would suggest
> this diff.  It orders FREF() and FRELE() in a way that we only
> operate on refcounted global variables.  Although not necessary
> with the current sleeping points, I think it results in more robust
> code.


With the aforementioned caveat ignored, the patch indeed fixes the problem.

I skimmed through and found a few *sleep calls, but perhaps they cannot
end up being executed for the type of socket accepted here.

-- 
Mateusz Guzik 


Re: sendsyslog file race

2017-03-26 Thread Mateusz Guzik
On Fri, Mar 24, 2017 at 4:56 PM, Alexander Bluhm 
wrote:

> Hi,
>
> There is a race in dosendsyslog() which resulted in a crash on a
> 5.9 system.  sosend(syslogf->f_data, ...) was called with a NULL
> pointer.  So syslogf is not NULL, f_data is NULL and f_count is 1.
>
> The file structure is ref counted, but the global variable syslogf
> is not protected.  So it may change during sleep and dosendsyslog()
> possibly uses a different socket at each access.
>
> Solution is to access syslogf ony once, use a local copy, and do
> the ref counting there.
>

I'm sending this from the gmail web interface, so formatting may be screwed
up. Apologies in advance.

The patch is somewhat incorrect, although from what I checked it happens
to generate the expected outcome in terms of assembly (the global pointer
read only once and then a local copy accessed several times). You either
need a linux-like READ_ONCE macro which casts through a volatile pointer
or mark the global as volatile to prevent the compiler from issuing
multiple reads in the future.

Remaining uses of syslogf are also super fishy:

1. logclose

if (syslogf)
FRELE(syslogf, p);
syslogf = NULL;

If this results in fdrop I presume it can block. But if that happens, the
global points to a defunct object.

2. logioctl -> LIOCSFD

if (syslogf)
FRELE(syslogf, p);
syslogf = fp;

This one will give a double free and ref leaks.

The suggested fix is to xchg the pointer. While it is more expensive than
necessary for this case, it does not pose a problem. Also xchg is
blatantly obvious, while manual replacement would require explicit memory
barriers to be placed here.
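
For the LIOCSFD case the idea would look roughly like this (a sketch
only; atomic_swap_ptr or an equivalent xchg wrapper is assumed):

	struct file *oldfp;

	/*
	 * Publish the new socket first, then drop the old reference, so
	 * the global never points at a file whose FRELE() may be blocking.
	 */
	oldfp = atomic_swap_ptr(&syslogf, fp);
	if (oldfp != NULL)
		FRELE(oldfp, p);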


Re: "user" chroot (patch)

2017-03-11 Thread Mateusz Guzik
On Sat, Mar 11, 2017 at 05:02:14AM +, Kristaps Dzonsons wrote:
> In running risky non-root applications, it'd be great to chroot(2)
> without needing to be root.  But the manpage says no.  So I added a
> system call, uchroot(2), that does the following:
> 
>  (1) performs the change-root w/o checking for root
>  (2) disables setuid (flag is inherited)
> 
> The (2) plugs the hole wherein a non-root normal chroot could allow for
> privilege escalation by a cooked master.passwd and setuid.
> 
> Enclosed is a patch.  (Without the bits generated from make syscalls: I
> don't know whether that's preferred.  Also doesn't include the libc
> parts.)  It can't be this easy, can it?  What did I miss?
> 

fd tables can be shared between processes (see FORK_SHAREFILES). So in
particular if one process uchroots and another one performs execve, the
execve is done against the chroot while "ignore suid" is not set.

While this bit is easily fixable I think this is too hairy in general
and the feature is unnecessary for the intended purpose.

> My initial motivation is CGI scripts, which can and do pledge, yes, but
> I'd like to limit *path-pledged processes to a directory so scripts
> can't read everything else in /var/www.  For example, within /var/www,
> to (say) /vhosts/foo.com.
> 

I think the way to go is "will only use lookups against a file
descriptor" pledge. Nice added value is that programs can now use
different places in the tree with directory granulalrity as opposed to
having to chroot to the common parent.

This poses a problem with confining ".." lookups.  There is a hack in
FreeBSD to explicitly track them, but perhaps you will be fine with
disallowing ".."s in the first place.
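
A sketch of the simpler option, rejecting dot-dot outright when such a
pledge is in effect (the flag name and the exact placement in the lookup
path are made up for illustration):

	/* hypothetical check in the component lookup loop */
	if ((p->p_p->ps_pledge & PLEDGE_FDPATHONLY) &&
	    cnp->cn_namelen == 2 &&
	    cnp->cn_nameptr[0] == '.' && cnp->cn_nameptr[1] == '.')
		return (EACCES);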

-- 
Mateusz Guzik 



Re: NET_LOCK() pr_sysctl

2017-01-16 Thread Mateusz Guzik
On Mon, Jan 16, 2017 at 07:34:43PM +, Alexander Bluhm wrote:
> On Mon, Jan 09, 2017 at 11:55:55PM +0100, Alexander Bluhm wrote:
> > On Thu, Dec 22, 2016 at 01:38:17AM +0100, Mateusz Guzik wrote:
> > > In this particular case, what happens if the access results in a page
> > > fault and the area comes from a nfs mapped file? If network i/o is done
> > > from the same context, this should result in 'locking against myself'
> > > assertion failure.
> > 
> > I have written a program the sets a sysctl value from a memory
> > mapped file mounted on NFS.  As expected it panics when NET_LOCK()
> > is enabled.
> 
> I was wondering why my test program did only crash during copyin
> but never for copyout.  For sysctl(2) the copyout does never sleep
> as the memory is wired.  This is done to avoid races when collecting
> data for the oldp.

Once more, I'm not familiar with the codebase, only skimmed through it, so
apologies for mistakes (if any).

I think the current code is buggy and the approach needs to be adjusted
to work.

On a more SMP-ified kernel a concurrently running thread can munlock the
area or mmap something over it. Then vsunlock executed over it will
cause trouble. The same checks used during munlock have to be repeated.

Even the current kernel has the problem - since NET_LOCK is a sleepable
lock, the caller can be put off the cpu and the newly scheduled thread
can do the aforementioned shenanigans. Likely there is a similar story for
other sysctl handlers.

With the spurious vsunlock taken care of, we are back to recursive
locking due to the area not being faulted in.

If you want to stick to holding NET_LOCK over copyin/copyout, I suggest
introducing _nofault variants - that is, instead of faulting the page
in, just return an error. The kernel did what it should by wiring the
appropriate pages. If you play games with changing the same mappings, the
failure is on you. I.e. it will not affect legitimate programs.

However, this slows down the general sysctl interface for a corner case.
At the very least, a dedicated helper for wiring pages can be introduced
and it would be executed by sysctls interested in the facility.
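
As an illustration of the opt-in shape (sysctl_wire_buffers,
copyin_nofault and friends are hypothetical names for the helpers
described above):

	error = sysctl_wire_buffers(p, SCARG(uap, new), SCARG(uap, newlen));
	if (error)
		return (error);
	NET_LOCK();
	/* refuses to fault and thus to sleep while the net lock is held */
	error = copyin_nofault(SCARG(uap, new), &val, sizeof(val));
	NET_UNLOCK();
	sysctl_unwire_buffers(p, SCARG(uap, new), SCARG(uap, newlen));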


> 
> If I implement the same trick for newp, I can avoid the "netlock
> locking against myself" with sysctl on memory mapped files over
> NFS.  Of course other copyin/copyout paths like pf(4) ioctl(2) still
> have to be checked.  IPsec pfkey seem to use the sysctl mechanism.
> 
> Note that the variable dolock was always 1, so I removed it.
> 
> ok?
> 
> bluhm
> 
> Index: kern/kern_sysctl.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/kern/kern_sysctl.c,v
> retrieving revision 1.320
> diff -u -p -r1.320 kern_sysctl.c
> --- kern/kern_sysctl.c11 Nov 2016 18:59:09 -  1.320
> +++ kern/kern_sysctl.c16 Jan 2017 18:49:20 -
> @@ -157,7 +157,7 @@ sys_sysctl(struct proc *p, void *v, regi
>   syscallarg(void *) new;
>   syscallarg(size_t) newlen;
>   } */ *uap = v;
> - int error, dolock = 1;
> + int error;
>   size_t savelen = 0, oldlen = 0;
>   sysctlfn *fn;
>   int name[CTL_MAXNAME];
> @@ -219,30 +219,41 @@ sys_sysctl(struct proc *p, void *v, regi
>   if (SCARG(uap, oldlenp) &&
>   (error = copyin(SCARG(uap, oldlenp), &oldlen, sizeof(oldlen))))
>   return (error);
> - if (SCARG(uap, old) != NULL) {
> + if (SCARG(uap, old) != NULL || SCARG(uap, new) != NULL) {
>   if ((error = rw_enter(_lock, RW_WRITE|RW_INTR)) != 0)
>   return (error);
> - if (dolock) {
> - if (atop(oldlen) > uvmexp.wiredmax - uvmexp.wired) {
> - rw_exit_write(_lock);
> - return (ENOMEM);
> - }
> - error = uvm_vslock(p, SCARG(uap, old), oldlen,
> - PROT_READ | PROT_WRITE);
> - if (error) {
> - rw_exit_write(_lock);
> - return (error);
> - }
> + }
> + if (SCARG(uap, old) != NULL) {
> + if (atop(oldlen) > uvmexp.wiredmax - uvmexp.wired) {
> + error = ENOMEM;
> + goto unlock;
>   }
> + error = uvm_vslock(p, SCARG(uap, old), oldlen,
> + PROT_READ | PROT_WRITE);
> + if (error)
> + goto unlock;
>   savelen = oldlen;
>   }
> + if (SCARG(uap, new) != NULL) {
> + if (atop(SCARG(uap, newlen)) > uvmexp.wiredmax - uvmexp.wired) {
> + erro

Re: NET_LOCK() pr_sysctl

2016-12-21 Thread Mateusz Guzik
On Tue, Dec 20, 2016 at 05:37:20PM +, Alexander Bluhm wrote:
> Obviosly a NET_LOCK() is missing in tcp_sysctl().
> 
> I think it is better to place the lock into net_sysctl() where all
> the protocol sysctls are called via pr_sysctl.  Then we don't have
> to decide each case individually.  As calling sysctl(2) is in the
> slow path, doing fine grained locking has no benefit.  Many sysctl
> cases copy out a struct.  Having a lock around that keeps the struct
> consistent.
> 

Holding locks across copyout/copyin is always fishy.

In this particular case, what happens if the access results in a page
fault and the area comes from an nfs-mapped file? If network i/o is done
from the same context, this should result in a 'locking against myself'
assertion failure.

That said, I'm not exactly familiar with the area, so maybe that's a
false alarm.

-- 
Mateusz Guzik