Re: help X11 performance: make sigprocmask(2) SY_NOLOCK
On 20/06/13(Thu) 15:18, Martin Pelikan wrote: And p_sigmask is copied during fork, so that needs a bit of thought. I guess it doesn't matter for the new child, as it isn't running yet and therefore can't invoke sigprocmask(2). And the parent should be safe as well, as it is in fork(2) and therefore can't call sigprocmask(2) either. memcpy(p_startcopy) which is, tadaaa, p_sigmask! From implementations of memcpy I've seen (usage of SSE or non-temporal instructions to begin with) it seems that no guarantee about how big a chunk does it copy atomically can be assumed... So a multithreaded process doing sigprocmask(2) in one thread and simultaneously fork(2) in the other? Not a problem either, because 1003.1 2004 edition says that sigprocmask(2) is equivalent to pthread_sigmask(3) and the fork only copies the other thread, not bothering with the sigprocmask-ing one. I haven't checked the code that thoroughly but this is what POSIX says. There's also a couple of additional p-p_sigmask |= ... in kern_sig.c. All of these need to become atomic_setbits_int/atomic_clearbits_int calls. Does the third one do the charm? I haven't compiled the other arch's, but i386 is running fine. So what's going on with this diff? fwiw I tried it on macppc and amd64. I'm also wondering, don't you think ptsignal() needs to be modified as well? Because it might access p_sigmask various times, so are you sure your diff won't introduce any side effect if the mask is modified in between? Or is it simply impossible? -- Martin Pelikan Index: arch/hp300/hp300/trap.c === RCS file: /cvs/src/sys/arch/hp300/hp300/trap.c,v retrieving revision 1.62 diff -u -p -r1.62 trap.c --- arch/hp300/hp300/trap.c 31 Dec 2012 06:46:13 - 1.62 +++ arch/hp300/hp300/trap.c 20 Jun 2013 12:51:46 - @@ -321,7 +321,7 @@ dopanic: i = sigmask(SIGILL); p-p_sigacts-ps_sigignore = ~i; p-p_sigacts-ps_sigcatch = ~i; - p-p_sigmask = ~i; + atomic_clearbits_int(p-p_sigmask, i); i = SIGILL; ucode = frame.f_format; /* XXX was ILL_RESAD_FAULT */ typ = ILL_COPROC; Index: arch/mvme68k/mvme68k/trap.c === RCS file: /cvs/src/sys/arch/mvme68k/mvme68k/trap.c,v retrieving revision 1.77 diff -u -p -r1.77 trap.c --- arch/mvme68k/mvme68k/trap.c 31 Dec 2012 06:46:13 - 1.77 +++ arch/mvme68k/mvme68k/trap.c 20 Jun 2013 12:51:47 - @@ -260,7 +260,7 @@ copyfault: i = sigmask(SIGILL); p-p_sigacts-ps_sigignore = ~i; p-p_sigacts-ps_sigcatch = ~i; - p-p_sigmask = ~i; + atomic_clearbits_int(p-p_sigmask, i); i = SIGILL; ucode = frame.f_format; /* XXX was ILL_RESAD_FAULT */ typ = ILL_COPROC; Index: compat/linux/linux_signal.c === RCS file: /cvs/src/sys/compat/linux/linux_signal.c,v retrieving revision 1.15 diff -u -p -r1.15 linux_signal.c --- compat/linux/linux_signal.c 19 Jun 2012 11:35:29 - 1.15 +++ compat/linux/linux_signal.c 20 Jun 2013 12:51:49 - @@ -584,7 +584,6 @@ linux_sys_sigprocmask(p, v, retval) linux_old_sigset_t ss; sigset_t bs; int error = 0; - int s; *retval = 0; @@ -604,19 +603,18 @@ linux_sys_sigprocmask(p, v, retval) linux_old_to_bsd_sigset(ss, bs); - s = splhigh(); - + bs = ~sigcantmask; switch (SCARG(uap, how)) { case LINUX_SIG_BLOCK: - p-p_sigmask |= bs ~sigcantmask; + atomic_setbits_int(p-p_sigmask, bs); break; case LINUX_SIG_UNBLOCK: - p-p_sigmask = ~bs; + atomic_clearbits_int(p-p_sigmask, bs); break; case LINUX_SIG_SETMASK: - p-p_sigmask = bs ~sigcantmask; + p-p_sigmask = bs; break; default: @@ -624,8 +622,6 @@ linux_sys_sigprocmask(p, v, retval) break; } - splx(s); - return (error); } @@ -644,7 +640,6 @@ linux_sys_rt_sigprocmask(p, v, retval) linux_sigset_t ls; sigset_t bs; int error = 0; - int s; if (SCARG(uap, sigsetsize) != sizeof(linux_sigset_t)) return (EINVAL); @@ -667,19 +662,18 @@ linux_sys_rt_sigprocmask(p, v, retval) linux_to_bsd_sigset(ls, bs); - s = splhigh(); - + bs = ~sigcantmask; switch (SCARG(uap, how)) { case LINUX_SIG_BLOCK: - p-p_sigmask |= bs ~sigcantmask; + atomic_setbits_int(p-p_sigmask, bs); break; case LINUX_SIG_UNBLOCK: - p-p_sigmask = ~bs; +
Re: help X11 performance: make sigprocmask(2) SY_NOLOCK
Date: Fri, 8 Nov 2013 12:34:42 +0100 From: Martin Pieuchot mpieuc...@nolizard.org On 20/06/13(Thu) 15:18, Martin Pelikan wrote: And p_sigmask is copied during fork, so that needs a bit of thought. I guess it doesn't matter for the new child, as it isn't running yet and therefore can't invoke sigprocmask(2). And the parent should be safe as well, as it is in fork(2) and therefore can't call sigprocmask(2) either. memcpy(p_startcopy) which is, tadaaa, p_sigmask! From implementations of memcpy I've seen (usage of SSE or non-temporal instructions to begin with) it seems that no guarantee about how big a chunk does it copy atomically can be assumed... So a multithreaded process doing sigprocmask(2) in one thread and simultaneously fork(2) in the other? Not a problem either, because 1003.1 2004 edition says that sigprocmask(2) is equivalent to pthread_sigmask(3) and the fork only copies the other thread, not bothering with the sigprocmask-ing one. I haven't checked the code that thoroughly but this is what POSIX says. There's also a couple of additional p-p_sigmask |= ... in kern_sig.c. All of these need to become atomic_setbits_int/atomic_clearbits_int calls. Does the third one do the charm? I haven't compiled the other arch's, but i386 is running fine. So what's going on with this diff? fwiw I tried it on macppc and amd64. I'm also wondering, don't you think ptsignal() needs to be modified as well? Because it might access p_sigmask various times, so are you sure your diff won't introduce any side effect if the mask is modified in between? Or is it simply impossible? I came to the conclusion that in its current form, this diff is *not* safe :(. There is indeed a problem in ptsignal(). Essentially there is a race between deciding to which thread to deliver the signal to and the actual delivery. This particularly applies to the case where you are blocking a signal; unblocking a signal is probably safe. Index: arch/hp300/hp300/trap.c === RCS file: /cvs/src/sys/arch/hp300/hp300/trap.c,v retrieving revision 1.62 diff -u -p -r1.62 trap.c --- arch/hp300/hp300/trap.c 31 Dec 2012 06:46:13 - 1.62 +++ arch/hp300/hp300/trap.c 20 Jun 2013 12:51:46 - @@ -321,7 +321,7 @@ dopanic: i = sigmask(SIGILL); p-p_sigacts-ps_sigignore = ~i; p-p_sigacts-ps_sigcatch = ~i; - p-p_sigmask = ~i; + atomic_clearbits_int(p-p_sigmask, i); i = SIGILL; ucode = frame.f_format; /* XXX was ILL_RESAD_FAULT */ typ = ILL_COPROC; Index: arch/mvme68k/mvme68k/trap.c === RCS file: /cvs/src/sys/arch/mvme68k/mvme68k/trap.c,v retrieving revision 1.77 diff -u -p -r1.77 trap.c --- arch/mvme68k/mvme68k/trap.c 31 Dec 2012 06:46:13 - 1.77 +++ arch/mvme68k/mvme68k/trap.c 20 Jun 2013 12:51:47 - @@ -260,7 +260,7 @@ copyfault: i = sigmask(SIGILL); p-p_sigacts-ps_sigignore = ~i; p-p_sigacts-ps_sigcatch = ~i; - p-p_sigmask = ~i; + atomic_clearbits_int(p-p_sigmask, i); i = SIGILL; ucode = frame.f_format; /* XXX was ILL_RESAD_FAULT */ typ = ILL_COPROC; Index: compat/linux/linux_signal.c === RCS file: /cvs/src/sys/compat/linux/linux_signal.c,v retrieving revision 1.15 diff -u -p -r1.15 linux_signal.c --- compat/linux/linux_signal.c 19 Jun 2012 11:35:29 - 1.15 +++ compat/linux/linux_signal.c 20 Jun 2013 12:51:49 - @@ -584,7 +584,6 @@ linux_sys_sigprocmask(p, v, retval) linux_old_sigset_t ss; sigset_t bs; int error = 0; - int s; *retval = 0; @@ -604,19 +603,18 @@ linux_sys_sigprocmask(p, v, retval) linux_old_to_bsd_sigset(ss, bs); - s = splhigh(); - + bs = ~sigcantmask; switch (SCARG(uap, how)) { case LINUX_SIG_BLOCK: - p-p_sigmask |= bs ~sigcantmask; + atomic_setbits_int(p-p_sigmask, bs); break; case LINUX_SIG_UNBLOCK: - p-p_sigmask = ~bs; + atomic_clearbits_int(p-p_sigmask, bs); break; case LINUX_SIG_SETMASK: - p-p_sigmask = bs ~sigcantmask; + p-p_sigmask = bs; break; default: @@ -624,8 +622,6 @@ linux_sys_sigprocmask(p, v, retval) break; } - splx(s); - return (error); } @@ -644,7 +640,6 @@ linux_sys_rt_sigprocmask(p, v, retval) linux_sigset_t ls; sigset_t bs; int error = 0; - int s; if (SCARG(uap, sigsetsize) != sizeof(linux_sigset_t)) return (EINVAL); @@ -667,19 +662,18 @@
Re: help X11 performance: make sigprocmask(2) SY_NOLOCK
On Wed, Jun 19, 2013 at 07:39:15PM +, Miod Vallat wrote: - p-p_sigmask = mask ~ sigcantmask; + p-p_sigmask = mask; On the right architecture where a word store isn't atomic enough and with the right compiler that decides to put p_sigmask on an address ending with 0xFFF with 4k-sized pages, we have two problems already. Holy Pumpkin forbid, struct proc layout is sane enough for fields to be aligned to their natural alignment, and the allocator will return properly aligned structs as well. I'm only asking if such a situation can happen, or if there is some ensure_this_assignment_is_always_atomic(p-p_sigmask, mask); function that I missed. If such a function exists, it should be shot, if only for having a too long name. Miod But watch out, as kettenis@ already mentioned elsewhere, this can only be safe if *all* reads and modifications to the mask are done in an atomic way. Straight read and assignment of int values is safe, afaik, but e.g. struct assignment and memcpy are *not* atomic. -Otto
Re: help X11 performance: make sigprocmask(2) SY_NOLOCK
On 2013/06/20 09:38, Otto Moerbeek wrote: On Wed, Jun 19, 2013 at 07:39:15PM +, Miod Vallat wrote: - p-p_sigmask = mask ~ sigcantmask; + p-p_sigmask = mask; On the right architecture where a word store isn't atomic enough and with the right compiler that decides to put p_sigmask on an address ending with 0xFFF with 4k-sized pages, we have two problems already. Holy Pumpkin forbid, struct proc layout is sane enough for fields to be aligned to their natural alignment, and the allocator will return properly aligned structs as well. I'm only asking if such a situation can happen, or if there is some ensure_this_assignment_is_always_atomic(p-p_sigmask, mask); function that I missed. If such a function exists, it should be shot, if only for having a too long name. Miod But watch out, as kettenis@ already mentioned elsewhere, this can only be safe if *all* reads and modifications to the mask are done in an atomic way. Straight read and assignment of int values is safe, afaik, but e.g. struct assignment and memcpy are *not* atomic. I spotted a few places where p_sigmask is = or |='d, for example these from linux compat, might these be an issue? p-p_sigmask = ~bs; p-p_sigmask |= bs ~sigcantmask;
Re: help X11 performance: make sigprocmask(2) SY_NOLOCK
I think this is only changed on the context of curproc, so there will only be one dude modifying it at a time. Problem I see is if there is other code which reads p_sigmask twice, since it losed the atomicity between two reads, don't know what can happen. This was exactly my point. Now that we've established aligned int writes are fine, there's the wrong assumption that two reads from that memory (even created by the compiler) must have to have the same value. On a quick look it doesn't seem to happen. Looking at the code more closely: - DIAGNOSTIC panic(postsig action) could be triggered due to this (maybe not, I haven't checked where is postsig() called from and if the user has a chance to change the sigmask, but I don't see the point of that panic there -- what do others think?) - I completely forgot about compat_linux syscalls. This new version fixes them as well (yes the code is duplicated, do we want to change that?) and then there was one useless splhigh(). Also, is there any situation where reading the old value is not ok ? this would only happen if another cpu tries to check p_sigmask of something else than curproc. It seems there isn't. -- Martin Pelikan Index: compat/linux/linux_signal.c === RCS file: /cvs/src/sys/compat/linux/linux_signal.c,v retrieving revision 1.15 diff -u -p -r1.15 linux_signal.c --- compat/linux/linux_signal.c 19 Jun 2012 11:35:29 - 1.15 +++ compat/linux/linux_signal.c 20 Jun 2013 10:12:40 - @@ -584,7 +584,6 @@ linux_sys_sigprocmask(p, v, retval) linux_old_sigset_t ss; sigset_t bs; int error = 0; - int s; *retval = 0; @@ -604,19 +603,18 @@ linux_sys_sigprocmask(p, v, retval) linux_old_to_bsd_sigset(ss, bs); - s = splhigh(); - + bs = ~sigcantmask; switch (SCARG(uap, how)) { case LINUX_SIG_BLOCK: - p-p_sigmask |= bs ~sigcantmask; + atomic_setbits_int(p-p_sigmask, bs); break; case LINUX_SIG_UNBLOCK: - p-p_sigmask = ~bs; + atomic_clearbits_int(p-p_sigmask, bs); break; case LINUX_SIG_SETMASK: - p-p_sigmask = bs ~sigcantmask; + p-p_sigmask = bs; break; default: @@ -624,8 +622,6 @@ linux_sys_sigprocmask(p, v, retval) break; } - splx(s); - return (error); } @@ -644,7 +640,6 @@ linux_sys_rt_sigprocmask(p, v, retval) linux_sigset_t ls; sigset_t bs; int error = 0; - int s; if (SCARG(uap, sigsetsize) != sizeof(linux_sigset_t)) return (EINVAL); @@ -667,19 +662,18 @@ linux_sys_rt_sigprocmask(p, v, retval) linux_to_bsd_sigset(ls, bs); - s = splhigh(); - + bs = ~sigcantmask; switch (SCARG(uap, how)) { case LINUX_SIG_BLOCK: - p-p_sigmask |= bs ~sigcantmask; + atomic_setbits_int(p-p_sigmask, bs); break; case LINUX_SIG_UNBLOCK: - p-p_sigmask = ~bs; + atomic_clearbits_int(p-p_sigmask, bs); break; case LINUX_SIG_SETMASK: - p-p_sigmask = bs ~sigcantmask; + p-p_sigmask = bs; break; default: @@ -687,8 +681,6 @@ linux_sys_rt_sigprocmask(p, v, retval) break; } - splx(s); - return (error); } @@ -727,16 +719,13 @@ linux_sys_sigsetmask(p, v, retval) } */ *uap = v; linux_old_sigset_t mask; sigset_t bsdsig; - int s; bsd_to_linux_old_sigset(p-p_sigmask, (linux_old_sigset_t *)retval); mask = SCARG(uap, mask); bsd_to_linux_old_sigset(bsdsig, mask); - s = splhigh(); p-p_sigmask = bsdsig ~sigcantmask; - splx(s); return (0); } Index: kern/init_sysent.c === RCS file: /cvs/src/sys/kern/init_sysent.c,v retrieving revision 1.145 diff -u -p -r1.145 init_sysent.c --- kern/init_sysent.c 9 Jun 2013 13:10:27 - 1.145 +++ kern/init_sysent.c 20 Jun 2013 10:12:41 - @@ -136,7 +136,7 @@ struct sysent sysent[] = { sys_sigaction },/* 46 = sigaction */ { 0, 0, 0, sys_getgid }, /* 47 = getgid */ - { 2, s(struct sys_sigprocmask_args), 0, + { 2, s(struct sys_sigprocmask_args), SY_NOLOCK | 0, sys_sigprocmask }, /* 48 = sigprocmask */ { 2, s(struct sys_getlogin_args), 0, sys_getlogin }, /* 49 = getlogin */ Index: kern/kern_sig.c === RCS file: /cvs/src/sys/kern/kern_sig.c,v retrieving revision 1.152 diff -u -p -r1.152 kern_sig.c --- kern/kern_sig.c 1 Jun 2013
Re: help X11 performance: make sigprocmask(2) SY_NOLOCK
On Thu, Jun 20, 2013 at 12:35:32PM +0200, Martin Pelikan wrote: I think this is only changed on the context of curproc, so there will only be one dude modifying it at a time. Problem I see is if there is other code which reads p_sigmask twice, since it losed the atomicity between two reads, don't know what can happen. This was exactly my point. Now that we've established aligned int writes are fine, there's the wrong assumption that two reads from that memory (even created by the compiler) must have to have the same value. On a quick look it doesn't seem to happen. Looking at the code more closely: - DIAGNOSTIC panic(postsig action) could be triggered due to this (maybe not, I haven't checked where is postsig() called from and if the user has a chance to change the sigmask, but I don't see the point of that panic there -- what do others think?) - I completely forgot about compat_linux syscalls. This new version fixes them as well (yes the code is duplicated, do we want to change that?) and then there was one useless splhigh(). Also, is there any situation where reading the old value is not ok ? this would only happen if another cpu tries to check p_sigmask of something else than curproc. It seems there isn't. -- Martin Pelikan Index: compat/linux/linux_signal.c === RCS file: /cvs/src/sys/compat/linux/linux_signal.c,v retrieving revision 1.15 diff -u -p -r1.15 linux_signal.c --- compat/linux/linux_signal.c 19 Jun 2012 11:35:29 - 1.15 +++ compat/linux/linux_signal.c 20 Jun 2013 10:12:40 - @@ -584,7 +584,6 @@ linux_sys_sigprocmask(p, v, retval) linux_old_sigset_t ss; sigset_t bs; int error = 0; - int s; *retval = 0; @@ -604,19 +603,18 @@ linux_sys_sigprocmask(p, v, retval) linux_old_to_bsd_sigset(ss, bs); - s = splhigh(); - + bs = ~sigcantmask; switch (SCARG(uap, how)) { case LINUX_SIG_BLOCK: - p-p_sigmask |= bs ~sigcantmask; + atomic_setbits_int(p-p_sigmask, bs); break; case LINUX_SIG_UNBLOCK: - p-p_sigmask = ~bs; + atomic_clearbits_int(p-p_sigmask, bs); I'm pretty sure this is not correct with your change to bs before the switch statement. Same for the similar code below. break; case LINUX_SIG_SETMASK: - p-p_sigmask = bs ~sigcantmask; + p-p_sigmask = bs; break; default: @@ -624,8 +622,6 @@ linux_sys_sigprocmask(p, v, retval) break; } - splx(s); - return (error); } @@ -644,7 +640,6 @@ linux_sys_rt_sigprocmask(p, v, retval) linux_sigset_t ls; sigset_t bs; int error = 0; - int s; if (SCARG(uap, sigsetsize) != sizeof(linux_sigset_t)) return (EINVAL); @@ -667,19 +662,18 @@ linux_sys_rt_sigprocmask(p, v, retval) linux_to_bsd_sigset(ls, bs); - s = splhigh(); - + bs = ~sigcantmask; switch (SCARG(uap, how)) { case LINUX_SIG_BLOCK: - p-p_sigmask |= bs ~sigcantmask; + atomic_setbits_int(p-p_sigmask, bs); break; case LINUX_SIG_UNBLOCK: - p-p_sigmask = ~bs; + atomic_clearbits_int(p-p_sigmask, bs); break; case LINUX_SIG_SETMASK: - p-p_sigmask = bs ~sigcantmask; + p-p_sigmask = bs; break; default: @@ -687,8 +681,6 @@ linux_sys_rt_sigprocmask(p, v, retval) break; } - splx(s); - return (error); } @@ -727,16 +719,13 @@ linux_sys_sigsetmask(p, v, retval) } */ *uap = v; linux_old_sigset_t mask; sigset_t bsdsig; - int s; bsd_to_linux_old_sigset(p-p_sigmask, (linux_old_sigset_t *)retval); mask = SCARG(uap, mask); bsd_to_linux_old_sigset(bsdsig, mask); - s = splhigh(); p-p_sigmask = bsdsig ~sigcantmask; - splx(s); return (0); } Index: kern/init_sysent.c === RCS file: /cvs/src/sys/kern/init_sysent.c,v retrieving revision 1.145 diff -u -p -r1.145 init_sysent.c --- kern/init_sysent.c9 Jun 2013 13:10:27 - 1.145 +++ kern/init_sysent.c20 Jun 2013 10:12:41 - @@ -136,7 +136,7 @@ struct sysent sysent[] = { sys_sigaction },/* 46 = sigaction */ { 0, 0, 0, sys_getgid }, /* 47 = getgid */ - { 2, s(struct sys_sigprocmask_args), 0, + { 2, s(struct sys_sigprocmask_args), SY_NOLOCK | 0, sys_sigprocmask }, /* 48 = sigprocmask */ { 2, s(struct sys_getlogin_args), 0, sys_getlogin }, /* 49 = getlogin */
Re: help X11 performance: make sigprocmask(2) SY_NOLOCK
Date: Thu, 20 Jun 2013 11:30:11 +0100 From: Stuart Henderson st...@openbsd.org On 2013/06/20 09:38, Otto Moerbeek wrote: On Wed, Jun 19, 2013 at 07:39:15PM +, Miod Vallat wrote: - p-p_sigmask = mask ~ sigcantmask; + p-p_sigmask = mask; On the right architecture where a word store isn't atomic enough and with the right compiler that decides to put p_sigmask on an address ending with 0xFFF with 4k-sized pages, we have two problems already. Holy Pumpkin forbid, struct proc layout is sane enough for fields to be aligned to their natural alignment, and the allocator will return properly aligned structs as well. I'm only asking if such a situation can happen, or if there is some ensure_this_assignment_is_always_atomic(p-p_sigmask, mask); function that I missed. If such a function exists, it should be shot, if only for having a too long name. Miod But watch out, as kettenis@ already mentioned elsewhere, this can only be safe if *all* reads and modifications to the mask are done in an atomic way. Straight read and assignment of int values is safe, afaik, but e.g. struct assignment and memcpy are *not* atomic. And p_sigmask is copied during fork, so that needs a bit of thought. I guess it doesn't matter for the new child, as it isn't running yet and therefore can't invoke sigprocmask(2). And the parent should be safe as well, as it is in fork(2) and therefore can't call sigprocmask(2) either. I spotted a few places where p_sigmask is = or |='d, for example these from linux compat, might these be an issue? p-p_sigmask = ~bs; p-p_sigmask |= bs ~sigcantmask; There's also a couple of additional p-p_sigmask |= ... in kern_sig.c. All of these need to become atomic_setbits_int/atomic_clearbits_int calls.
Re: help X11 performance: make sigprocmask(2) SY_NOLOCK
Date: Wed, 19 Jun 2013 13:40:19 +0200 From: Martin Pelikan martin.peli...@gmail.com If you're right that atomic_{clear,set}bits_int is correct and sufficient and actually faster, then all dynamic executables would benefit from this speedup (sigprocmask is used in ld.so(1)). Since on i386 GENERIC these atomic_* things don't emit the LOCK prefix, performance shouldn't be an issue; I'm actually more worried about this bit: - p-p_sigmask = mask ~ sigcantmask; + p-p_sigmask = mask; On the right architecture where a word store isn't atomic enough and with the right compiler that decides to put p_sigmask on an address ending with 0xFFF with 4k-sized pages, we have two problems already. I'm only asking if such a situation can happen, or if there is some ensure_this_assignment_is_always_atomic(p-p_sigmask, mask); function that I missed. There isn't one. Unfortunately, we might need one to support SMP on hppa, where we have the emulate atomic operations using a lock. In fact you need to worry about reading as well. And since ptsignal() looks at the p_sigmask of other threads in the process, it is not immediately obvious there isn't a problem here. I think the fact that only the thread itself can change its sigmask means that there isn't an issue here. But that probably means that bothering with atomic_setbits_int/atomic_clearbits_int isn't necessary in the first place.
Re: help X11 performance: make sigprocmask(2) SY_NOLOCK
Date: Thu, 20 Jun 2013 14:25:49 +0300 From: Paul Irofti p...@irofti.net + bs = ~sigcantmask; switch (SCARG(uap, how)) { case LINUX_SIG_BLOCK: - p-p_sigmask |= bs ~sigcantmask; + atomic_setbits_int(p-p_sigmask, bs); break; case LINUX_SIG_UNBLOCK: - p-p_sigmask = ~bs; + atomic_clearbits_int(p-p_sigmask, bs); I'm pretty sure this is not correct with your change to bs before the switch statement. No that's fine. The bits in sigcantmask should never be set. So atomic_clearbits_int(p-p_sigmask, bs ~sigcanmask); is equivalent to atomic_clearbits_int(p-p_sigmask, bs);
Re: help X11 performance: make sigprocmask(2) SY_NOLOCK
On Thu, Jun 20, 2013 at 02:01:26PM +0200, Mark Kettenis wrote: Date: Thu, 20 Jun 2013 14:25:49 +0300 From: Paul Irofti p...@irofti.net + bs = ~sigcantmask; switch (SCARG(uap, how)) { case LINUX_SIG_BLOCK: - p-p_sigmask |= bs ~sigcantmask; + atomic_setbits_int(p-p_sigmask, bs); break; case LINUX_SIG_UNBLOCK: - p-p_sigmask = ~bs; + atomic_clearbits_int(p-p_sigmask, bs); I'm pretty sure this is not correct with your change to bs before the switch statement. No that's fine. The bits in sigcantmask should never be set. So atomic_clearbits_int(p-p_sigmask, bs ~sigcanmask); is equivalent to atomic_clearbits_int(p-p_sigmask, bs); Right, nevermind. Carry on.
Re: help X11 performance: make sigprocmask(2) SY_NOLOCK
I'm only asking if such a situation can happen, or if there is some ensure_this_assignment_is_always_atomic(p-p_sigmask, mask); function that I missed. There isn't one. Unfortunately, we might need one to support SMP on hppa, where we have the emulate atomic operations using a lock. In fact you need to worry about reading as well. And since ptsignal() looks at the p_sigmask of other threads in the process, it is not immediately obvious there isn't a problem here. Does this mean that on a potentially SMP hppa just an unlocked aligned word load can produce invalid data? The HP's reference seems to be 'unavailable'... I think the fact that only the thread itself can change its sigmask means that there isn't an issue here. But that probably means that bothering with atomic_setbits_int/atomic_clearbits_int isn't necessary in the first place. Why was the splhigh() there in the first place? I thought that on i386 and the like, AND to a memory location isn't atomic (without the LOCK prefix) and therefore other thread reading the value might read garbage. Whereas MOV is ok. Obviously, sane architectures doing LD, AND and ST are ok as well, as long as the ST is atomic (I don't have any experience with hppa and random search on the internet suggested that pa-riscs do have word stores atomically). -- Martin Pelikan
Re: help X11 performance: make sigprocmask(2) SY_NOLOCK
And p_sigmask is copied during fork, so that needs a bit of thought. I guess it doesn't matter for the new child, as it isn't running yet and therefore can't invoke sigprocmask(2). And the parent should be safe as well, as it is in fork(2) and therefore can't call sigprocmask(2) either. memcpy(p_startcopy) which is, tadaaa, p_sigmask! From implementations of memcpy I've seen (usage of SSE or non-temporal instructions to begin with) it seems that no guarantee about how big a chunk does it copy atomically can be assumed... So a multithreaded process doing sigprocmask(2) in one thread and simultaneously fork(2) in the other? Not a problem either, because 1003.1 2004 edition says that sigprocmask(2) is equivalent to pthread_sigmask(3) and the fork only copies the other thread, not bothering with the sigprocmask-ing one. I haven't checked the code that thoroughly but this is what POSIX says. There's also a couple of additional p-p_sigmask |= ... in kern_sig.c. All of these need to become atomic_setbits_int/atomic_clearbits_int calls. Does the third one do the charm? I haven't compiled the other arch's, but i386 is running fine. -- Martin Pelikan Index: arch/hp300/hp300/trap.c === RCS file: /cvs/src/sys/arch/hp300/hp300/trap.c,v retrieving revision 1.62 diff -u -p -r1.62 trap.c --- arch/hp300/hp300/trap.c 31 Dec 2012 06:46:13 - 1.62 +++ arch/hp300/hp300/trap.c 20 Jun 2013 12:51:46 - @@ -321,7 +321,7 @@ dopanic: i = sigmask(SIGILL); p-p_sigacts-ps_sigignore = ~i; p-p_sigacts-ps_sigcatch = ~i; - p-p_sigmask = ~i; + atomic_clearbits_int(p-p_sigmask, i); i = SIGILL; ucode = frame.f_format; /* XXX was ILL_RESAD_FAULT */ typ = ILL_COPROC; Index: arch/mvme68k/mvme68k/trap.c === RCS file: /cvs/src/sys/arch/mvme68k/mvme68k/trap.c,v retrieving revision 1.77 diff -u -p -r1.77 trap.c --- arch/mvme68k/mvme68k/trap.c 31 Dec 2012 06:46:13 - 1.77 +++ arch/mvme68k/mvme68k/trap.c 20 Jun 2013 12:51:47 - @@ -260,7 +260,7 @@ copyfault: i = sigmask(SIGILL); p-p_sigacts-ps_sigignore = ~i; p-p_sigacts-ps_sigcatch = ~i; - p-p_sigmask = ~i; + atomic_clearbits_int(p-p_sigmask, i); i = SIGILL; ucode = frame.f_format; /* XXX was ILL_RESAD_FAULT */ typ = ILL_COPROC; Index: compat/linux/linux_signal.c === RCS file: /cvs/src/sys/compat/linux/linux_signal.c,v retrieving revision 1.15 diff -u -p -r1.15 linux_signal.c --- compat/linux/linux_signal.c 19 Jun 2012 11:35:29 - 1.15 +++ compat/linux/linux_signal.c 20 Jun 2013 12:51:49 - @@ -584,7 +584,6 @@ linux_sys_sigprocmask(p, v, retval) linux_old_sigset_t ss; sigset_t bs; int error = 0; - int s; *retval = 0; @@ -604,19 +603,18 @@ linux_sys_sigprocmask(p, v, retval) linux_old_to_bsd_sigset(ss, bs); - s = splhigh(); - + bs = ~sigcantmask; switch (SCARG(uap, how)) { case LINUX_SIG_BLOCK: - p-p_sigmask |= bs ~sigcantmask; + atomic_setbits_int(p-p_sigmask, bs); break; case LINUX_SIG_UNBLOCK: - p-p_sigmask = ~bs; + atomic_clearbits_int(p-p_sigmask, bs); break; case LINUX_SIG_SETMASK: - p-p_sigmask = bs ~sigcantmask; + p-p_sigmask = bs; break; default: @@ -624,8 +622,6 @@ linux_sys_sigprocmask(p, v, retval) break; } - splx(s); - return (error); } @@ -644,7 +640,6 @@ linux_sys_rt_sigprocmask(p, v, retval) linux_sigset_t ls; sigset_t bs; int error = 0; - int s; if (SCARG(uap, sigsetsize) != sizeof(linux_sigset_t)) return (EINVAL); @@ -667,19 +662,18 @@ linux_sys_rt_sigprocmask(p, v, retval) linux_to_bsd_sigset(ls, bs); - s = splhigh(); - + bs = ~sigcantmask; switch (SCARG(uap, how)) { case LINUX_SIG_BLOCK: - p-p_sigmask |= bs ~sigcantmask; + atomic_setbits_int(p-p_sigmask, bs); break; case LINUX_SIG_UNBLOCK: - p-p_sigmask = ~bs; + atomic_clearbits_int(p-p_sigmask, bs); break; case LINUX_SIG_SETMASK: - p-p_sigmask = bs ~sigcantmask; + p-p_sigmask = bs; break; default: @@ -687,8 +681,6 @@ linux_sys_rt_sigprocmask(p, v, retval) break; } - splx(s); - return (error); } @@ -727,16 +719,13 @@ linux_sys_sigsetmask(p, v, retval)
help X11 performance: make sigprocmask(2) SY_NOLOCK
Hi! Recently, I had to unleash my crappy Atom and decided to run OpenBSD on it. Then I became irritated with the performance of most X11 applications, and started poking around: - libfreetype.so is quite common - it does setjmp (disguised as ft_setjmp) quite a lot - setjmp needs to do sigprocmask - sigprocmask (I presume) needs the kernel lock. sigprocmask(2) is quite simple actually, but wasn't marked as NOLOCK, so I made it so. And then I made a pointless test to demonstrate my point, and ran it on my pointless Atom N270 (two threads, one core from 1980s) as well as in VMware, which is the only 4-core amd64 machine I can get on right now. (did I say anything about using virtual machines to measure things?) #include setjmp.h int main(void) { jmp_buf env; unsigned i; /* on amd64 this was too slow, so 20M was used instead */ for (i = 0; i 400; ++i) { setjmp(env); } return (0); } Before: Atom, i386, single process, 0m3.80s real 0m1.15s user 0m2.50s system Atom, i386, two processes, 0m4.81s real 0m1.44s user 0m3.09s system amd64, single process, i20M, 0m3.35s real 0m0.65s user 0m2.23s system amd64, two processes, i20M,0m4.48s real 0m0.66s user 0m3.37s system After: Atom, i386, single process, 0m2.93s real 0m0.95s user 0m1.35s system Atom, i386, two processes, 0m3.67s real 0m1.36s user 0m2.15s system amd64, single process, i20M, 0m2.14s real 0m0.63s user 0m1.12s system amd64, two processes, i20M,0m2.57s real 0m0.57s user 0m1.55s system If anyone has a real test with some real-world application using lots of setjmps, I'd like to see it. What'd I like to see most, is if does it work on other architectures and if it is indeed correct. Although the numbers suggest quite an improvement, bear in mind this kind of load does not at all resemble any software actually doing useful stuff. Any comments? -- Martin Pelikan Index: kern/init_sysent.c === RCS file: /cvs/src/sys/kern/init_sysent.c,v retrieving revision 1.145 diff -u -p -r1.145 init_sysent.c --- kern/init_sysent.c 9 Jun 2013 13:10:27 - 1.145 +++ kern/init_sysent.c 19 Jun 2013 09:49:57 - @@ -136,7 +136,7 @@ struct sysent sysent[] = { sys_sigaction },/* 46 = sigaction */ { 0, 0, 0, sys_getgid }, /* 47 = getgid */ - { 2, s(struct sys_sigprocmask_args), 0, + { 2, s(struct sys_sigprocmask_args), SY_NOLOCK | 0, sys_sigprocmask }, /* 48 = sigprocmask */ { 2, s(struct sys_getlogin_args), 0, sys_getlogin }, /* 49 = getlogin */ Index: kern/kern_sig.c === RCS file: /cvs/src/sys/kern/kern_sig.c,v retrieving revision 1.152 diff -u -p -r1.152 kern_sig.c --- kern/kern_sig.c 1 Jun 2013 04:05:26 - 1.152 +++ kern/kern_sig.c 19 Jun 2013 09:49:57 - @@ -429,28 +429,25 @@ sys_sigprocmask(struct proc *p, void *v, syscallarg(sigset_t) mask; } */ *uap = v; int error = 0; - int s; sigset_t mask; *retval = p-p_sigmask; - mask = SCARG(uap, mask); - s = splhigh(); + mask = SCARG(uap, mask) ~ sigcantmask; switch (SCARG(uap, how)) { case SIG_BLOCK: - p-p_sigmask |= mask ~ sigcantmask; + atomic_setbits_int(p-p_sigmask, mask); break; case SIG_UNBLOCK: - p-p_sigmask = ~mask; + atomic_clearbits_int(p-p_sigmask, mask); break; case SIG_SETMASK: - p-p_sigmask = mask ~ sigcantmask; + p-p_sigmask = mask; break; default: error = EINVAL; break; } - splx(s); return (error); } Index: kern/syscalls.master === RCS file: /cvs/src/sys/kern/syscalls.master,v retrieving revision 1.133 diff -u -p -r1.133 syscalls.master --- kern/syscalls.master9 Jun 2013 13:10:19 - 1.133 +++ kern/syscalls.master19 Jun 2013 09:49:57 - @@ -123,7 +123,7 @@ const struct sigaction *nsa, \ struct sigaction *osa); } 47 STD { gid_t sys_getgid(void); } -48 STD { int sys_sigprocmask(int how, sigset_t mask); } +48 STD NOLOCK { int sys_sigprocmask(int how, sigset_t mask); } 49 STD { int sys_getlogin(char *namebuf, u_int namelen); } 50 STD { int sys_setlogin(const char *namebuf); } #ifdef ACCOUNTING
Re: help X11 performance: make sigprocmask(2) SY_NOLOCK
Martin Pelikan martin.peli...@gmail.com writes: Hi! Recently, I had to unleash my crappy Atom and decided to run OpenBSD on it. Then I became irritated with the performance of most X11 applications, and started poking around: - libfreetype.so is quite common - it does setjmp (disguised as ft_setjmp) quite a lot - setjmp needs to do sigprocmask - sigprocmask (I presume) needs the kernel lock. If you're right that atomic_{clear,set}bits_int is correct and sufficient and actually faster, then all dynamic executables would benefit from this speedup (sigprocmask is used in ld.so(1)). sigprocmask(2) is quite simple actually, but wasn't marked as NOLOCK, so I made it so. And then I made a pointless test to demonstrate my point, and ran it on my pointless Atom N270 (two threads, one core from 1980s) as well as in VMware, which is the only 4-core amd64 machine I can get on right now. (did I say anything about using virtual machines to measure things?) #include setjmp.h int main(void) { jmp_buf env; unsigned i; /* on amd64 this was too slow, so 20M was used instead */ for (i = 0; i 400; ++i) { setjmp(env); } return (0); } Before: Atom, i386, single process, 0m3.80s real 0m1.15s user 0m2.50s system Atom, i386, two processes,0m4.81s real 0m1.44s user 0m3.09s system amd64, single process, i20M, 0m3.35s real 0m0.65s user 0m2.23s system amd64, two processes, i20M, 0m4.48s real 0m0.66s user 0m3.37s system After: Atom, i386, single process, 0m2.93s real 0m0.95s user 0m1.35s system Atom, i386, two processes,0m3.67s real 0m1.36s user 0m2.15s system amd64, single process, i20M, 0m2.14s real 0m0.63s user 0m1.12s system amd64, two processes, i20M, 0m2.57s real 0m0.57s user 0m1.55s system If anyone has a real test with some real-world application using lots of setjmps, I'd like to see it. What'd I like to see most, is if does it work on other architectures and if it is indeed correct. Although the numbers suggest quite an improvement, bear in mind this kind of load does not at all resemble any software actually doing useful stuff. Any comments? -- Martin Pelikan Index: kern/init_sysent.c === RCS file: /cvs/src/sys/kern/init_sysent.c,v retrieving revision 1.145 diff -u -p -r1.145 init_sysent.c --- kern/init_sysent.c9 Jun 2013 13:10:27 - 1.145 +++ kern/init_sysent.c19 Jun 2013 09:49:57 - @@ -136,7 +136,7 @@ struct sysent sysent[] = { sys_sigaction },/* 46 = sigaction */ { 0, 0, 0, sys_getgid }, /* 47 = getgid */ - { 2, s(struct sys_sigprocmask_args), 0, + { 2, s(struct sys_sigprocmask_args), SY_NOLOCK | 0, sys_sigprocmask }, /* 48 = sigprocmask */ { 2, s(struct sys_getlogin_args), 0, sys_getlogin }, /* 49 = getlogin */ Index: kern/kern_sig.c === RCS file: /cvs/src/sys/kern/kern_sig.c,v retrieving revision 1.152 diff -u -p -r1.152 kern_sig.c --- kern/kern_sig.c 1 Jun 2013 04:05:26 - 1.152 +++ kern/kern_sig.c 19 Jun 2013 09:49:57 - @@ -429,28 +429,25 @@ sys_sigprocmask(struct proc *p, void *v, syscallarg(sigset_t) mask; } */ *uap = v; int error = 0; - int s; sigset_t mask; *retval = p-p_sigmask; - mask = SCARG(uap, mask); - s = splhigh(); + mask = SCARG(uap, mask) ~ sigcantmask; switch (SCARG(uap, how)) { case SIG_BLOCK: - p-p_sigmask |= mask ~ sigcantmask; + atomic_setbits_int(p-p_sigmask, mask); break; case SIG_UNBLOCK: - p-p_sigmask = ~mask; + atomic_clearbits_int(p-p_sigmask, mask); break; case SIG_SETMASK: - p-p_sigmask = mask ~ sigcantmask; + p-p_sigmask = mask; break; default: error = EINVAL; break; } - splx(s); return (error); } Index: kern/syscalls.master === RCS file: /cvs/src/sys/kern/syscalls.master,v retrieving revision 1.133 diff -u -p -r1.133 syscalls.master --- kern/syscalls.master 9 Jun 2013 13:10:19 - 1.133 +++ kern/syscalls.master 19 Jun 2013 09:49:57 - @@ -123,7 +123,7 @@ const struct sigaction *nsa, \ struct sigaction *osa); } 47 STD { gid_t sys_getgid(void); } -48 STD { int sys_sigprocmask(int how, sigset_t mask); } +48 STD NOLOCK { int sys_sigprocmask(int how, sigset_t mask); } 49 STD { int
Re: help X11 performance: make sigprocmask(2) SY_NOLOCK
If you're right that atomic_{clear,set}bits_int is correct and sufficient and actually faster, then all dynamic executables would benefit from this speedup (sigprocmask is used in ld.so(1)). Since on i386 GENERIC these atomic_* things don't emit the LOCK prefix, performance shouldn't be an issue; I'm actually more worried about this bit: - p-p_sigmask = mask ~ sigcantmask; + p-p_sigmask = mask; On the right architecture where a word store isn't atomic enough and with the right compiler that decides to put p_sigmask on an address ending with 0xFFF with 4k-sized pages, we have two problems already. I'm only asking if such a situation can happen, or if there is some ensure_this_assignment_is_always_atomic(p-p_sigmask, mask); function that I missed. -- Martin Pelikan
Re: help X11 performance: make sigprocmask(2) SY_NOLOCK
On Wed, Jun 19, 2013 at 01:40:19PM +0200, Martin Pelikan wrote: If you're right that atomic_{clear,set}bits_int is correct and sufficient and actually faster, then all dynamic executables would benefit from this speedup (sigprocmask is used in ld.so(1)). Since on i386 GENERIC these atomic_* things don't emit the LOCK prefix, performance shouldn't be an issue; I'm actually more worried about this bit: - p-p_sigmask = mask ~ sigcantmask; + p-p_sigmask = mask; On the right architecture where a word store isn't atomic enough and with the right compiler that decides to put p_sigmask on an address ending with 0xFFF with 4k-sized pages, we have two problems already. If I read the code correctly, just adding an alignment constraint to pool_init for the proc_pool would fix that 2nd part.
Re: help X11 performance: make sigprocmask(2) SY_NOLOCK
On Wed, Jun 19, 2013 at 14:19, Marc Espie wrote: On Wed, Jun 19, 2013 at 01:40:19PM +0200, Martin Pelikan wrote: If you're right that atomic_{clear,set}bits_int is correct and sufficient and actually faster, then all dynamic executables would benefit from this speedup (sigprocmask is used in ld.so(1)). Since on i386 GENERIC these atomic_* things don't emit the LOCK prefix, performance shouldn't be an issue; I'm actually more worried about this bit: -p-p_sigmask = mask ~ sigcantmask; +p-p_sigmask = mask; On the right architecture where a word store isn't atomic enough and with the right compiler that decides to put p_sigmask on an address ending with 0xFFF with 4k-sized pages, we have two problems already. If I read the code correctly, just adding an alignment constraint to pool_init for the proc_pool would fix that 2nd part. pool will never give you an object that spans pages (unless the object itself is larger than a page). p_sigmask will always be aligned nicely. i mean, if it weren't, that code already would crash.
Re: help X11 performance: make sigprocmask(2) SY_NOLOCK
- p-p_sigmask = mask ~ sigcantmask; + p-p_sigmask = mask; On the right architecture where a word store isn't atomic enough and with the right compiler that decides to put p_sigmask on an address ending with 0xFFF with 4k-sized pages, we have two problems already. Holy Pumpkin forbid, struct proc layout is sane enough for fields to be aligned to their natural alignment, and the allocator will return properly aligned structs as well. I'm only asking if such a situation can happen, or if there is some ensure_this_assignment_is_always_atomic(p-p_sigmask, mask); function that I missed. If such a function exists, it should be shot, if only for having a too long name. Miod