Re: Performance issue
Hi,

On May 10, 2005, at 1:24 AM, Daniel Eischen wrote:
> No, libc_r wraps execve() and a lot of other syscalls that libpthread or libthr don't need to. Take a look at libc_r/uthread/uthread_execve.c and you will see it sets the signal mask before exec()ing.

Couldn't we do the same thing in libpthread, in the non-threaded case? I apologize if I'm asking stupid questions.. :)

--
Suleiman Souhlal | [EMAIL PROTECTED]
The FreeBSD Project | [EMAIL PROTECTED]

___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Performance issue
On Tue, 10 May 2005, Suleiman Souhlal wrote:
> On May 10, 2005, at 1:24 AM, Daniel Eischen wrote:
>> No, libc_r wraps execve() and a lot of other syscalls that libpthread or libthr don't need to. Take a look at libc_r/uthread/uthread_execve.c and you will see it sets the signal mask before exec()ing.
>
> Couldn't we do the same thing in libpthread, in the non-threaded case? I apologize if I'm asking stupid questions.. :)

No ;-) We don't want to wrap functions unnecessarily. Applications not linked to a thread library still have to use the actual syscall, so there's no point in wrapping extra functions just to make sigprocmask() faster when linked with libpthread.

--
DE
Re: Performance issue
Suleiman Souhlal wrote:
> On May 9, 2005, at 3:54 PM, Daniel Eischen wrote:
>> The threading libraries don't play with the signal mask. In fact, libpthread has userland versions of sigprocmask() et al. and won't even make the syscall unless the threads are system scope. There is a special thread in libpthread that handles signals which does use the system sigprocmask(), but unless the application is making heavy use of signals in general, it shouldn't matter.
>
> I think I've found the problem: Python uses setjmp/longjmp to protect against SIGFPE every time it does floating point operations. The python script does not actually use threads, and libpthread assumes non-threaded processes are system scope. So, it would end up using the sigprocmask syscall, even though it doesn't really need to. The diff at http://people.freebsd.org/~ssouhlal/testing/thr_sigmask-20050509.diff fixes this, by making sure the process is threaded before using the syscall.

Note that the setjmp/longjmp code is only active if Python is ./configure'd with --with-fpectl, which has been standard for the ports-built Python for a long time. ISTR that this was because FreeBSD didn't mask SIGFPE by default, while Linux and many other OSes do. I also seem to recall that this may have changed in the evolution of 5.x. If so, perhaps use of this configure option in the port needs to be reviewed for 5.x and later.
--
Andrew I MacIntyre               | "These thoughts are mine alone..."
E-mail: [EMAIL PROTECTED] (pref) | Snail: PO Box 370
        [EMAIL PROTECTED] (alt)  |        Belconnen ACT 2616
Web:    http://www.andymac.org/  |        Australia
Re: Performance issue
> I think I've found the problem: Python uses setjmp/longjmp to protect against SIGFPE every time it does floating point operations. The python script does not actually use threads, and libpthread assumes non-threaded processes are system scope. So, it would end up using the sigprocmask syscall, even though it doesn't really need to. The diff at http://people.freebsd.org/~ssouhlal/testing/thr_sigmask-20050509.diff fixes this, by making sure the process is threaded before using the syscall.
>
> Note that the setjmp/longjmp code is only active if Python is ./configure'd with --with-fpectl, which has been standard for the ports-built Python for a long time. ISTR that this was because FreeBSD didn't mask SIGFPE by default, while Linux and many other OSes do. I also seem to recall that this may have changed in the evolution of 5.x. If so, perhaps use of this configure option in the port needs to be reviewed for 5.x and later.

Well, I don't know what else it breaks, but for this microbenchmark, compiling python-2.4.1 without --with-fpectl works swimmingly well for me.
Not only does it bring the system time way down, but the user time is down too, to about 5/7 of its previous value.

5.3-RELEASE / without --with-fpectl:

       48.78 real        48.22 user         0.15 sys
     23372  maximum resident set size
       657  average shared memory size
     20817  average unshared data size
       128  average unshared stack size
      5402  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
      4889  involuntary context switches

compared with 5.3-RELEASE / with --with-fpectl:

      106.59 real        67.25 user        38.57 sys
     23140  maximum resident set size
       660  average shared memory size
     20818  average unshared data size
       128  average unshared stack size
      5402  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
     10678  involuntary context switches

I tentatively second Andrew's proposal that the use of this configure option in the port needs to be reviewed for 5.x and later, pending independent confirmation of the efficacy of this fix.

-e
Re: Performance issue
Ewan Todd wrote:

Hi All,

I have what I think is a serious performance issue with fbsd 5.3 release. I've read about threading issues, and it seems to me that that is what I'm looking at, but I'm not confident enough to rule out that it might be a hardware issue, a kernel configuration issue, or something to do with the python port. I'd appreciate it if someone would point it out if I'm overlooking something obvious. Otherwise, if it is the problem I think it is, then there seems entirely too little acknowledgement of a major issue.

Here's the background. I just got a new (to me) AMD machine and put 5.3 release on it. I'd been very happy with the way my old Intel machine had been performing with 4.10 stable, and I decided to run a simple performance diagnostic on both machines, to wow myself with the amazing performance of the new hardware / kernel combination. However, the result was pretty disappointing. Here are what I think are the pertinent dmesg details.

Old rig:

FreeBSD 4.10-RELEASE #0: Thu Jul 1 22:47:08 EDT 2004
Timecounter "i8254" frequency 1193182 Hz
Timecounter "TSC" frequency 449235058 Hz
CPU: Pentium III/Pentium III Xeon/Celeron (449.24-MHz 686-class CPU)

New rig:

FreeBSD 5.3-RELEASE #0: Fri Nov 5 04:19:18 UTC 2004
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: AMD Athlon(tm) Processor (995.77-MHz 686-class CPU)
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
Timecounter "TSC" frequency 995767383 Hz quality 800
Timecounters tick every 10.000 msec

The diagnostic I selected was a python program to generate 1 million pseudo-random numbers and then to perform a heap sort on them. That code is included at the foot of this email. I named the file heapsort.py. I ran it on both machines, using the time utility in /usr/bin/ (not the builtin tcsh time).
So the command line was:

/usr/bin/time -al -o heapsort.data ./heapsort.py 100

A typical result for the old rig was:

      130.78 real       129.86 user         0.11 sys
     22344  maximum resident set size
       608  average shared memory size
     20528  average unshared data size
       128  average unshared stack size
      5360  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
      2386  involuntary context switches

whereas the typical result for the new rig looked more like:

      105.36 real        71.10 user        33.41 sys
     23376  maximum resident set size
       659  average shared memory size
     20796  average unshared data size
       127  average unshared stack size
      5402  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
     10548  involuntary context switches

You'll notice that the new rig is indeed a little faster (times in seconds): 105.36 real (new rig) compared with 130.78 real (old rig). However, the new rig spends about 33.41 seconds on system overhead compared with just 0.11 seconds on the old rig. Comparing the rusage stats, the only significant difference is the involuntary context switches field, where the old rig has 2386 and the new rig has a whopping 10548. Further, I noticed that the number of context switches on the new rig seems to be more or less exactly one per 10 msec of real time, that is, one per timecounter tick. (I saw this when comparing heapsort.py runs with arguments other than 100.)

I think the new rig ought to execute this task in about 70 seconds: just over the amount of user time. Assuming that I'm not overlooking something obvious, and that I'm not interpreting a feature as a bug, this business with the context switches strikes me as a bit of a show-stopper. If that's right, it appears to be severely underplayed in the release documentation.
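[Editorial note: the heapsort.py script referenced above is not preserved in this archive chunk. Purely as an illustrative stand-in — my reconstruction, not Ewan's original code — a benchmark of the same shape (generate pseudo-random floats, heap-sort them in pure Python, scaled down from a million elements) might look like:]

```python
import random
import sys
import time

def heapsort(a):
    """In-place heapsort: build a max-heap, then repeatedly swap the
    root to the end of the shrinking heap and sift the new root down."""
    n = len(a)

    def sift_down(start, end):
        root = start
        while 2 * root + 1 <= end:
            child = 2 * root + 1
            # Pick the larger of the two children.
            if child + 1 <= end and a[child] < a[child + 1]:
                child += 1
            if a[root] < a[child]:
                a[root], a[child] = a[child], a[root]
                root = child
            else:
                return

    for start in range(n // 2 - 1, -1, -1):   # heapify
        sift_down(start, n - 1)
    for end in range(n - 1, 0, -1):           # extract maximum repeatedly
        a[end], a[0] = a[0], a[end]
        sift_down(0, end - 1)

if __name__ == "__main__":
    # The original test used 1,000,000 numbers; default to fewer here.
    count = int(sys.argv[1]) if len(sys.argv) > 1 else 10000
    data = [random.random() for _ in range(count)]
    t0 = time.time()
    heapsort(data)
    print("%.2f seconds" % (time.time() - t0), file=sys.stderr)
```

The point of the workload is that a tight pure-Python floating point loop like this exercises exactly the per-operation SIGFPE machinery discussed later in the thread.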
I'll be happy if someone would kindly explain to me what's going on here. I'll be even happier to hear of a fix or workaround to remedy the situation. Thanks in advance,

-e

First of all, make sure that you have WITNESS and INVARIANTS off in your kernel. You might also want to recompile your kernel with the SMP option turned off.

Scott
Re: Performance issue
At 11:00 AM 09/05/2005, Ewan Todd wrote:
> Here's the background. I just got a new (to me) AMD machine and put 5.3 release on it. I'd been very happy with the way my old Intel

There have been quite a few changes since 5.3. If you are starting fresh, I would strongly recommend going to 5.4 RC4. There have been a number of improvements that might help the problems you are seeing.

---Mike
Re: Performance issue
> Whereas, the typical result for the new rig looked more like
>       105.36 real        71.10 user        33.41 sys
> ...
>     10548  involuntary context switches

Now I just ran this test myself. This machine is a 2.4 gig P4 with hyperthreading enabled. Much as I am an AMD fan, I would expect a 1 gig Athlon to be significantly slower than a 2.4 gig Pentium 4, but:

       93.45 real        56.55 user        36.85 sys
      1857  involuntary context switches

Uhhh... so it takes almost the same time to do the calculation, but spends actually *more* of it in system space. It does far fewer context switches though, but I am assuming that's due to HTT. The numbers look very odd to me.

So I then ran it on another P4 system we have round here which is still running 4.11. This is a 2.66 gig P4, not a 2.4, so it should be a bit faster, but:

       33.77 real        33.49 user         0.07 sys
       711  involuntary context switches

Over two and a half times faster?! That's not right at all! All the new systems I have tried are 5.4-RC4, so should be the latest and greatest. When my colleague finishes on his machine I can try a GENERIC 5.4-RC4 kernel on another P4 and see what that gives.

-pcf.
RE: Performance issue
Same test on a 5.3-STABLE from 31.01.2005:

       81,90 real        77,05 user         3,51 sys
     22908  maximum resident set size
       620  average shared memory size
     20083  average unshared data size
       128  average unshared stack size
      5379  page reclaims
        26  page faults
         0  swaps
        36  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
        62  voluntary context switches
     10623  involuntary context switches

This is on a slow dual-processor system:

Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel Pentium III (732.98-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0x683  Stepping = 3
  Features=0x387fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,PN,MMX,FXSR,SSE>
real memory  = 2281635840 (2175 MB)
avail memory = 2232012800 (2128 MB)
ACPI APIC Table: <PTLTD  APIC  >
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP):  APIC ID:  1

(if this is of any help). Scheduler is 4BSD.

Regards,
Holger Kipp

-----Original Message-----
From: [EMAIL PROTECTED] on behalf of Pete French
Sent: Mon 09.05.2005 18:10
To: [EMAIL PROTECTED]; freebsd-stable@freebsd.org
Subject: Re: Performance issue

> Whereas, the typical result for the new rig looked more like
>       105.36 real        71.10 user        33.41 sys
> ...
>     10548  involuntary context switches

Now I just ran this test myself. This machine is a 2.4 gig P4 with hyperthreading enabled. Much as I am an AMD fan, I would expect a 1 gig Athlon to be significantly slower than a 2.4 gig Pentium 4, but:

       93.45 real        56.55 user        36.85 sys
      1857  involuntary context switches

Uhhh... so it takes almost the same time to do the calculation, but spends actually *more* of it in system space. It does far fewer context switches though, but I am assuming that's due to HTT. The numbers look very odd to me.

So I then ran it on another P4 system we have round here which is still running 4.11. This is a 2.66 gig P4, not a 2.4, so it should be a bit faster, but:

       33.77 real        33.49 user         0.07 sys
       711  involuntary context switches

Over two and a half times faster?! That's not right at all! All the new systems I have tried are 5.4-RC4, so should be the latest and greatest. When my colleague finishes on his machine I can try a GENERIC 5.4-RC4 kernel on another P4 and see what that gives.

-pcf.
Re: Performance issue
>> Whereas, the typical result for the new rig looked more like
>>       105.36 real        71.10 user        33.41 sys
>> ...
>>     10548  involuntary context switches
>
> First of all, make sure that you have WITNESS and INVARIANTS off in your kernel. You might also want to recompile your kernel with the SMP option turned off.
>
> Scott

First of all, thanks to Mike Tancsa for suggesting 5.4 RC4 and to Pete French for running the test independently on the higher spec machines with 5.4 RC4 on them, confirming the system time thing, ruling out an AMD problem, dissociating the system time result from the context switching, and saving me the trouble of rediscovering the same problem on 5.4 RC4. This is my first foray into the public world of FreeBSD discussion lists, and I am encouraged by the helpfulness of the response.

Scott, the 5.3 kernel I had was essentially a GENERIC release kernel, with about 100 options commented out. WITNESS and INVARIANTS are off by default, which I confirmed by looking through `sysctl -a`. However, I was curious to see what I would get if I switched them on, so I added these options and recompiled the kernel:

options KDB
options DDB
options INVARIANTS
options INVARIANT_SUPPORT
options WITNESS
options WITNESS_SKIPSPIN

The result, below, has essentially the same user time (or just less, if that makes any sense), but tripled system time. The context switches are consistent with the one-per-10msec I saw before. Is there anything useful I can do while I still have the kernel debug options on?
-e

      172.29 real        67.53 user       103.07 sys
     23376  maximum resident set size
       659  average shared memory size
     20805  average unshared data size
       127  average unshared stack size
      5402  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
     17234  involuntary context switches
Re: Performance issue
Ewan Todd wrote:
> Scott, the 5.3 kernel I had was essentially a GENERIC release kernel, with about 100 options commented out. WITNESS and INVARIANTS are off by default, which I confirmed by looking through `sysctl -a`. However, I was curious to see what I would get if I switched them on, so I added these options and recompiled the kernel: [...]
>
> The result, below, has essentially the same user time (or just less, if that makes any sense), but tripled system time. The context switches are consistent with the one-per-10msec I saw before. Is there anything useful I can do while I still have the kernel debug options on?
>
>       172.29 real        67.53 user       103.07 sys
>     [...]
>     17234  involuntary context switches

5.3 ships with SMP turned on, which makes lock operations rather expensive on single-processor machines. 4.x does not have SMP turned on by default. Would you be able to re-run your test with SMP turned off?

Scott
Re: Performance issue
> 5.3 ships with SMP turned on, which makes lock operations rather expensive on single-processor machines. 4.x does not have SMP turned on by default. Would you be able to re-run your test with SMP turned off?

I'm pretty sure there's no SMP in this kernel.

# cd /usr/src/sys/i386/conf
# fgrep SMP MYKERNEL
#

GENERIC has no SMP in it, but there's a second GENERIC kernel conf called SMP, which simply says:

include GENERIC
options SMP

However, sysctl seems to show smp not active, but not disabled. Is that anything to worry about?

# sysctl -a | grep smp
kern.smp.maxcpus: 1
kern.smp.active: 0
kern.smp.disabled: 0
kern.smp.cpus: 1
debug.psmpkterrthresh: 2

-e
Re: Performance issue
Ewan Todd wrote:
> I'm pretty sure there's no SMP in this kernel. [...] However, sysctl seems to show smp not active, but not disabled. Is that anything to worry about?

Bah, you're right, sorry for the confusion. Too many releases in my mind, they all seem like a blur.

Scott
Re: Performance issue
> 5.3 ships with SMP turned on, which makes lock operations rather expensive on single-processor machines. 4.x does not have SMP turned on by default. Would you be able to re-run your test with SMP turned off?

I just ran a test here with SMP turned off on 5.4-RC4 (GENERIC) and got the following result:

       67.52 real        41.13 user        26.16 sys
      7034  involuntary context switches

i.e. it still has system time as a huge proportion of the total compared to the 4.11 kernel.

Interestingly, after reading Holger Kipp's results I tried it on a genuine multi-processor box with SMP enabled running 5.3. He got a very small percentage of the time in sys (3.51 out of 81.90) but I got:

      255.30 real       160.20 user        88.50 sys

Once again a far higher proportion of the time spent in sys than you would expect.

-pcf.
Re: Performance issue
Scott Long wrote:
> First of all, make sure that you have WITNESS and INVARIANTS off in your kernel. You might also want to recompile your kernel with the SMP option turned off.

I can confirm this. I just re-ran it on RELENG_5_4 as of yesterday and got:

      136.52 real        80.29 user        50.16 sys
     23212  maximum resident set size
       674  average shared memory size
     20961  average unshared data size
       128  average unshared stack size
      5419  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
     25738  involuntary context switches

No debugging or SMP in kernel.

--
Best regards,
Alexander.
Re: Performance issue
Hello,

On May 9, 2005, at 1:31 PM, Pete French wrote:
> I just ran a test here with SMP turned off on 5.4-RC4 (GENERIC) and got the following result:
>
>        67.52 real        41.13 user        26.16 sys
>       7034  involuntary context switches
>
> i.e. it still has system time as a huge proportion of the total compared to the 4.11 kernel. [...] Once again a far higher proportion of the time spent in sys than you would expect.

I ran ktrace(1) on it, and it appears that python keeps calling sigprocmask() continually:

   673 python   0.07 CALL  sigprocmask(0x3,0,0x811d11c)
   673 python   0.05 RET   sigprocmask 0
   673 python   0.09 CALL  sigprocmask(0x1,0,0x8113d1c)
   673 python   0.05 RET   sigprocmask 0

etc. This explains why it's using so much system time. Now the question is why python is doing this.

--
Suleiman Souhlal | [EMAIL PROTECTED]
The FreeBSD Project | [EMAIL PROTECTED]
Re: Performance issue
On Mon, May 9, 2005 1:06 pm, Scott Long said:
> 5.3 ships with SMP turned on, which makes lock operations rather expensive on single-processor machines. 4.x does not have SMP turned on by default. Would you be able to re-run your test with SMP turned off?

This is what I get on my system, which has debugging and SMP off in the kernel.

FreeBSD 6.0-CURRENT #0: Tue May 3 23:55:43 EDT 2005
    [EMAIL PROTECTED]:/usr/obj/usr/src/sys/DP
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: AMD Athlon(tm) Processor (1410.21-MHz 686-class CPU)
  Origin = "AuthenticAMD"  Id = 0x644  Stepping = 4
  Features=0x183f9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR>
  AMD Features=0xc044<RSVD,AMIE,DSP,3DNow!>

---
       76.89 real        49.33 user        22.87 sys
     23116  maximum resident set size
       686  average shared memory size
     20795  average unshared data size
       127  average unshared stack size
      5380  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         1  voluntary context switches
     10018  involuntary context switches
---

As we can see, it is still spending a lot of time in system, and there are a lot of context switches being done.
Re: Performance issue
On Mon, 9 May 2005, Suleiman Souhlal wrote:
> I ran ktrace(1) on it, and it appears that python keeps calling sigprocmask() continually:
>
>    673 python   0.07 CALL  sigprocmask(0x3,0,0x811d11c)
>    673 python   0.05 RET   sigprocmask 0
>
> etc. This explains why it's using so much system time. Now the question is why python is doing this.

I don't know what python's doing, but it might not be calling sigprocmask directly. There are a few libc functions that use sigprocmask: db/btree/, db/hash/, pselect(), setmode(), {sig}setjmp(), {sig}longjmp(), grantpt(), and system(), to name a few. Python may also be using other libraries which use sigprocmask().

--
DE
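[Editorial note: the {sig}setjmp()/{sig}longjmp() pair is the one that turns out to matter in this thread: sigsetjmp(env, 1) must save the current signal mask and siglongjmp() must restore it, which costs a mask syscall each way once libpthread decides the process is system scope. As a rough analogy only — using Python 3's signal.pthread_sigmask, which did not exist in 2005 — the per-operation cost pattern looks like this:]

```python
import signal

def protected_op(fn):
    """Mimic what a sigsetjmp(env, 1)/siglongjmp() pair costs: save the
    signal mask around an operation and restore it afterwards.  In the
    --with-fpectl Python build, *every* float operation paid this price."""
    # SIG_BLOCK with an empty set is a cheap way to read the current mask.
    saved = signal.pthread_sigmask(signal.SIG_BLOCK, set())
    try:
        return fn()
    finally:
        # Restore the mask exactly as siglongjmp() would.
        signal.pthread_sigmask(signal.SIG_SETMASK, saved)

result = protected_op(lambda: 1.0 / 3.0)
```

Run under ktrace/strace, each protected_op() call shows the same pair of mask syscalls that Suleiman's ktrace output captured.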
Re: Performance issue
On 5/9/2005 12:31 PM, Pete French wrote:
> I just ran a test here with SMP turned off on 5.4-RC4 (GENERIC) and got the following result:
>
>        67.52 real        41.13 user        26.16 sys
>       7034  involuntary context switches
>
> i.e. it still has system time as a huge proportion of the total compared to the 4.11 kernel. [...] Once again a far higher proportion of the time spent in sys than you would expect.

I ran into a similar issue when attempting to thread a card game solver program I wrote. Performance in early versions was horrific and I noticed tons of context switches. I resolved the issue by allocating pools of memory beforehand. This seems to point the finger at malloc and context switch overhead. In any case, I believe this is related to threading. Check your results with libthr instead. The following are on my 2.53 GHz P4 which is running CURRENT from last night (with INVARIANTS on).
libpthread:

$ /usr/bin/time -al ./heapsort.py 100
0.928555
      124.04 real        65.71 user        48.47 sys
     23464  maximum resident set size
       680  average shared memory size
     21104  average unshared data size
       129  average unshared stack size
      5400  page reclaims
         0  page faults
         0  swaps
        15  block input operations
         0  block output operations
         4  messages sent
         0  messages received
         0  signals received
        21  voluntary context switches
     40274  involuntary context switches

libthr:

$ /usr/bin/time -al ./heapsort.py 100
0.928555
       79.75 real        50.63 user        25.34 sys
     23348  maximum resident set size
       679  average shared memory size
     21041  average unshared data size
       129  average unshared stack size
      5394  page reclaims
         1  page faults
         0  swaps
         2  block input operations
         0  block output operations
         3  messages sent
         0  messages received
         0  signals received
         7  voluntary context switches
     26113  involuntary context switches

--
Jonathan Noack | [EMAIL PROTECTED] | OpenPGP: 0x991D8195
Re: Performance issue
On Mon, 2005-May-09 11:00:18 -0400, Ewan Todd wrote:
> I have what I think is a serious performance issue with fbsd 5.3 release. I've read about threading issues, and it seems to me that that is what I'm looking at, but I'm not confident enough to rule out that it might be a hardware issue, a kernel configuration issue, or something to do with the python port.

There does appear to be a problem in FreeBSD. Python is built with threading enabled by default, the threading libraries play with the signal mask and there have been extensive changes there. My suggestions on things you could check are:

1) Rebuild python with threading disabled (add '-DWITHOUT_THREADS' to the 'make' command line) and see if that makes any difference.
2) Re-write the sample program in a non-threaded language - eg C or perl - and see if the high system time goes away.

Unfortunately, I can't think of a solution at present.

--
Peter Jeremy
Re: Performance issue
On Tue, 10 May 2005, Peter Jeremy wrote:
> There does appear to be a problem in FreeBSD. Python is built with threading enabled by default, the threading libraries play with the signal mask and there have been extensive changes there.

The threading libraries don't play with the signal mask. In fact, libpthread has userland versions of sigprocmask() et al. and won't even make the syscall unless the threads are system scope. There is a special thread in libpthread that handles signals which does use the system sigprocmask(), but unless the application is making heavy use of signals in general, it shouldn't matter.

> My suggestions on things you could check are:
> 1) Rebuild python with threading disabled (add '-DWITHOUT_THREADS' to the 'make' command line) and see if that makes any difference.
> 2) Re-write the sample program in a non-threaded language - eg C or perl - and see if the high system time goes away.
>
> Unfortunately, I can't think of a solution at present.

You can also set LIBPTHREAD_SYSTEM_SCOPE in the environment to force libpthread to use system scope threads. It uses process scope threads by default.

--
DE
Re: Performance issue
Daniel Eischen wrote:
> {sig}setjmp(), {sig}longjmp(),

A very wild guess... is python using setjmp/longjmp to implement continuations, tailcalls, or some mechanism similar to that, and using it in a loop?

mkb.
Re: Performance issue
On 5/9/2005 1:30 PM, Jonathan Noack wrote: On 5/9/2005 12:31 PM, Pete French wrote: 5.3 ships with SMP turned on, which makes lock operations rather expensive on single-processor machines. 4.x does not have SMP turned on by default. Would you be able to re-run your test with SMP turned off? I just ran a test here with SMP turned of on 5.4-RC4 (GENERIC) I got the following result: 67.52 real41.13 user26.16 sys 7034 involuntary context switches i.e. it still has system time a a huge proportion of the total compared to the 4.11 kernel. Interesingly, after reading Holger Kipp's results I tried it on a genuine multi-processor box with SMP enabled running 5.3. He got a very small percentage of the time in sys (3.51 out of 81.90) but I got: 255.30 real 160.20 user88.50 sys Once again a far higher proprtion of the time spent in sys than you would expect. I ran into a similar issue when attempting to thread a card game solver program I wrote. Performance in early versions was horrific and I noticed tons of context switches. I resolved the issue by allocating pools of memory beforehand. This seems to point the finger to malloc and context switch overhead. In any case, I believe this is related to threading. Check your results with libthr instead. The following are on my 2.53 GHz P4 which is running CURRENT from last night (with INVARIANTS on). 
libpthread:

    $ /usr/bin/time -al ./heapsort.py 100
    0.928555
          124.04 real        65.71 user        48.47 sys
           23464  maximum resident set size
             680  average shared memory size
           21104  average unshared data size
             129  average unshared stack size
            5400  page reclaims
               0  page faults
               0  swaps
              15  block input operations
               0  block output operations
               4  messages sent
               0  messages received
               0  signals received
              21  voluntary context switches
           40274  involuntary context switches

libthr:

    $ /usr/bin/time -al ./heapsort.py 100
    0.928555
           79.75 real        50.63 user        25.34 sys
           23348  maximum resident set size
             679  average shared memory size
           21041  average unshared data size
             129  average unshared stack size
            5394  page reclaims
               1  page faults
               0  swaps
               2  block input operations
               0  block output operations
               3  messages sent
               0  messages received
               0  signals received
               7  voluntary context switches
           26113  involuntary context switches

Oooh... same machine with libc_r:

    $ /usr/bin/time -al ./heapsort.py 100
    0.928555
           38.72 real        36.85 user         0.06 sys
           23496  maximum resident set size
             678  average shared memory size
           21126  average unshared data size
             129  average unshared stack size
            5418  page reclaims
               2  page faults
               0  swaps
               2  block input operations
               0  block output operations
               3  messages sent
               0  messages received
               0  signals received
               8  voluntary context switches
           13137  involuntary context switches

-- Jonathan Noack | [EMAIL PROTECTED] | OpenPGP: 0x991D8195
Re: Performance issue
Hello,

On May 9, 2005, at 3:54 PM, Daniel Eischen wrote:
> On Tue, 10 May 2005, Peter Jeremy wrote:
>> On Mon, 2005-May-09 11:00:18 -0400, Ewan Todd wrote:
>>> I have what I think is a serious performance issue with fbsd 5.3
>>> release. I've read about threading issues, and it seems to me that
>>> that is what I'm looking at, but I'm not confident enough to rule
>>> out that it might be a hardware issue, a kernel configuration
>>> issue, or something to do with the python port.
>> There does appear to be a problem in FreeBSD. Python is built with
>> threading enabled by default, the threading libraries play with the
>> signal mask and there have been extensive changes there.
>
> The threading libraries don't play with the signal mask. In fact,
> libpthread has userland versions of sigprocmask() et al. and won't
> even make the syscall unless the threads are system scope. There is a
> special thread in libpthread that handles signals which does use the
> system sigprocmask(), but unless the application is making heavy use
> of signals in general, it shouldn't matter.

I think I've found the problem: Python uses setjmp/longjmp to protect
against SIGFPE every time it does floating point operations. The python
script does not actually use threads, and libpthread assumes
non-threaded processes are system scope. So, it would end up using the
sigprocmask syscall, even though it doesn't really need to. The diff at
http://people.freebsd.org/~ssouhlal/testing/thr_sigmask-20050509.diff
fixes this by making sure the process is threaded before using the
syscall.

-- Suleiman Souhlal | [EMAIL PROTECTED] The FreeBSD Project | [EMAIL PROTECTED]
Re: Performance issue
On Mon, 9 May 2005, Suleiman Souhlal wrote:
> On May 9, 2005, at 3:54 PM, Daniel Eischen wrote:
>> The threading libraries don't play with the signal mask. In fact,
>> libpthread has userland versions of sigprocmask() et al. and won't
>> even make the syscall unless the threads are system scope. There is
>> a special thread in libpthread that handles signals which does use
>> the system sigprocmask(), but unless the application is making heavy
>> use of signals in general, it shouldn't matter.
>
> I think I've found the problem: Python uses setjmp/longjmp to protect
> against SIGFPE every time it does floating point operations. The
> python script does not actually use threads, and libpthread assumes
> non-threaded processes are system scope. So, it would end up using
> the sigprocmask syscall, even though it doesn't really need to. The
> diff at
> http://people.freebsd.org/~ssouhlal/testing/thr_sigmask-20050509.diff
> fixes this by making sure the process is threaded before using the
> syscall.

I don't think that patch is correct. You need the signal mask in the
kernel to match in case of an exec() after a fork(), for instance. If
the application fork()s, then changes the signal mask in the child
(which is now single threaded), then the child exec()s, the mask is
wrong.

If the process wasn't linked to libpthread, then the longjmp() and
setjmp() would still be calling the syscall, so it isn't the syscall
itself that is making things slower.

You'll notice that there are two calls to __sys_sigprocmask() in the
section of code you have patched. You could eliminate the second call
if you do some of what the remainder of the function does instead of
returning early (the locks aren't needed and pending signals don't
need to be run down).

-- DE
Re: Performance issue
On Mon, 9 May 2005, Daniel Eischen wrote:
> On Mon, 9 May 2005, Suleiman Souhlal wrote:
>> I think I've found the problem: Python uses setjmp/longjmp to
>> protect against SIGFPE every time it does floating point operations.
>> The python script does not actually use threads, and libpthread
>> assumes non-threaded processes are system scope. So, it would end up
>> using the sigprocmask syscall, even though it doesn't really need
>> to. The diff at
>> http://people.freebsd.org/~ssouhlal/testing/thr_sigmask-20050509.diff
>> fixes this by making sure the process is threaded before using the
>> syscall.
> [ ... ]
> If the process wasn't linked to libpthread, then the longjmp() and
> setjmp() would still be calling the syscall, so it isn't the syscall
> itself that is making things slower.
> You'll notice that there are two calls to __sys_sigprocmask() in the
> section of code you have patched. You could eliminate the second call
> if you do some of what the remainder of the function does instead of
> returning early (the locks aren't needed and pending signals don't
> need to be run down).

As in something like this:

    http://people.freebsd.org/~deischen/kse/thr_sigmask.c.diffs

It has not been tested.

-- DE
Re: Performance issue
On 05/09/05 18:47, Daniel Eischen wrote:
> On Mon, 9 May 2005, Daniel Eischen wrote:
>> On Mon, 9 May 2005, Suleiman Souhlal wrote:
>>> I think I've found the problem: Python uses setjmp/longjmp to
>>> protect against SIGFPE every time it does floating point
>>> operations. The python script does not actually use threads, and
>>> libpthread assumes non-threaded processes are system scope. So, it
>>> would end up using the sigprocmask syscall, even though it doesn't
>>> really need to. The diff at
>>> http://people.freebsd.org/~ssouhlal/testing/thr_sigmask-20050509.diff
>>> fixes this by making sure the process is threaded before using the
>>> syscall.
>> [ ... ]
>> If the process wasn't linked to libpthread, then the longjmp() and
>> setjmp() would still be calling the syscall, so it isn't the syscall
>> itself that is making things slower.
>> You'll notice that there are two calls to __sys_sigprocmask() in the
>> section of code you have patched. You could eliminate the second
>> call if you do some of what the remainder of the function does
>> instead of returning early (the locks aren't needed and pending
>> signals don't need to be run down).
>
> As in something like this:
> http://people.freebsd.org/~deischen/kse/thr_sigmask.c.diffs
> It has not been tested.

When I tried to test this, every threaded program died with sig 11.
Does this require me to recompile the program before it will work?

-- Jonathan Noack | [EMAIL PROTECTED] | OpenPGP: 0x991D8195
Re: Performance issue
Hello,

On May 9, 2005, at 7:21 PM, Daniel Eischen wrote:
> I don't think that patch is correct. You need the signal mask in the
> kernel to match in case of an exec() after a fork(), for instance. If
> the application fork()s, then changes the signal mask in the child
> (which is now single threaded), then the child exec()s, the mask is
> wrong.
> If the process wasn't linked to libpthread, then the longjmp() and
> setjmp() would still be calling the syscall, so it isn't the syscall
> itself that is making things slower.
> You'll notice that there are two calls to __sys_sigprocmask() in the
> section of code you have patched. You could eliminate the second call
> if you do some of what the remainder of the function does instead of
> returning early (the locks aren't needed and pending signals don't
> need to be run down).

Processes linked with libc_r NEVER call the syscall, once they have
started (after rtld-elf):

    zZzZ:~/py% LD_LIBMAP=libpthread.so.1=libc_r.so.5 ktrace -t c python heapsort.py 1 > /dev/null
    zZzZ:~/py% kdump -T | grep sigprocmask
     2991 python   1115698354.240301 CALL  sigprocmask(0x1,0x2810a820,0xbfbfea60)
     2991 python   1115698354.240304 RET   sigprocmask 0
     2991 python   1115698354.240307 CALL  sigprocmask(0x3,0x2810a830,0)
     2991 python   1115698354.240308 RET   sigprocmask 0
    zZzZ:~/py%

compare with libpthread:

    zZzZ:~/py% ktrace -t c python heapsort.py 1 > /dev/null
    zZzZ:~/py% kdump -T | grep -c sigprocmask
    92114
    zZzZ:~/py%

Is this a bug in libc_r?

-- Suleiman Souhlal | [EMAIL PROTECTED] The FreeBSD Project | [EMAIL PROTECTED]
Re: Performance issue
On Mon, 9 May 2005, Jonathan Noack wrote:
> On 05/09/05 18:47, Daniel Eischen wrote:
>> If the process wasn't linked to libpthread, then the longjmp() and
>> setjmp() would still be calling the syscall, so it isn't the syscall
>> itself that is making things slower.
>> You'll notice that there are two calls to __sys_sigprocmask() in the
>> section of code you have patched. You could eliminate the second
>> call if you do some of what the remainder of the function does
>> instead of returning early (the locks aren't needed and pending
>> signals don't need to be run down).
>>
>> As in something like this:
>> http://people.freebsd.org/~deischen/kse/thr_sigmask.c.diffs
>> It has not been tested.
>
> When I tried to test this, every threaded program died with sig 11.
> Does this require me to recompile the program before it will work?

No, the patch just must have a bug in it.

-- DE
Re: Performance issue
On Tue, 10 May 2005, Suleiman Souhlal wrote:
> Hello,
> On May 9, 2005, at 7:21 PM, Daniel Eischen wrote:
>> I don't think that patch is correct. You need the signal mask in the
>> kernel to match in case of an exec() after a fork(), for instance.
>> If the application fork()s, then changes the signal mask in the
>> child (which is now single threaded), then the child exec()s, the
>> mask is wrong.
>> If the process wasn't linked to libpthread, then the longjmp()
                                      ^^ or any thread library.
>> and setjmp() would still be calling the syscall, so it isn't the
>> syscall itself that is making things slower.
>> You'll notice that there are two calls to __sys_sigprocmask() in the
>> section of code you have patched. You could eliminate the second
>> call if you do some of what the remainder of the function does
>> instead of returning early (the locks aren't needed and pending
>> signals don't need to be run down).
>
> Processes linked with libc_r NEVER call the syscall, once they have
> started (after rtld-elf): [...]
>
> Is this a bug in libc_r?

No, libc_r wraps execve() and a lot of other syscalls that libpthread
or libthr don't need to. Take a look at libc_r/uthread/uthread_execve.c
and you will see it sets the signal mask before exec()ing.

-- DE