Re: Performance issue
I think I've found the problem: Python uses setjmp/longjmp to protect against SIGFPE every time it does floating point operations. The python script does not actually use threads, and libpthread assumes non-threaded processes are system scope. So, it would end up using the sigprocmask syscall, even though it doesn't really need to. The diff at http://people.freebsd.org/~ssouhlal/testing/thr_sigmask-20050509.diff fixes this, by making sure the process is threaded, before using the syscall.

Note that the setjmp/longjmp code is only active if Python is ./configure'd with --with-fpectl, which has been standard for the ports-built Python for a long time. ISTR that this was because FreeBSD didn't mask SIGFPE by default, while Linux and many other OSes do. I also seem to recall that this may have changed in the evolution of 5.x. If so, perhaps use of this configure option in the port needs to be reviewed for 5.x and later.

Well, I don't know what else it breaks, but for this microbenchmark, compiling python-2.4.1 without --with-fpectl works swimmingly well for me.
Not only does it bring the system time way down, but the user time is down too, to about 5/7 of its previous value:

5.3-RELEASE / without --with-fpectl

       48.78 real        48.22 user         0.15 sys
     23372  maximum resident set size
       657  average shared memory size
     20817  average unshared data size
       128  average unshared stack size
      5402  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
      4889  involuntary context switches

compared with

5.3-RELEASE / with --with-fpectl

      106.59 real        67.25 user        38.57 sys
     23140  maximum resident set size
       660  average shared memory size
     20818  average unshared data size
       128  average unshared stack size
      5402  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
     10678  involuntary context switches

I tentatively second Andrew's proposal that the use of this configure option in the port needs to be reviewed for 5.x and later, pending independent confirmation of the efficacy of this fix.

-e

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]
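The "about 5/7" estimate can be checked directly from the two reports. A minimal sketch, using only the real/user/sys figures quoted in this message; the dictionary and function names are illustrative, not part of any tool:

```python
# Timing figures (seconds) copied from the two /usr/bin/time reports
# in this message. The variable names are illustrative only.
with_fpectl = {"real": 106.59, "user": 67.25, "sys": 38.57}
without_fpectl = {"real": 48.78, "user": 48.22, "sys": 0.15}


def ratio(field):
    """Return the without/with ratio for one timing field."""
    return without_fpectl[field] / with_fpectl[field]


user_ratio = ratio("user")
real_ratio = ratio("real")
sys_saved = with_fpectl["sys"] - without_fpectl["sys"]

print("user ratio: %.3f (5/7 = %.3f)" % (user_ratio, 5 / 7))
print("real ratio: %.3f" % real_ratio)
print("system time saved: %.2f s" % sys_saved)
```

The user-time ratio lands within one percentage point of 5/7, and nearly all of the system time disappears.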
Performance issue
Hi All,

I have what I think is a serious performance issue with fbsd 5.3 release. I've read about threading issues, and it seems to me that that is what I'm looking at, but I'm not confident enough to rule out that it might be a hardware issue, a kernel configuration issue, or something to do with the python port. I'd appreciate it if someone would point it out if I'm overlooking something obvious. Otherwise, if it is the problem I think it is, then there seems to be entirely too little acknowledgement of a major issue.

Here's the background. I just got a new (to me) AMD machine and put 5.3 release on it. I'd been very happy with the way my old Intel machine had been performing with 4.10 stable, and I decided to run a simple performance diagnostic on both machines, to wow myself with the amazing performance of the new hardware / kernel combination. However, the result was pretty disappointing.

Here are what I think are the pertinent dmesg details.

Old rig:

FreeBSD 4.10-RELEASE #0: Thu Jul 1 22:47:08 EDT 2004
Timecounter i8254 frequency 1193182 Hz
Timecounter TSC frequency 449235058 Hz
CPU: Pentium III/Pentium III Xeon/Celeron (449.24-MHz 686-class CPU)

New rig:

FreeBSD 5.3-RELEASE #0: Fri Nov 5 04:19:18 UTC 2004
Timecounter i8254 frequency 1193182 Hz quality 0
CPU: AMD Athlon(tm) Processor (995.77-MHz 686-class CPU)
Timecounter ACPI-fast frequency 3579545 Hz quality 1000
Timecounter TSC frequency 995767383 Hz quality 800
Timecounters tick every 10.000 msec

The diagnostic I selected was a python program to generate 1 million pseudo-random numbers and then to perform a heap sort on them. That code is included at the foot of this email. I named the file heapsort.py. I ran it on both machines, using the time utility in /usr/bin/ (not the builtin tcsh time).
So the command line was

    /usr/bin/time -al -o heapsort.data ./heapsort.py 100

A typical result for the old rig was

      130.78 real       129.86 user         0.11 sys
     22344  maximum resident set size
       608  average shared memory size
     20528  average unshared data size
       128  average unshared stack size
      5360  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
      2386  involuntary context switches

Whereas, the typical result for the new rig looked more like

      105.36 real        71.10 user        33.41 sys
     23376  maximum resident set size
       659  average shared memory size
     20796  average unshared data size
       127  average unshared stack size
      5402  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
     10548  involuntary context switches

You'll notice that the new rig is indeed a little faster (times in seconds): 105.36 real (new rig) compared with 130.78 real (old rig). However, the new rig spends about 33.41 seconds on system overhead compared with just 0.11 seconds on the old rig.

Comparing the rusage stats, the only significant difference is the involuntary context switches field, where the old rig has 2386 and the new rig has a whopping 10548. Further, I noticed that the number of context switches on the new rig seems to be more or less exactly one per 10 msec of real time, that is, one per timecounter tick. (I saw this when comparing heapsort.py runs with arguments other than 100.)

I think the new rig ought to execute this task in about 70 seconds: just over the amount of user time. Assuming that I'm not overlooking something obvious, and that I'm not interpreting a feature as a bug, this business with the context switches strikes me as a bit of a show-stopper. If that's right, it appears to be severely underplayed in the release documentation.
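The one-switch-per-10-msec observation follows from dividing the involuntary context switch count by the real time of each run. A small sketch of that arithmetic, using only the figures reported above; the labels and the helper name are illustrative:

```python
# Real time (seconds) and involuntary context switches, copied from the
# two rusage reports above. Labels and helper name are illustrative.
runs = {
    "old rig (4.10)": {"real": 130.78, "ivcsw": 2386},
    "new rig (5.3)": {"real": 105.36, "ivcsw": 10548},
}


def switches_per_second(run):
    """Involuntary context switches per second of real time."""
    return run["ivcsw"] / run["real"]


rate_old = switches_per_second(runs["old rig (4.10)"])
rate_new = switches_per_second(runs["new rig (5.3)"])

for label, run in runs.items():
    rate = switches_per_second(run)
    print("%-15s %6.1f switches/s, one every %.1f ms"
          % (label, rate, 1000.0 / rate))
```

The new rig comes out at roughly 100 switches per second, which matches the 10.000 msec tick in its dmesg output; the old rig is closer to 18 per second.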
I'll be happy if someone would kindly explain to me what's going on here. I'll be even happier to hear of a fix or workaround to remedy the situation.

Thanks in advance,

-e

heapsort.py:

#!/usr/local/bin/python -O
# $Id: heapsort-python-3.code,v 1.3 2005/04/04 14:56:45 bfulgham Exp $
#
# The Great Computer Language Shootout
# http://shootout.alioth.debian.org/
#
# Updated by Valentino Volonghi for Python 2.4
# Reworked by Kevin Carson to produce correct results and same intent

import sys

IM = 139968
IA = 3877
IC = 29573

LAST = 42
def gen_random(max) :
    global LAST
    LAST = (LAST * IA + IC) % IM
    return( (max * LAST) / IM )

def heapsort(n, ra) :
    ir = n
    l = (n >> 1) + 1

    while True :
        if l > 1 :
            l -= 1
            rra = ra[l]
        else :
            rra = ra[ir]
            ra[ir] = ra[1]
            ir -= 1
            if ir == 1 :
                ra[1] =
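The archived listing breaks off partway through heapsort(). For readers who want something runnable, here is a minimal sketch of the same kind of benchmark, written for Python 3. It is not the original shootout file: it assumes the usual sift-down heap sort on a 1-based array, and make_random, run, the n < 2 guard, and the fixed problem size are illustrative choices rather than anything taken from the source.

```python
# Same linear congruential constants as the listing above.
IM = 139968
IA = 3877
IC = 29573


def make_random(seed=42):
    """Return a gen_random(maximum) function with its own LCG state.

    make_random is a hypothetical helper; the listing uses a global.
    """
    state = seed

    def gen_random(maximum):
        nonlocal state
        state = (state * IA + IC) % IM
        return maximum * state / IM

    return gen_random


def heapsort(n, ra):
    """Sort ra[1..n] in place; ra[0] is an unused placeholder."""
    if n < 2:  # guard added for this sketch
        return
    ir = n
    l = (n >> 1) + 1
    while True:
        if l > 1:
            # Heap-building phase: take the next internal node.
            l -= 1
            rra = ra[l]
        else:
            # Extraction phase: move the current maximum to the end.
            rra = ra[ir]
            ra[ir] = ra[1]
            ir -= 1
            if ir == 1:
                ra[1] = rra
                return
        # Sift rra down from position l to where it belongs.
        i = l
        j = l << 1
        while j <= ir:
            if j < ir and ra[j] < ra[j + 1]:
                j += 1
            if rra < ra[j]:
                ra[i] = ra[j]
                i = j
                j += j
            else:
                break
        ra[i] = rra


def run(n):
    """Generate n pseudo-random floats, sort them, return the largest."""
    gen_random = make_random()
    ra = [0.0] + [gen_random(1.0) for _ in range(n)]
    heapsort(n, ra)
    return ra[n]


if __name__ == "__main__":
    print("%.10f" % run(1000))
```

The workload is dominated by floating-point multiplies and comparisons in pure Python, which is the kind of code path the timing discussion in this thread is about.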
Re: Performance issue
Whereas, the typical result for the new rig looked more like 105.36 real 71.10 user 33.41 sys ... 10548 involuntary context switches

First of all, make sure that you have WITNESS and INVARIANTS off in your kernel. You might also want to recompile your kernel with the SMP option turned off.

Scott

First of all, thanks to Mike Tancsa for suggesting 5.4 RC4 and to Pete French for running the test independently on the higher spec machines with 5.4 RC4 on them, confirming the system time thing, ruling out an AMD problem, dissociating the system time result from the context switching, and saving me the trouble of rediscovering the same problem on 5.4 RC4. This is my first foray into the public world of FreeBSD discussion lists, and I am encouraged by the helpfulness of the response.

Scott, the 5.3 kernel I had was essentially a GENERIC release kernel, with about 100 options commented out. WITNESS and INVARIANTS are off by default, which I confirmed by looking through `sysctl -a`. However, I was curious to see what I would get if I switched them on, so I added these options and recompiled the kernel:

options KDB
options DDB
options INVARIANTS
options INVARIANT_SUPPORT
options WITNESS
options WITNESS_SKIPSPIN

The result, below, has essentially the same user time (or just less, if that makes any sense), but tripled system time. The context switches are consistent with the one-per-10msec I saw before.

Is there anything useful I can do while I still have the kernel debug options on?
-e

      172.29 real        67.53 user       103.07 sys
     23376  maximum resident set size
       659  average shared memory size
     20805  average unshared data size
       127  average unshared stack size
      5402  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
     17234  involuntary context switches
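The user, sys, and context switch figures printed by /usr/bin/time -l are also available to a running script through getrusage. A minimal sketch, assuming Python's standard resource module (Unix only), that samples those counters around a float-heavy loop; float_work and measure are hypothetical helper names and the loop is only a stand-in workload, not the benchmark from this thread:

```python
import resource


def float_work(n):
    """Do n floating-point multiply-adds in pure Python."""
    x = 0.0
    for i in range(n):
        x += i * 0.5
    return x


def measure(func, *args):
    """Run func(*args) and return (result, rusage deltas for this process)."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    result = func(*args)
    after = resource.getrusage(resource.RUSAGE_SELF)
    delta = {
        "user": after.ru_utime - before.ru_utime,
        "sys": after.ru_stime - before.ru_stime,
        "voluntary": after.ru_nvcsw - before.ru_nvcsw,
        "involuntary": after.ru_nivcsw - before.ru_nivcsw,
    }
    return result, delta


if __name__ == "__main__":
    _, usage = measure(float_work, 1000000)
    for name in ("user", "sys", "voluntary", "involuntary"):
        print("%-12s %s" % (name, usage[name]))
```

Wrapping individual phases this way (number generation, then the sort) would show which phase accumulates the system time, something the whole-process totals from /usr/bin/time cannot separate.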
Re: Performance issue
5.3 ships with SMP turned on, which makes lock operations rather expensive on single-processor machines. 4.x does not have SMP turned on by default. Would you be able to re-run your test with SMP turned off?

I'm pretty sure there's no SMP in this kernel.

#cd /usr/src/sys/i386/conf
#fgrep SMP MYKERNEL
#

GENERIC has no SMP in it, but there's a second GENERIC kernel conf called SMP, which simply says:

include GENERIC
options SMP

However, sysctl seems to show smp not active, but not disabled. Is that anything to worry about?

#sysctl -a | grep smp
kern.smp.maxcpus: 1
kern.smp.active: 0
kern.smp.disabled: 0
kern.smp.cpus: 1
debug.psmpkterrthresh: 2

-e