Re: Performance issue

2005-05-10 Thread Suleiman Souhlal
Hi,
On May 10, 2005, at 1:24 AM, Daniel Eischen wrote:
No, libc_r wraps execve() and a lot of other syscalls that libpthread
or libthr don't need to.  Take a look at libc_r/uthread/uthread_execve.c
and you will see it sets the signal mask before exec()ing.
Couldn't we do the same thing in libpthread, in the not-threaded case?
I apologize if I'm asking stupid questions.. :)
--
Suleiman Souhlal | [EMAIL PROTECTED]
The FreeBSD Project  | [EMAIL PROTECTED]
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Performance issue

2005-05-10 Thread Daniel Eischen
On Tue, 10 May 2005, Suleiman Souhlal wrote:

 Hi,

 On May 10, 2005, at 1:24 AM, Daniel Eischen wrote:

  No, libc_r wraps execve() and a lot of other syscalls that libpthread
  or libthr don't need to.  Take a look at libc_r/uthread/uthread_execve.c
  and you will see it sets the signal mask before exec()ing.

 Couldn't we do the same thing in libpthread, in the not-threaded case?
 I apologize if I'm asking stupid questions.. :)

No ;-)  We don't want to wrap functions unnecessarily.  Applications
not linked to a thread library still have to use the actual syscall,
so there's no point in wrapping extra functions just to make
sigprocmask() faster when linked with libpthread.

-- 
DE



Re: Performance issue

2005-05-10 Thread Andrew MacIntyre
Suleiman Souhlal wrote:
Hello,
On May 9, 2005, at 3:54 PM, Daniel Eischen wrote:
On Tue, 10 May 2005, Peter Jeremy wrote:

On Mon, 2005-May-09 11:00:18 -0400, Ewan Todd wrote:
I have what I think is a serious performance issue with fbsd 5.3
release.  I've read about threading issues, and it seems to me that
that is what I'm looking at, but I'm not confident enough to rule  out
that it might be a hardware issue, a kernel configuration issue, or
something to do with the python port.
There does appear to be a problem in FreeBSD.  Python is built with
threading enabled by default, the threading libraries play with the
signal mask and there have been extensive changes there.  My
The threading libraries don't play with the signal mask.  In fact,
libpthread has userland versions of sigprocmask() et al. and won't
even make the syscall() unless the threads are system scope.  There
is a special thread in libpthread that handles signals which does
use the system sigprocmask(), but unless the application is
making heavy use of signals in general, it shouldn't matter.

I think I've found the problem: Python uses setjmp/longjmp to protect
against SIGFPE every time it does floating point operations. The python
script does not actually use threads, and libpthread assumes
non-threaded processes are system scope. So, it would end up using the
sigprocmask syscall, even though it doesn't really need to.
The diff at http://people.freebsd.org/~ssouhlal/testing/thr_sigmask-20050509.diff
fixes this, by making sure the process is threaded before using the syscall.

Note that the setjmp/longjmp code is only active if Python is
./configure'd with --with-fpectl, which has been standard for the
ports-built Python for a long time.

ISTR that this was because FreeBSD didn't mask SIGFPE by default, while
Linux and many other OSes do.  I also seem to recall that this may have 
changed in the evolution of 5.x.  If so, perhaps use of this configure
option in the port needs to be reviewed for 5.x and later.

-
Andrew I MacIntyre These thoughts are mine alone...
E-mail: [EMAIL PROTECTED]  (pref) | Snail: PO Box 370
   [EMAIL PROTECTED] (alt) |Belconnen ACT 2616
Web:http://www.andymac.org/   |Australia


Re: Performance issue

2005-05-10 Thread Ewan Todd
 
 I think I've found the problem: Python uses setjmp/longjmp to protect
 against SIGFPE every time it does floating point operations. The python
 script does not actually use threads, and libpthread assumes
 non-threaded processes are system scope. So, it would end up using the
 sigprocmask syscall, even though it doesn't really need to.
 The diff at http://people.freebsd.org/~ssouhlal/testing/thr_sigmask-20050509.diff
 fixes this, by making sure the process is threaded before using the syscall.
 
 Note that the setjmp/longjmp code is only active if Python is
 ./configure'd with --with-fpectl, which has been standard for the
 ports-built Python for a long time.
 
 ISTR that this was because FreeBSD didn't mask SIGFPE by default, while
 Linux and many other OSes do.  I also seem to recall that this may have 
 changed in the evolution of 5.x.  If so, perhaps use of this configure
 option in the port needs to be reviewed for 5.x and later.

Well, I don't know what else it breaks, but for this microbenchmark,
compiling python-2.4.1 without --with-fpectl works swimmingly well
for me.  Not only does it bring the system time way down, but the user
time is down too, to about 5/7 of its previous value:

5.3-RELEASE / without --with-fpectl

   48.78 real  48.22 user  0.15 sys
 23372  maximum resident set size
   657  average shared memory size
 20817  average unshared data size
   128  average unshared stack size
  5402  page reclaims
 0  page faults
 0  swaps
 0  block input operations
 0  block output operations
 0  messages sent
 0  messages received
 0  signals received
 0  voluntary context switches
  4889  involuntary context switches

compared with 

5.3-RELEASE / with --with-fpectl

  106.59 real  67.25 user  38.57 sys
 23140  maximum resident set size
   660  average shared memory size
 20818  average unshared data size
   128  average unshared stack size
  5402  page reclaims
 0  page faults
 0  swaps
 0  block input operations
 0  block output operations
 0  messages sent
 0  messages received
 0  signals received
 0  voluntary context switches
 10678  involuntary context switches

I tentatively second Andrew's proposal that the use of this configure
option in the port needs to be reviewed for 5.x and later, pending
independent confirmation of the efficacy of this fix.

-e


Re: Performance issue

2005-05-09 Thread Scott Long
Ewan Todd wrote:
Hi All,
I have what I think is a serious performance issue with fbsd 5.3
release.  I've read about threading issues, and it seems to me that
that is what I'm looking at, but I'm not confident enough to rule out
that it might be a hardware issue, a kernel configuration issue, or
something to do with the python port.  I'd appreciate it if someone
would point out if I'm overlooking something obvious.  Otherwise,
if it is the problem I think it is, then there seems entirely too
little acknowledgement of a major issue.
Here's the background.  I just got a new (to me) AMD machine and put
5.3 release on it.  I'd been very happy with the way my old Intel
machine had been performing with 4.10 stable, and I decided to run a
simple performance diagnostic on both machines, to wow myself with the
amazing performance of the new hardware / kernel combination.
However, the result was pretty disappointing.
Here are what I think are the pertinent dmesg details.
Old rig:
  FreeBSD 4.10-RELEASE #0: Thu Jul  1 22:47:08 EDT 2004
  Timecounter "i8254"  frequency 1193182 Hz
  Timecounter "TSC"  frequency 449235058 Hz
  CPU: Pentium III/Pentium III Xeon/Celeron (449.24-MHz 686-class CPU)
New rig:
  FreeBSD 5.3-RELEASE #0: Fri Nov  5 04:19:18 UTC 2004
  Timecounter "i8254" frequency 1193182 Hz quality 0
  CPU: AMD Athlon(tm) Processor (995.77-MHz 686-class CPU)
  Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
  Timecounter "TSC" frequency 995767383 Hz quality 800
  Timecounters tick every 10.000 msec
The diagnostic I selected was a python program to generate 1 million
pseudo-random numbers and then to perform a heap sort on them.  That
code is included at the foot of this email.  I named the file
heapsort.py.  I ran it on both machines, using the time utility in
/usr/bin/ (not the builtin tcsh time).  So the command line was
  /usr/bin/time -al -o heapsort.data ./heapsort.py 100
A typical result for the old rig was
  130.78 real   129.86 user 0.11 sys
 22344  maximum resident set size
   608  average shared memory size
 20528  average unshared data size
   128  average unshared stack size
  5360  page reclaims
 0  page faults
 0  swaps
 0  block input operations
 0  block output operations
 0  messages sent
 0  messages received
 0  signals received
 0  voluntary context switches
  2386  involuntary context switches
Whereas, the typical result for the new rig looked more like
  105.36 real  71.10 user  33.41 sys
 23376  maximum resident set size
   659  average shared memory size
 20796  average unshared data size
   127  average unshared stack size
  5402  page reclaims
 0  page faults
 0  swaps
 0  block input operations
 0  block output operations
 0  messages sent
 0  messages received
 0  signals received
 0  voluntary context switches
 10548  involuntary context switches
You'll notice that the new rig is indeed a little faster (times in
seconds): 105.36 real (new rig) compared with 130.78 real (old rig).
However, the new rig spends about 33.41 seconds on system overhead
compared with just 0.11 seconds on the old rig.  Comparing the rusage
stats, the only significant difference is the involuntary context
switches field, where the old rig has 2386 and the new rig has a
whopping 10548.  Further, I noticed that the number of context
switches on the new rig seems to be more or less exactly one per 10
msec of real time, that is, one per timecounter tick.  (I saw this
when comparing heapsort.py runs with arguments other than 100.)
I think the new rig ought to execute this task in about 70 seconds:
just over the amount of user time.  Assuming that I'm not overlooking
something obvious, and that I'm not interpreting a feature as a bug, 
this business with the context switches strikes me as a bit of a
show-stopper.  If that's right, it appears to be severely underplayed
in the release documentation.

I'll be happy if someone would kindly explain to me what's going on
here.  I'll be even happier to hear of a fix or workaround to remedy
the situation.
Thanks in advance,
-e

First of all, make sure that you have WITNESS and INVARIANTS off in your
kernel.  You might also want to recompile your kernel with the SMP 
option turned off.

Scott


Re: Performance issue

2005-05-09 Thread Mike Tancsa
At 11:00 AM 09/05/2005, Ewan Todd wrote:
Here's the background.  I just got a new (to me) AMD machine and put
5.3 release on it.  I'd been very happy with the way my old Intel
There have been quite a few changes since 5.3. If you are starting fresh, I 
would strongly recommend going to 5.4 RC4. There have been a number of 
improvements that might help the problems you are seeing.

---Mike 



Re: Performance issue

2005-05-09 Thread Pete French
 Whereas, the typical result for the new rig looked more like

   105.36 real  71.10 user  33.41 sys
...
  10548  involuntary context switches

Now I just ran this test myself. This machine is a 2.4 gig P4 with
hyperthreading enabled. Much as I am an AMD fan, I would expect a 1gig
Athlon to be significantly slower than a 2.4 gig Pentium 4. 

but:

   93.45 real  56.55 user  36.85 sys
  1857  involuntary context switches

Uhhh... so it takes almost the same time to do the calculation, but spends
actually *more* of it in system space. It does far fewer context switches,
though, but I am assuming that's due to HTT.

Numbers look very odd to me. So I then ran it on another P4 system we have
round here which is still running 4.11. This is a 2.66 gig P4, not a 2.4
so it should be a bit faster, but:

   33.77 real  33.49 user  0.07 sys
   711  involuntary context switches

Over two and a half times faster?!  That's not right at all!

All the new systems I have tried are 5.4-RC4, so should be the
latest and greatest. When my colleague finishes on his machine I
can try a GENERIC 5.4-RC4 kernel on another P4 and see what that
gives.

-pcf.


RE: Performance issue

2005-05-09 Thread Kipp Holger
Same test on a 5.3-STABLE from 31.01.2005:

   81.90 real  77.05 user  3.51 sys
 22908  maximum resident set size
   620  average shared memory size
 20083  average unshared data size
   128  average unshared stack size
  5379  page reclaims
26  page faults
 0  swaps
36  block input operations
 0  block output operations
 0  messages sent
 0  messages received
 0  signals received
62  voluntary context switches
 10623  involuntary context switches

This is a on a slow dual-processor system:

Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel Pentium III (732.98-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0x683  Stepping = 3
  Features=0x387fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,PN,MMX,FXSR,SSE>
real memory  = 2281635840 (2175 MB)
avail memory = 2232012800 (2128 MB)
ACPI APIC Table: PTLTDAPIC  
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1

(if this is of any help). Scheduler is 4BSD.

Regards,
Holger Kipp



Re: Performance issue

2005-05-09 Thread Ewan Todd
 
 Whereas, the typical result for the new rig looked more like
 
   105.36 real  71.10 user  33.41 sys
  ...
  10548  involuntary context switches
 
 
 
 First of all, make sure that you have WITNESS and INVARIANTS off in your
 kernel.  You might also want to recompile your kernel with the SMP 
 option turned off.
 
 Scott

First of all, thanks to Mike Tancsa for suggesting 5.4 RC4 and to Pete
French for running the test independently on the higher spec machines
with 5.4 RC4 on them, confirming the system time thing, ruling out an
AMD problem, dissociating the system time result from the context
switching, and saving me the trouble of rediscovering the same problem
on 5.4 RC4.  

This is my first foray into the public world of FreeBSD discussion
lists, and I am encouraged by the helpfulness of the response.

Scott, the 5.3 kernel I had was essentially a GENERIC release
kernel, with about 100 options commented out.  WITNESS and INVARIANTS
are off by default, which I confirmed by looking through `sysctl -a`.
However, I was curious to see what I would get if I switched them on,
so I added these options and recompiled the kernel:

  options KDB
  options DDB
  options INVARIANTS
  options INVARIANT_SUPPORT
  options WITNESS
  options WITNESS_SKIPSPIN

The result, below, has essentially the same user time (or just less,
if that makes any sense), but tripled system time.  The context
switches are consistent with the one-per-10msec I saw before.  Is
there anything useful I can do while I still have the kernel debug
options on?

-e


  172.29 real  67.53 user  103.07 sys
 23376  maximum resident set size
   659  average shared memory size
 20805  average unshared data size
   127  average unshared stack size
  5402  page reclaims
 0  page faults
 0  swaps
 0  block input operations
 0  block output operations
 0  messages sent
 0  messages received
 0  signals received
 0  voluntary context switches
 17234  involuntary context switches




Re: Performance issue

2005-05-09 Thread Scott Long
Ewan Todd wrote:
Whereas, the typical result for the new rig looked more like
105.36 real  71.10 user  33.41 sys
   ...
   10548  involuntary context switches

First of all, make sure that you have WITNESS and INVARIANTS off in your
kernel.  You might also want to recompile your kernel with the SMP 
option turned off.

Scott

First of all, thanks to Mike Tancsa for suggesting 5.4 RC4 and to Pete
French for running the test independently on the higher spec machines
with 5.4 RC4 on them, confirming the system time thing, ruling out an
AMD problem, dissociating the system time result from the context
switching, and saving me the trouble of rediscovering the same problem
on 5.4 RC4.  

This is my first foray into the public world of FreeBSD discussion
lists, and I am encouraged by the helpfulness of the response.
Scott, the 5.3 kernel I had was essentially a GENERIC release
kernel, with about 100 options commented out.  WITNESS and INVARIANTS
are off by default, which I confirmed by looking through `sysctl -a`.
However, I was curious to see what I would get if I switched them on,
so I added these options and recompiled the kernel:
  options KDB
  options DDB
  options INVARIANTS
  options INVARIANT_SUPPORT
  options WITNESS
  options WITNESS_SKIPSPIN
The result, below, has essentially the same user time (or just less,
if that makes any sense), but tripled system time.  The context
switches are consistent with the one-per-10msec I saw before.  Is
there anything useful I can do while I still have the kernel debug
options on?
-e
  172.29 real  67.53 user  103.07 sys
 23376  maximum resident set size
   659  average shared memory size
 20805  average unshared data size
   127  average unshared stack size
  5402  page reclaims
 0  page faults
 0  swaps
 0  block input operations
 0  block output operations
 0  messages sent
 0  messages received
 0  signals received
 0  voluntary context switches
 17234  involuntary context switches

5.3 ships with SMP turned on, which makes lock operations rather 
expensive on single-processor machines.  4.x does not have SMP
turned on by default.  Would you be able to re-run your test with
SMP turned off?

Scott


Re: Performance issue

2005-05-09 Thread Ewan Todd
 
 5.3 ships with SMP turned on, which makes lock operations rather 
 expensive on single-processor machines.  4.x does not have SMP
 turned on by default.  Would you be able to re-run your test with
 SMP turned off?
 

I'm pretty sure there's no SMP in this kernel.

  #cd /usr/src/sys/i386/conf
  #fgrep SMP MYKERNEL
  #

GENERIC has no SMP in it, but there's a second GENERIC kernel conf
called SMP, which simply says:

  include GENERIC
  options SMP

However, sysctl seems to show smp not active, but not disabled.   Is
that anything to worry about?

  #sysctl -a | grep smp
  kern.smp.maxcpus: 1
  kern.smp.active: 0
  kern.smp.disabled: 0
  kern.smp.cpus: 1
  debug.psmpkterrthresh: 2


-e



Re: Performance issue

2005-05-09 Thread Scott Long
Ewan Todd wrote:
5.3 ships with SMP turned on, which makes lock operations rather 
expensive on single-processor machines.  4.x does not have SMP
turned on by default.  Would you be able to re-run your test with
SMP turned off?


I'm pretty sure there's no SMP in this kernel.
  #cd /usr/src/sys/i386/conf
  #fgrep SMP MYKERNEL
  #
GENERIC has no SMP in it, but there's a second GENERIC kernel conf
called SMP, which simply says:
  include GENERIC
  options SMP
However, sysctl seems to show smp not active, but not disabled.   Is
that anything to worry about?
  #sysctl -a | grep smp
  kern.smp.maxcpus: 1
  kern.smp.active: 0
  kern.smp.disabled: 0
  kern.smp.cpus: 1
  debug.psmpkterrthresh: 2
-e
Bah, you're right, sorry for the confusion.  Too many releases in my
mind, they all seem like a blur.
Scott


Re: Performance issue

2005-05-09 Thread Pete French
 5.3 ships with SMP turned on, which makes lock operations rather 
 expensive on single-processor machines.  4.x does not have SMP
 turned on by default.  Would you be able to re-run your test with
 SMP turned off?

I just ran a test here with SMP turned off on 5.4-RC4 (GENERIC) and got the
following result:

   67.52 real  41.13 user  26.16 sys
  7034  involuntary context switches

i.e. it still has system time as a huge proportion of the total compared
to the 4.11 kernel. Interestingly, after reading Holger Kipp's results
I tried it on a genuine multi-processor box with SMP enabled running 5.3.
He got a very small percentage of the time in sys (3.51 out of 81.90) but
I got:
  255.30 real  160.20 user  88.50 sys

Once again a far higher proportion of the time was spent in sys than you
would expect.

-pcf.


Re: Performance issue

2005-05-09 Thread Alexander S. Usov
Scott Long wrote:
 First of all, make sure that you have WITNESS and INVARIANTS off in your
 kernel.  You might also want to recompile your kernel with the SMP
 option turned off.

I can confirm this.
I just reran it on RELENG_5_4 as of yesterday and got:
  136.52 real  80.29 user  50.16 sys
 23212  maximum resident set size
   674  average shared memory size
 20961  average unshared data size
   128  average unshared stack size
  5419  page reclaims
 0  page faults
 0  swaps
 0  block input operations
 0  block output operations
 0  messages sent
 0  messages received
 0  signals received
 0  voluntary context switches
 25738  involuntary context switches

No debugging or SMP in kernel.

-- 
Best regards,
  Alexander.



Re: Performance issue

2005-05-09 Thread Suleiman Souhlal
Hello,
On May 9, 2005, at 1:31 PM, Pete French wrote:
5.3 ships with SMP turned on, which makes lock operations rather
expensive on single-processor machines.  4.x does not have SMP
turned on by default.  Would you be able to re-run your test with
SMP turned off?

I just ran a test here with SMP turned off on 5.4-RC4 (GENERIC) and got the
following result:

   67.52 real  41.13 user  26.16 sys
  7034  involuntary context switches

i.e. it still has system time as a huge proportion of the total compared
to the 4.11 kernel. Interestingly, after reading Holger Kipp's results
I tried it on a genuine multi-processor box with SMP enabled running 5.3.
He got a very small percentage of the time in sys (3.51 out of 81.90) but
I got:

  255.30 real  160.20 user  88.50 sys

Once again a far higher proportion of the time spent in sys than you
would expect.
I ran ktrace(1) on it, and it appears that python keeps calling
sigprocmask() continually:

   673 python   0.07 CALL  sigprocmask(0x3,0,0x811d11c)
   673 python   0.05 RET   sigprocmask 0
   673 python   0.09 CALL  sigprocmask(0x1,0,0x8113d1c)
   673 python   0.05 RET   sigprocmask 0
etc..

This explains why it's using so much system time. Now the question is:
why is python doing this?

--
Suleiman Souhlal | [EMAIL PROTECTED]
The FreeBSD Project  | [EMAIL PROTECTED]


Re: Performance issue

2005-05-09 Thread Mike Jakubik
On Mon, May 9, 2005 1:06 pm, Scott Long said:

 5.3 ships with SMP turned on, which makes lock operations rather
 expensive on single-processor machines.  4.x does not have SMP turned on by
 default.  Would you be able to re-run your test with SMP turned off?

This is what I get on my system, which has debugging and SMP off in the
kernel.

FreeBSD 6.0-CURRENT #0: Tue May  3 23:55:43 EDT 2005
[EMAIL PROTECTED]:/usr/obj/usr/src/sys/DP
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: AMD Athlon(tm) Processor (1410.21-MHz 686-class CPU)
  Origin = "AuthenticAMD"  Id = 0x644  Stepping = 4
  Features=0x183f9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR>
  AMD Features=0xc044<RSVD,AMIE,DSP,3DNow!>

---
   76.89 real  49.33 user  22.87 sys
 23116  maximum resident set size
   686  average shared memory size
 20795  average unshared data size
   127  average unshared stack size
  5380  page reclaims
 0  page faults
 0  swaps
 0  block input operations
 0  block output operations
 0  messages sent
 0  messages received
 0  signals received
 1  voluntary context switches
 10018  involuntary context switches
---

As we can see, it is still spending a lot of time in system, and there are
a lot of context switches being done.




Re: Performance issue

2005-05-09 Thread Daniel Eischen
On Mon, 9 May 2005, Suleiman Souhlal wrote:
 Hello,


 I ran ktrace(1) on it, and it appears that python keeps calling
 sigprocmask() continually:

 673 python   0.07 CALL  sigprocmask(0x3,0,0x811d11c)
 673 python   0.05 RET   sigprocmask 0
 673 python   0.09 CALL  sigprocmask(0x1,0,0x8113d1c)
 673 python   0.05 RET   sigprocmask 0
 etc..

 This explains why it's using so much system time. Now the question is
 why is python doing this?

I don't know what python's doing, but it might not be calling
sigprocmask directly.  There are a few libc functions that use
sigprocmask:

db/btree/
db/hash/
pselect(),
setmode(),
{sig}setjmp(), {sig}longjmp(),
grantpt(),
system()

to name a few.  Python may also be using other libraries which
use sigprocmask().

-- 
DE



Re: Performance issue

2005-05-09 Thread Jonathan Noack
On 5/9/2005 12:31 PM, Pete French wrote:
5.3 ships with SMP turned on, which makes lock operations rather 
expensive on single-processor machines.  4.x does not have SMP
turned on by default.  Would you be able to re-run your test with
SMP turned off?

I just ran a test here with SMP turned off on 5.4-RC4 (GENERIC) and got the
following result:

   67.52 real  41.13 user  26.16 sys
  7034  involuntary context switches

i.e. it still has system time as a huge proportion of the total compared
to the 4.11 kernel. Interestingly, after reading Holger Kipp's results
I tried it on a genuine multi-processor box with SMP enabled running 5.3.
He got a very small percentage of the time in sys (3.51 out of 81.90) but
I got:

  255.30 real  160.20 user  88.50 sys

Once again a far higher proportion of the time spent in sys than you would
expect.
I ran into a similar issue when attempting to thread a card game solver
program I wrote.  Performance in early versions was horrific and I
noticed tons of context switches.  I resolved the issue by allocating
pools of memory beforehand.  This seems to point the finger at malloc
and context switch overhead.

In any case, I believe this is related to threading.  Check your results 
with libthr instead.  The following are on my 2.53 GHz P4 which is 
running CURRENT from last night (with INVARIANTS on).

libpthread:
$ /usr/bin/time -al ./heapsort.py 100
0.928555
  124.04 real  65.71 user  48.47 sys
 23464  maximum resident set size
   680  average shared memory size
 21104  average unshared data size
   129  average unshared stack size
  5400  page reclaims
 0  page faults
 0  swaps
15  block input operations
 0  block output operations
 4  messages sent
 0  messages received
 0  signals received
21  voluntary context switches
 40274  involuntary context switches
libthr:
$ /usr/bin/time -al ./heapsort.py 100
0.928555
   79.75 real  50.63 user  25.34 sys
 23348  maximum resident set size
   679  average shared memory size
 21041  average unshared data size
   129  average unshared stack size
  5394  page reclaims
 1  page faults
 0  swaps
 2  block input operations
 0  block output operations
 3  messages sent
 0  messages received
 0  signals received
 7  voluntary context switches
 26113  involuntary context switches
--
Jonathan Noack | [EMAIL PROTECTED] | OpenPGP: 0x991D8195




Re: Performance issue

2005-05-09 Thread Peter Jeremy
On Mon, 2005-May-09 11:00:18 -0400, Ewan Todd wrote:
I have what I think is a serious performance issue with fbsd 5.3
release.  I've read about threading issues, and it seems to me that
that is what I'm looking at, but I'm not confident enough to rule out
that it might be a hardware issue, a kernel configuration issue, or
something to do with the python port.

There does appear to be a problem in FreeBSD.  Python is built with
threading enabled by default, the threading libraries play with the
signal mask and there have been extensive changes there.  My
suggestions on things you could check are:
1) Rebuild python with threading disabled (add '-DWITHOUT_THREADS' to the
   'make' command line) and see if that makes any difference
2) Re-write the sample program in a non-threaded language - e.g. C or Perl -
   and see if the high system time goes away.

Unfortunately, I can't think of a solution at present.

-- 
Peter Jeremy


Re: Performance issue

2005-05-09 Thread Daniel Eischen
On Tue, 10 May 2005, Peter Jeremy wrote:

 On Mon, 2005-May-09 11:00:18 -0400, Ewan Todd wrote:
 I have what I think is a serious performance issue with fbsd 5.3
 release.  I've read about threading issues, and it seems to me that
 that is what I'm looking at, but I'm not confident enough to rule out
 that it might be a hardware issue, a kernel configuration issue, or
 something to do with the python port.

 There does appear to be a problem in FreeBSD.  Python is built with
 threading enabled by default, the threading libraries play with the
 signal mask and there have been extensive changes there.  My

The threading libraries don't play with the signal mask.  In fact,
libpthread has userland versions of sigprocmask() et. al. and won't
even make the syscall() unless the threads are system scope.  There
is a special thread in libpthread that handles signals which does
use the system sigprocmask(), but unless the application is
making heavy use of signals in general, it shouldn't matter.

 suggestions on things you could check are:
 1) Rebuild python with threading disabled (add '-DWITHOUT_THREADS' to the
'make' command line and see if that makes any difference
 2) Re-write the sample program in a non-threaded language - eg C or perl
and see if the high system time goes away.

 Unfortunately, I can't think of a solution at present.

You can also set LIBPTHREAD_SYSTEM_SCOPE in the environment to
force libpthread to use system scope threads.  It uses process
scope threads by default.

-- 
DE



Re: Performance issue

2005-05-09 Thread Matthias Buelow
Daniel Eischen wrote:

   {sig}setjmp(), {sig}longjmp(),

A very wild guess.. python is using setjmp/longjmp to implement
continuations, tailcalls, or any mechanism similar to that and using
that in a loop?

mkb.


Re: Performance issue

2005-05-09 Thread Jonathan Noack
On 5/9/2005 1:30 PM, Jonathan Noack wrote:
On 5/9/2005 12:31 PM, Pete French wrote:
5.3 ships with SMP turned on, which makes lock operations rather 
expensive on single-processor machines.  4.x does not have SMP
turned on by default.  Would you be able to re-run your test with
SMP turned off?
I just ran a test here with SMP turned off on 5.4-RC4 (GENERIC) and got the
following result:
   67.52 real41.13 user26.16 sys
  7034  involuntary context switches
i.e. it still has system time as a huge proportion of the total compared
to the 4.11 kernel. Interestingly, after reading Holger Kipp's results
I tried it on a genuine multi-processor box with SMP enabled running 5.3.
He got a very small percentage of the time in sys (3.51 out of 81.90) but
I got:
  255.30 real   160.20 user88.50 sys
Once again a far higher proportion of the time spent in sys than you would
expect.
I ran into a similar issue when attempting to thread a card game solver 
program I wrote.  Performance in early versions was horrific and I 
noticed tons of context switches.  I resolved the issue by allocating 
pools of memory beforehand.  This seems to point the finger to malloc 
and context switch overhead.

In any case, I believe this is related to threading.  Check your results 
with libthr instead.  The following are on my 2.53 GHz P4 which is 
running CURRENT from last night (with INVARIANTS on).

libpthread:
$ /usr/bin/time -al ./heapsort.py 100
0.928555
  124.04 real65.71 user48.47 sys
 23464  maximum resident set size
   680  average shared memory size
 21104  average unshared data size
   129  average unshared stack size
  5400  page reclaims
 0  page faults
 0  swaps
15  block input operations
 0  block output operations
 4  messages sent
 0  messages received
 0  signals received
21  voluntary context switches
 40274  involuntary context switches
libthr:
$ /usr/bin/time -al ./heapsort.py 100
0.928555
   79.75 real50.63 user25.34 sys
 23348  maximum resident set size
   679  average shared memory size
 21041  average unshared data size
   129  average unshared stack size
  5394  page reclaims
 1  page faults
 0  swaps
 2  block input operations
 0  block output operations
 3  messages sent
 0  messages received
 0  signals received
 7  voluntary context switches
 26113  involuntary context switches
Oooh... same machine with libc_r:
$ /usr/bin/time -al ./heapsort.py 100
0.928555
   38.72 real36.85 user 0.06 sys
 23496  maximum resident set size
   678  average shared memory size
 21126  average unshared data size
   129  average unshared stack size
  5418  page reclaims
 2  page faults
 0  swaps
 2  block input operations
 0  block output operations
 3  messages sent
 0  messages received
 0  signals received
 8  voluntary context switches
 13137  involuntary context switches
--
Jonathan Noack | [EMAIL PROTECTED] | OpenPGP: 0x991D8195




Re: Performance issue

2005-05-09 Thread Suleiman Souhlal
Hello,
On May 9, 2005, at 3:54 PM, Daniel Eischen wrote:
On Tue, 10 May 2005, Peter Jeremy wrote:

On Mon, 2005-May-09 11:00:18 -0400, Ewan Todd wrote:
I have what I think is a serious performance issue with fbsd 5.3
release.  I've read about threading issues, and it seems to me that
that is what I'm looking at, but I'm not confident enough to rule out
that it might be a hardware issue, a kernel configuration issue, or
something to do with the python port.

There does appear to be a problem in FreeBSD.  Python is built with
threading enabled by default, the threading libraries play with the
signal mask and there have been extensive changes there.  My
The threading libraries don't play with the signal mask.  In fact,
libpthread has userland versions of sigprocmask() et. al. and won't
even make the syscall() unless the threads are system scope.  There
is a special thread in libpthread that handles signals which does
use the system sigprocmask(), but unless the application is
making heavy use of signals in general, it shouldn't matter.
I think I've found the problem: Python uses setjmp/longjmp to protect
against SIGFPE every time it does floating point operations. The
python script does not actually use threads, and libpthread assumes
non-threaded processes are system scope. So, it would end up using
the sigprocmask syscall, even though it doesn't really need to.
The diff at http://people.freebsd.org/~ssouhlal/testing/thr_sigmask-20050509.diff
fixes this, by making sure the process is threaded, before using the syscall.

--
Suleiman Souhlal | [EMAIL PROTECTED]
The FreeBSD Project  | [EMAIL PROTECTED]


Re: Performance issue

2005-05-09 Thread Daniel Eischen
On Mon, 9 May 2005, Suleiman Souhlal wrote:
 On May 9, 2005, at 3:54 PM, Daniel Eischen wrote:

  The threading libraries don't play with the signal mask.  In fact,
  libpthread has userland versions of sigprocmask() et. al. and won't
  even make the syscall() unless the threads are system scope.  There
  is a special thread in libpthread that handles signals which does
  use the system sigprocmask(), but unless the application is
  making heavy use of signals in general, it shouldn't matter.

 I think I've found the problem: Python uses setjmp/longjmp to protect
 against SIGFPE every time it does floating point operations. The
 python script does not actually use threads, and libpthread assumes
 non-threaded processes are system scope. So, it would end up using
 the sigprocmask syscall, even though it doesn't really need to.
 The diff at http://people.freebsd.org/~ssouhlal/testing/thr_sigmask-20050509.diff
 fixes this, by making sure the process is threaded, before using the syscall.

I don't think that patch is correct.  You need the signal mask
in the kernel to match in case of an exec() after a fork()
for instance.  If the application fork()'s, then changes the
signal mask in the child (which is now single threaded), then
the child exec()s, the mask is wrong.

If the process wasn't linked to libpthread, then the longjmp()
and setjmp() would still be calling the syscall, so it isn't
the syscall itself that is making things slower.  You'll notice
that there are two calls to __sys_sigprocmask() in the section
of code you have patched.  You could eliminate the second call
if you do some of what the remainder of the function does instead
of returning early (the locks aren't needed and pending signals
don't need to be run down).

-- 
DE



Re: Performance issue

2005-05-09 Thread Daniel Eischen
On Mon, 9 May 2005, Daniel Eischen wrote:

 On Mon, 9 May 2005, Suleiman Souhlal wrote:
  I think I've found the problem: Python uses setjmp/longjmp to protect
  against SIGFPE every time it does floating point operations. The
  python script does not actually use threads, and libpthread assumes
  non-threaded processes are system scope. So, it would end up using
  the sigprocmask syscall, even though it doesn't really need to.
  The diff at http://people.freebsd.org/~ssouhlal/testing/thr_sigmask-20050509.diff
  fixes this, by making sure the process is threaded, before using the syscall.

[ ... ]

 If the process wasn't linked to libpthread, then the longjmp()
 and setjmp() would still be calling the syscall, so it isn't
 the syscall itself that is making things slower.  You'll notice
 that there are two calls to __sys_sigprocmask() in the section
 of code you have patched.  You could eliminate the second call
 if you do some of what the remainder of the function does instead
 of returning early (the locks aren't needed and pending signals
 don't need to be run down).

As in something like this:

  http://people.freebsd.org/~deischen/kse/thr_sigmask.c.diffs

It has not been tested.

-- 
DE



Re: Performance issue

2005-05-09 Thread Jonathan Noack
On 05/09/05 18:47, Daniel Eischen wrote:
On Mon, 9 May 2005, Daniel Eischen wrote:
On Mon, 9 May 2005, Suleiman Souhlal wrote:
I think I've found the problem: Python uses setjmp/longjmp to protect
against SIGFPE every time it does floating point operations. The
python script does not actually use threads, and libpthread assumes
non-threaded processes are system scope. So, it would end up using
the sigprocmask syscall, even though it doesn't really need to.
The diff at http://people.freebsd.org/~ssouhlal/testing/thr_sigmask-20050509.diff
fixes this, by making sure the process is threaded, before using the syscall.
[ ... ]
If the process wasn't linked to libpthread, then the longjmp()
and setjmp() would still be calling the syscall, so it isn't
the syscall itself that is making things slower.  You'll notice
that there are two calls to __sys_sigprocmask() in the section
of code you have patched.  You could eliminate the second call
if you do some of what the remainder of the function does instead
of returning early (the locks aren't needed and pending signals
don't need to be run down).
As in something like this:
  http://people.freebsd.org/~deischen/kse/thr_sigmask.c.diffs
It has not been tested.
When I tried to test this every threaded program died with sig 11.  Does 
this require me to recompile the program before it will work?

--
Jonathan Noack | [EMAIL PROTECTED] | OpenPGP: 0x991D8195




Re: Performance issue

2005-05-09 Thread Suleiman Souhlal
Hello,
On May 9, 2005, at 7:21 PM, Daniel Eischen wrote:
I don't think that patch is correct.  You need the signal mask
in the kernel to match in case of an exec() after a fork()
for instance.  If the application fork()'s, then changes the
signal mask in the child (which is now single threaded), then
the child exec()s, the mask is wrong.
If the process wasn't linked to libpthread, then the longjmp()
and setjmp() would still be calling the syscall, so it isn't
the syscall itself that is making things slower.  You'll notice
that there are two calls to __sys_sigprocmask() in the section
of code you have patched.  You could eliminate the second call
if you do some of what the remainder of the function does instead
of returning early (the locks aren't needed and pending signals
don't need to be run down).
Processes linked with libc_r NEVER call the syscall, once they have  
started (after rtld-elf):

zZzZ:~/py% LD_LIBMAP=libpthread.so.1=libc_r.so.5 ktrace -t c python heapsort.py 1 > /dev/null && kdump -T | grep sigprocmask
  2991 python   1115698354.240301 CALL  sigprocmask(0x1,0x2810a820,0xbfbfea60)
  2991 python   1115698354.240304 RET   sigprocmask 0
  2991 python   1115698354.240307 CALL  sigprocmask(0x3,0x2810a830,0)
  2991 python   1115698354.240308 RET   sigprocmask 0
zZzZ:~/py%

compare with libpthread:
zZzZ:~/py% ktrace -t c python heapsort.py 1 > /dev/null && kdump -T | grep -c sigprocmask
92114
zZzZ:~/py%

Is this a bug in libc_r?
--
Suleiman Souhlal | [EMAIL PROTECTED]
The FreeBSD Project  | [EMAIL PROTECTED]


Re: Performance issue

2005-05-09 Thread Daniel Eischen
On Mon, 9 May 2005, Jonathan Noack wrote:

 On 05/09/05 18:47, Daniel Eischen wrote:
 If the process wasn't linked to libpthread, then the longjmp()
 and setjmp() would still be calling the syscall, so it isn't
 the syscall itself that is making things slower.  You'll notice
 that there are two calls to __sys_sigprocmask() in the section
 of code you have patched.  You could eliminate the second call
 if you do some of what the remainder of the function does instead
 of returning early (the locks aren't needed and pending signals
 don't need to be run down).
 
  As in something like this:
 
http://people.freebsd.org/~deischen/kse/thr_sigmask.c.diffs
 
  It has not been tested.

 When I tried to test this every threaded program died with sig 11.  Does
 this require me to recompile the program before it will work?

No, the patch just must have a bug in it.

-- 
DE



Re: Performance issue

2005-05-09 Thread Daniel Eischen
On Tue, 10 May 2005, Suleiman Souhlal wrote:

 Hello,

 On May 9, 2005, at 7:21 PM, Daniel Eischen wrote:

  I don't think that patch is correct.  You need the signal mask
  in the kernel to match in case of an exec() after a fork()
  for instance.  If the application fork()'s, then changes the
  signal mask in the child (which is now single threaded), then
  the child exec()s, the mask is wrong.
 
  If the process wasn't linked to libpthread, then the longjmp()
^^
or any thread library.

  and setjmp() would still be calling the syscall, so it isn't
  the syscall itself that is making things slower.  You'll notice
  that there are two calls to __sys_sigprocmask() in the section
  of code you have patched.  You could eliminate the second call
  if you do some of what the remainder of the function does instead
  of returning early (the locks aren't needed and pending signals
  don't need to be run down).

 Processes linked with libc_r NEVER call the syscall, once they have
 started (after rtld-elf):

[...]

 Is this a bug in libc_r?

No, libc_r wraps execve() and a lot of other syscalls that libpthread
or libthr don't need to.  Take a look at libc_r/uthread/uthread_execve.c
and you will see it sets the signal mask before exec()ing.

-- 
DE
