Re: Remaining CPU bottlenecks in 2.0, revisited

2001-09-29 Thread Brian Pane

Justin Erenkrantz wrote:

On Fri, Sep 28, 2001 at 04:57:05PM -0700, Brian Pane wrote:

The notes following the profile offer my interpretation of
the numbers.

 rank  function       % of total   cumulative %
                      CPU time     of total CPU time
 ----  ------------   ----------   -----------------
   1.  munmap            18.16          18.16
   2.  _writev           14.00          32.16
   3.  mmap              11.96          44.12
   4.  _so_accept         5.71          49.83
   5.  bndm               4.82          54.65
   6.  _close             4.76          59.41
   7.  _so_shutdown       4.16          63.57
   8.  __open             3.18          66.75
   9.  _write             2.61          69.36
  10.  memset             2.38          71.74


Most of these look related to OS things that we can't control.


Yes.  Or, to put it another way, the implementation efficiency is
approaching its theoretical optimum for this architecture.  That's
not a bad thing. :-)

As you said, bndm is ~4x better than its predecessor.  Unless
we were to precompute mod_include files or rip out its parsing
algorithm and replace it with something better (which I think is
possible), I don't think there is much we can do here.  


Agreed, there probably isn't much more that can be done.  I think
2.0's SSI parser is very close to optimal now.

I'm surprised _so_accept, _close, and _so_shutdown are up there
in CPU usage.  I also know that gstein has gone on record in his
advogato diary that mmap may not be the performance win we
think it may be.  It may be worth trying to compile without
mmap and seeing how we perform.  Also, I'm guessing this is on
Solaris 8?  Any chance we could get the sendfilev patch on 
there?  (I know it is part of 7/01 and Sol9 as well.)


I'll see if we can get some more comparative numbers this week...

Out of curiosity, how is the performance relative to 1.3
for similar loads or URLs?  -- justin


I just ran a quick comparison, using ab with concurrency==1
over the loopback on uniprocessor Linux (so these results show
speed, but not scalability).
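
For reference, the ab invocation was along these lines (the file name
and request count here are illustrative, not the exact ones used):

    ab -n 10000 -c 1 http://127.0.0.1/test-50k.html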

  50KB non-SSI HTML file:      1141 requests/s with 2.0.26-dev
                                948 requests/s with 1.3.20

  SHTML file with 3 includes:   411 requests/s with 2.0.26-dev
  (total size 50KB)             200 requests/s with 1.3.20

  0KB non-SSI HTML file:       2764 requests/s with 2.0.26-dev
                               2937 requests/s with 1.3.20

My interpretation of these results is that 2.0's basic request
handling is a bit less efficient than 1.3's, as shown by the
0KB test, but its use of sendfile for non-SSI requests and
much-improved mod_include code for SSI requests give it an
advantage in the general case.

--Brian





Re: remaining CPU bottlenecks in 2.0

2001-09-06 Thread Aaron Bannert

On Wed, Sep 05, 2001 at 08:40:24PM -0700, Justin Erenkrantz wrote:
 On Wed, Sep 05, 2001 at 08:34:30PM -0700, Ian Holsman wrote:
  I got 1 more question about the solaris implementation
  of the Threaded/Worker MPM.
  
  
  should we be calling setconcurrency on startup?
  I know solaris figures it out along the way, but a bit of gentle
  prodding never hurt.
 
 Aaron and I went back and forth on this.  He was of the mind that
 the user scheduler should figure it out after a period of time
 (and that we shouldn't muck with it).

Actually, I seem to recall suggesting that we add something like
apr_thread_setconcurrency() that attempts to do the right thing if the
platform supports it (and if it doesn't, it's probably no big deal),
and is otherwise just a noop. This may help solaris adapt quicker to a
fast ramp-up of userspace-to-LWP thread mapping, especially when there
are many compute-bound threads that are initially unwilling to yield. I
have a feeling this may be more important on SMP boxes, where threads
should be migrated to become bound threads as soon as possible.
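
A minimal sketch of what I have in mind (apr_thread_setconcurrency and
the HAVE_* feature macro are hypothetical -- nothing like this exists
in APR today):

    #include <pthread.h>
    #include "apr_errno.h"

    /* Hint the threads library about how many LWPs we expect to need.
     * Where an M:N implementation exposes pthread_setconcurrency()
     * (e.g. Solaris), pass the hint through; everywhere else this is
     * a harmless no-op.
     */
    apr_status_t apr_thread_setconcurrency(int new_level)
    {
    #ifdef HAVE_PTHREAD_SETCONCURRENCY
        int rv = pthread_setconcurrency(new_level);
        return (rv == 0) ? APR_SUCCESS : APR_FROM_OS_ERROR(rv);
    #else
        (void)new_level;  /* no user-space scheduler to prod */
        return APR_SUCCESS;
    #endif
    }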


 The badness happens in testthread.c - Solaris executes that code
 in serial since the user scheduler never has a chance to 
 interrupt and balance the threads.  But, in the real code, 
 Solaris balances it since we hit syscalls and mutexes quite often.
 
 An equivalent to calling pthread_setconcurrency is to use
 /usr/lib/lwp/libthread.so instead of /usr/lib/libthread.so.

ISTR that this will cause every thread to become a bound thread, effectively
eliminating the [theoretical] benefits of a partial userspace thread impl.

-aaron



Re: remaining CPU bottlenecks in 2.0

2001-09-06 Thread William A. Rowe, Jr.

From: dean gaudet [EMAIL PROTECTED]
Sent: Wednesday, September 05, 2001 9:17 PM


 On Tue, 4 Sep 2001, Brian Pane wrote:
 
  * memset() is called mostly from apr_pcalloc(), which in turn is
used in too many places to yield any easy optimization opportunities.
 
 sometimes folks are lazy and ask for zeroed memory out of habit, when they
 could easily deal with garbage at less cost.

True.  Or folks are being pedantic and summarizing a dozen NULL assignments
immediately after an alloc - I'd suggest that group move into a commented
block just for clarity, and that we skip initializing long lists of NULL values.
Use pcalloc where the majority of the allocation is zeroed, and palloc everywhere
else.
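
Concretely, something like this (the struct and field names are made
up for illustration):

    /* Most fields are assigned immediately, so palloc is cheaper than
     * pcalloc here.  The few fields that genuinely need to start out
     * NULL are grouped and commented rather than zero-filled en masse.
     */
    foo_rec *f = apr_palloc(p, sizeof(*f));
    f->pool  = p;
    f->name  = name;
    f->flags = flags;
    /* deferred fields: filled in later, must start NULL */
    f->next  = NULL;
    f->data  = NULL;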




mod_include.c WAS: RE: remaining CPU bottlenecks in 2.0

2001-09-05 Thread Sander Striker

 * find_start_sequence() is the main scanning function within
   mod_include.  There's some research in progress to try to speed
   this up significantly.
 
 Based on the patches you submitted (and my quasi-errant formatting
 patch), I had to read most of the code in mod_include, so I'm more 
 familiar with mod_include now.  I do think there are some obvious 
 ways to optimize find_start_sequence.  I wonder if we could apply 
 a KMP string-matching algorithm here.  I dunno.  I'll take a look 
 at it though.  Something bugs me about the restarts.  I bet that 
 we spend even more time in find_start_sequence when an HTML file 
 has lots of comments.  =-)

I suggested to Ian last night that I'd look into a Boyer-Moore
matching algorithm.  I'll work on that too.

Sander




Re: remaining CPU bottlenecks in 2.0

2001-09-05 Thread Greg Stein

On Tue, Sep 04, 2001 at 11:56:48PM -0500, William A. Rowe, Jr. wrote:
...
 You were discussing the possibility of parsing for "<!--#" as a skip by 5.
 
 Consider jumping to a 4 byte alignment, truncating to char and skipping by
 dwords.  E.g., you only have to test for three values, not four, and you
 can use the machine's most optimal path.  But I'd ask: if strstr() is
 optimized on the platform, why are we reinventing it?

strstr() can't find strings that span a bucket boundary.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/



Re: remaining CPU bottlenecks in 2.0

2001-09-05 Thread Brian Pane

Greg Stein wrote:

On Tue, Sep 04, 2001 at 11:56:48PM -0500, William A. Rowe, Jr. wrote:

...
You were discussing the possibility of parsing for "<!--#" as a skip by 5.

Consider jumping to a 4 byte alignment, truncating to char and skipping by
dwords.  E.g., you only have to test for three values, not four, and you
can use the machine's most optimal path.  But I'd ask: if strstr() is
optimized on the platform, why are we reinventing it?


strstr() can't find strings that span a bucket boundary.

In addition, the most promising search algorithms for this problem
require a lookup table that contains metadata about the search pattern.
We can save time by precomputing the lookup table for "<!--#" at server
startup (or even at compile time), which is something that strstr()
can't do.
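
As a sketch of the idea, a Boyer-Moore-Horspool-style scanner with the
table built once at startup might look like this (an illustration of
the technique, not the bndm code actually in the tree):

    #include <limits.h>
    #include <string.h>

    #define SSI_PATTERN "<!--#"
    #define SSI_PATLEN  5

    static unsigned char ssi_skip[UCHAR_MAX + 1];

    /* Horspool bad-character table, built once: for each possible byte,
     * how far the window may shift when that byte ends the window.
     */
    static void build_skip_table(void)
    {
        int i;
        memset(ssi_skip, SSI_PATLEN, sizeof(ssi_skip));
        for (i = 0; i < SSI_PATLEN - 1; i++) {
            ssi_skip[(unsigned char)SSI_PATTERN[i]] = SSI_PATLEN - 1 - i;
        }
    }

    /* Return the offset of "<!--#" in buf, or -1 if absent. */
    static long find_ssi_start(const char *buf, long len)
    {
        long i = SSI_PATLEN - 1;
        while (i < len) {
            if (memcmp(buf + i - (SSI_PATLEN - 1),
                       SSI_PATTERN, SSI_PATLEN) == 0) {
                return i - (SSI_PATLEN - 1);
            }
            i += ssi_skip[(unsigned char)buf[i]];
        }
        return -1;
    }

Note this does nothing about the bucket-boundary problem; the real
scanner still has to carry partial-match state across buckets.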

--Brian






Re: remaining CPU bottlenecks in 2.0

2001-09-05 Thread Brian Pane

Justin Erenkrantz wrote:

[...]

* The discussion here covers only CPU utilization.  There are other
  aspects of performance, like multiprocessor scalability, that
  are independent of this data.


Once we get the syscalls optimized (I'm reminded of Dean's attack
on our number of syscalls in 1.3 - I believe he went through syscall
by syscall trying to eliminate all of the unnecessary ones), I think 
the next performance point will be MP scalability (see below for
lock scalability on solaris).  But we do need to see what we can
do about optimizing the syscalls...

The syscall profile for 2.0 looks close to optimal (as long as you
disable .htaccess files and symlink checking).  We're doing a few
too many gettimeofday calls per request, though; I'll investigate
whether any of these can be optimized away.  (Fortunately, gettimeofday
is cheap compared to time(2).)

[...]

* _lwp_mutex_unlock() gets called from pthread_mutex_unlock(),
  but only from a small fraction of pthread_mutex_unlock calls
  (Can someone familiar with Solaris threading internals explain
  this one?)


The LWP scheduler may also call _lwp_mutex_unlock() implicitly -
the LWP scheduler is a user-space library, so it gets thrown in
with our numbers, I bet.

Here's some background on Solaris's implementation that I
think may provide some useful information as to how the locks 
will perform overall.  (If you spot any inconsistencies, it is
probably my fault...I'm going to try to explain this as best as
I can...)

First off, Solaris has adaptive locks.  If the owner of
the lock is currently active, the waiter will spin.  If the system sees
that the owner of the held lock is not currently active, the waiter will
sleep (they call this an adaptive lock - it now enters a turnstile).

Ah, I guess that explains why only a small fraction of pthread_mutex_lock
calls on Solaris seem to result in calls to lwp_mutex_lock: in the fast
case where the lock is available, it just stays in user-mode code?

--Brian





Re: remaining CPU bottlenecks in 2.0

2001-09-05 Thread Justin Erenkrantz

On Wed, Sep 05, 2001 at 08:19:50PM -0700, Brian Pane wrote:
 Ah, I guess that explains why only a small fraction of pthread_mutex_lock
 calls on Solaris seem to result in calls to lwp_mutex_lock: in the fast
 case where the lock is available, it just stays in user-mode code?

Yes.  -- justin




Re: remaining CPU bottlenecks in 2.0

2001-09-05 Thread Ian Holsman

I got 1 more question about the solaris implementation
of the Threaded/Worker MPM.


should we be calling setconcurrency on startup?
I know solaris figures it out along the way, but a bit of gentle
prodding never hurt.

..Ian
Brian Pane wrote:

 Justin Erenkrantz wrote:
 
 [...]
 
 * The discussion here covers only CPU utilization.  There are other
  aspects of performance, like multiprocessor scalability, that
  are independent of this data.


 Once we get the syscalls optimized (I'm reminded of Dean's attack
 on our number of syscalls in 1.3 - I believe he went through syscall
 by syscall trying to eliminate all of the unnecessary ones), I think 
 the next performance point will be MP scalability (see below for
 lock scalability on solaris).  But we do need to see what we can do
 about optimizing the syscalls...

 The syscall profile for 2.0 looks close to optimal (as long as you
 disable .htaccess files and symlink checking).  We're doing a few
 too many gettimeofday calls per request, though; I'll investigate
 whether any of these can be optimized away.  (Fortunately, gettimeofday
 is cheap compared to time(2).)
 
 [...]
 
 * _lwp_mutex_unlock() gets called from pthread_mutex_unlock(),
  but only from a small fraction of pthread_mutex_unlock calls
  (Can someone familiar with Solaris threading internals explain
  this one?)


 The LWP scheduler may also call _lwp_mutex_unlock() implicitly -
 the LWP scheduler is a user-space library, so it gets thrown in
 with our numbers, I bet.

 Here's some background on Solaris's implementation that I
 think may provide some useful information as to how the locks will 
 perform overall.  (If you spot any inconsistencies, it is
 probably my fault...I'm going to try to explain this as best as
 I can...)

 First off, Solaris has adaptive locks.  If the owner of the
 lock is currently active, the waiter will spin.  If the system sees
 that the owner of the held lock is not currently active, the waiter will
 sleep (they call this an adaptive lock - it now enters a turnstile).

 Ah, I guess that explains why only a small fraction of pthread_mutex_lock
 calls on Solaris seem to result in calls to lwp_mutex_lock: in the fast
 case where the lock is available, it just stays in user-mode code?
 
 --Brian
 






Re: remaining CPU bottlenecks in 2.0

2001-09-05 Thread Justin Erenkrantz

On Wed, Sep 05, 2001 at 08:34:30PM -0700, Ian Holsman wrote:
 I got 1 more question about the solaris implementation
 of the Threaded/Worker MPM.
 
 
 should we be calling setconcurrency on startup?
 I know solaris figures it out along the way, but a bit of gentle
 prodding never hurt.

Aaron and I went back and forth on this.  He was of the mind that
the user scheduler should figure it out after a period of time
(and that we shouldn't muck with it).

The badness happens in testthread.c - Solaris executes that code
in serial since the user scheduler never has a chance to 
interrupt and balance the threads.  But, in the real code, 
Solaris balances it since we hit syscalls and mutexes quite often.

An equivalent to calling pthread_setconcurrency is to use
/usr/lib/lwp/libthread.so instead of /usr/lib/libthread.so.
-- justin




Re: remaining CPU bottlenecks in 2.0

2001-09-05 Thread dean gaudet

On Tue, 4 Sep 2001, Brian Pane wrote:

 * Collectively, stat and open comprise 5% of the total CPU time.
   It would be faster to do open+fstat rather than stat+open (as
   long as the server is delivering mostly 200s rather than 304s),
   but that might be too radical a change.  Anybody have thoughts
   on this?

linux' dcache makes either method about the same cost, but on other
kernels, and especially when NFS is involved (because of how NFS path
resolution occurs), the open/fstat is much better.  the trick would be to
cache open filehandles so that fstat can be used.
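
for illustration, the proposed ordering looks something like this (a
sketch; error handling kept minimal):

    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* stat+open resolves the path twice; open+fstat resolves it once
     * and can't race against a rename in between.  the trade-off: a
     * request that ends in a 304 still pays for the open().
     */
    static int open_and_stat(const char *fname, struct stat *st)
    {
        int fd = open(fname, O_RDONLY);
        if (fd < 0) {
            return -1;
        }
        if (fstat(fd, st) < 0) {
            close(fd);
            return -1;
        }
        return fd;  /* caller serves from fd, then closes it */
    }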

 * memset() is called mostly from apr_pcalloc(), which in turn is
   used in too many places to yield any easy optimization opportunities.

sometimes folks are lazy and ask for zeroed memory out of habit, when they
could easily deal with garbage at less cost.

-dean




Re: remaining CPU bottlenecks in 2.0

2001-09-05 Thread Brian Pane

dean gaudet wrote:

[...]

* memset() is called mostly from apr_pcalloc(), which in turn is
  used in too many places to yield any easy optimization opportunities.


sometimes folks are lazy and ask for zeroed memory out of habit, when they
could easily deal with garbage at less cost.

Some good news here: I dug through the profile data and found that over half
of the httpd's calls to apr_pcalloc are from one function, 'find_entry' in the
apr_hash_set implementation.  It uses apr_pcalloc to allocate a struct with
5 fields, and then it immediately overwrites 4 of those fields.  I'll submit
a patch to the APR dev list to fix this.
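
The fix would look roughly like this (field names from memory of
apr_hash.c, so treat them as illustrative):

    /* was:  he = apr_pcalloc(ht->pool, sizeof(*he));
     * pcalloc zero-filled all five fields, but four are overwritten
     * immediately below; only 'next' actually needed the zero.
     */
    he = apr_palloc(ht->pool, sizeof(*he));
    he->next = NULL;   /* the one field that relied on the zero-fill */
    he->hash = hash;
    he->key  = key;
    he->klen = klen;
    he->val  = val;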

--Brian





Re: remaining CPU bottlenecks in 2.0

2001-09-04 Thread Justin Erenkrantz

On Tue, Sep 04, 2001 at 08:00:35PM -0700, Brian Pane wrote:
 I'm currently studying profiling data from an httpd built from
 a CVS snapshot earlier today.
 
 In general, the performance of 2.0 is starting to look good.

Cool.  This probably means the code is starting to look good, too.

 * The discussion here covers only CPU utilization.  There are other
   aspects of performance, like multiprocessor scalability, that
   are independent of this data.

Once we get the syscalls optimized (I'm reminded of Dean's attack
on our number of syscalls in 1.3 - I believe he went through syscall
by syscall trying to eliminate all of the unnecessary ones), I think 
the next performance point will be MP scalability (see below for
lock scalability on solaris).  But we do need to see what we can
do about optimizing the syscalls...

 * find_start_sequence() is the main scanning function within
   mod_include.  There's some research in progress to try to speed
   this up significantly.

Based on the patches you submitted (and my quasi-errant formatting
patch), I had to read most of the code in mod_include, so I'm more 
familiar with mod_include now.  I do think there are some obvious 
ways to optimize find_start_sequence.  I wonder if we could apply 
a KMP string-matching algorithm here.  I dunno.  I'll take a look 
at it though.  Something bugs me about the restarts.  I bet that 
we spend even more time in find_start_sequence when an HTML file 
has lots of comments.  =-)

 * strlen() is called from lots of places throughout the code, with
   the most frequent calls being from apr_pstrdup, apr_pstrcat, and
   time-formatting functions used in apr_rfc822_date.

I think someone has brought up that apr_pstrdup does an extra strlen.
I'll have to review that code.

 * _lwp_mutex_unlock() gets called from pthread_mutex_unlock(),
   but only from a small fraction of pthread_mutex_unlock calls
   (Can someone familiar with Solaris threading internals explain
   this one?)

The LWP scheduler may also call _lwp_mutex_unlock() implicitly -
the LWP scheduler is a user-space library, so it gets thrown in
with our numbers, I bet.

Here's some background on Solaris's implementation that I
think may provide some useful information as to how the locks 
will perform overall.  (If you spot any inconsistencies, it is
probably my fault...I'm going to try to explain this as best as
I can...)

First off, Solaris has adaptive locks.  If the owner of
the lock is currently active, the waiter will spin.  If the system sees
that the owner of the held lock is not currently active, the waiter will
sleep (they call this an adaptive lock - it now enters a turnstile).

Okay, so what happens when a mutex unlocks?  This depends on
whether you are in a spin or an adaptive lock.  Spinners
immediately see the freed lock, and the first one on the CPU
grabs it (excluding priority inversion here).

But the more interesting issue is what happens when we are in
an adaptive lock.  According to Mauro, Solaris 7+ has a thundering
herd condition for adaptive kernel locks.  It wakes up *all*
waiters and lets them fight it out.  This is a change from
Solaris 2.5.1 and 2.6.

Since you should never have a lot of threads sitting in a mutex 
(according to Sun, it is typical in practice to only have one 
kernel thread waiting), this thundering herd is okay and 
actually performs better than freeing the lock and only waking 
up one waiter.  They say it makes the code much cleaner.  I'm
betting we're overusing the mutexes, which changes the equation
considerably.  =-)

Okay, that tells you how it is done in kernel-space.  For
user-space (see below for how to remove user-space threading), 
it is slightly different.  Remember in Solaris, we have
a two-tier thread-model - user-space and kernel threads.

In user-space, when we call pthread_mutex_unlock, we hit 
_mutex_unlock in liblwp, which calls mutex_unlock_adaptive 
since ours isn't a special-case lock.

So, what does mutex_unlock_adaptive do in lwp?  In pseudocode (as
best as I can explain it): it first checks whether there is a
waiter in the current LWP; if so, it clears the lock bit and the
other thread in the same LWP takes it.  If no thread in the same
LWP is waiting, it will then spin for ~500 while-loop
iterations to let another thread in the same LWP take the lock.
(On a UP box, it'll exit here before doing the while loop, as it
knows that spinlocks are stupid.  I think you were testing on an
MP box.)  If the while loop concludes without anyone
acquiring the lock, it hands the lock off to the kernel, as no one in
this LWP cares for that lock (and we hit the semantics described above).
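
Or, as a loose C-style sketch of that description (all the helper
names are made up -- this is a paraphrase, not liblwp source):

    void mutex_unlock_adaptive(mutex_t *m)
    {
        clear_lock_bit(m);
        if (waiter_in_this_lwp(m)) {
            return;                    /* a local thread takes it directly */
        }
        if (ncpus > 1) {
            int i;
            for (i = 0; i < 500; i++) {  /* brief spin: give a thread in
                                          * this LWP a window to grab it */
                if (lock_was_taken(m)) {
                    return;
                }
            }
        }
        _lwp_mutex_unlock(m);          /* nobody local wanted it; have the
                                        * kernel wake a sleeping waiter */
    }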

I'm wondering if this while loop (essentially a spin
lock) may be the root cause of the lwp_mutex_unlock utilization 
you are seeing.

Anyway, in Solaris 9, IIRC, they have removed the userspace
scheduler and all threads are now bound (all threads map
directly to kernel threads).  You may achieve the same result
in earlier releases by using the alternate /usr/lib/lwp/libthread.so.

Re: remaining CPU bottlenecks in 2.0

2001-09-04 Thread William A. Rowe, Jr.

From: Justin Erenkrantz [EMAIL PROTECTED]
Sent: Tuesday, September 04, 2001 11:46 PM


 Based on the patches you submitted (and my quasi-errant formatting
 patch), I had to read most of the code in mod_include, so I'm more 
 familiar with mod_include now.  I do think there are some obvious 
 ways to optimize find_start_sequence.  I wonder if we could apply 
 a KMP string-matching algorithm here.  I dunno.  I'll take a look 
 at it though.  Something bugs me about the restarts.  I bet that 
 we spend even more time in find_start_sequence when an HTML file 
 has lots of comments.  =-)

You were discussing the possibility of parsing for "<!--#" as a skip by 5.

Consider jumping to a 4 byte alignment, truncating to char and skipping by
dwords.  E.g., you only have to test for three values, not four, and you
can use the machine's most optimal path.  But I'd ask: if strstr() is
optimized on the platform, why are we reinventing it?

This is DSS for little endian (the char comes from the first byte,
so skipping within 0-3 is not an issue), but for big-endian architectures you
need to backspace to the dword alignment so you don't miss the tag by
skipping up to 6 (wrong by 3, and then reading the fourth byte of the dword).

That has got to be your most optimal search pattern.
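
One way to read that suggestion as code (a sketch under my own
assumptions: it scans a word at a time for the leading '<', and this
particular zero-byte test is endian-neutral, which sidesteps the
backspacing issue above):

    #include <stddef.h>
    #include <stdint.h>

    /* Nonzero iff the 4-byte word v contains the byte c.  The classic
     * trick: XORing with c broadcast to all four bytes turns a match
     * into a zero byte, and (x - 0x01010101) & ~x & 0x80808080 flags
     * any zero byte in x.
     */
    static uint32_t word_has_byte(uint32_t v, unsigned char c)
    {
        uint32_t x = v ^ (0x01010101UL * c);
        return (x - 0x01010101UL) & ~x & 0x80808080UL;
    }

    /* Scan for '<' four aligned bytes at a time, dropping to a byte
     * scan only at the edges and inside a candidate word.
     */
    static const char *find_lt(const char *buf, size_t len)
    {
        const char *p = buf;
        const char *end = buf + len;

        while (p < end && ((uintptr_t)p & 3) != 0) {  /* reach alignment */
            if (*p == '<') return p;
            p++;
        }
        while (p + 4 <= end) {                        /* aligned dwords */
            if (word_has_byte(*(const uint32_t *)p, '<'))
                break;                                /* '<' is in here */
            p += 4;
        }
        while (p < end) {                             /* pinpoint / tail */
            if (*p == '<') return p;
            p++;
        }
        return NULL;
    }

Whether this actually beats a precomputed skip table is exactly the
sort of thing worth measuring before committing either way.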

Bill