Re: Remaining CPU bottlenecks in 2.0, revisited
Justin Erenkrantz wrote:

On Fri, Sep 28, 2001 at 04:57:05PM -0700, Brian Pane wrote:

The notes following the profile offer my interpretation of the numbers.

    rank  function        % of total    cumulative %
                          CPU time      of total CPU time
    ----  ------------    ----------    -----------------
      1.  munmap            18.16
      2.  _writev           14.00
      3.  mmap              11.96
      4.  _so_accept         5.71
      5.  bndm               4.82
      6.  _close             4.76
      7.  _so_shutdown       4.16
      8.  __open             3.18
      9.  _write             2.61
     10.  memset             2.38       71.74

Most of these look related to OS things that we can't control.

Yes. Or, to put it another way, the implementation efficiency is approaching its theoretical optimum for this architecture. That's not a bad thing. :-)

As you said, bndm is ~4x better than its predecessor. Unless we were to precompute mod_include files or replace its parsing algorithm with something better (which I think is possible), I don't think there is much we can do here.

Agreed, there probably isn't much more that can be done. I think 2.0's SSI parser is very close to optimal now.

I'm surprised _so_accept, _close, and _so_shutdown are up there in CPU usage. I also know that gstein has gone on record in his advogato diary that mmap may not be the performance win we think it is. It may be worth trying a build without mmap and seeing how we perform. Also, I'm guessing this is on Solaris 8? Any chance we could get the sendfilev patch on there? (I know it is part of 7/01 and Sol9 as well.)

I'll see if we can get some more comparative numbers this week... Out of curiosity, how is the performance relative to 1.3 for similar loads or URLs? -- justin

I just ran a quick comparison, using ab with concurrency==1 over the loopback on uniprocessor Linux (so these results show speed, but not scalability).
    50KB non-SSI HTML file:        1141 requests/s with 2.0.26-dev
                                    948 requests/s with 1.3.20
    SHTML file with 3 includes:     411 requests/s with 2.0.26-dev
    (total size 50KB)               200 requests/s with 1.3.20
    0KB non-SSI HTML file:         2764 requests/s with 2.0.26-dev
                                   2937 requests/s with 1.3.20

My interpretation of these results is that 2.0's basic request handling is a bit less efficient than 1.3's, as the 0KB test shows, but its use of sendfile for non-SSI requests and its much-improved mod_include code for SSI requests give it an advantage in the general case. --Brian
Re: remaining CPU bottlenecks in 2.0
On Wed, Sep 05, 2001 at 08:40:24PM -0700, Justin Erenkrantz wrote:

On Wed, Sep 05, 2001 at 08:34:30PM -0700, Ian Holsman wrote:

I've got one more question about the Solaris implementation of the threaded/worker MPM: should we be calling setconcurrency on startup? I know Solaris figures it out along the way, but a bit of gentle prodding never hurt.

Aaron and I went back and forth on this. He was of the mind that the user scheduler should figure it out after a period of time (and that we shouldn't muck with it).

Actually, I seem to recall suggesting that we add something like apr_thread_setconcurrency(), which attempts to do the right thing if the platform supports it (and if it doesn't, it's probably no big deal) and is otherwise just a no-op. This may help Solaris adapt more quickly to a fast ramp-up of the userspace-to-LWP thread mapping, especially when there are many compute-bound threads that are initially unwilling to yield. I have a feeling this may be more important on SMP boxes, where threads should be migrated to become bound threads as soon as possible.

The badness happens in testthread.c - Solaris executes that code serially, since the user scheduler never has a chance to interrupt and balance the threads. But in the real code, Solaris balances it, since we hit syscalls and mutexes quite often.

An equivalent to calling pthread_setconcurrency is to use /usr/lib/lwp/libthread.so instead of /usr/lib/libthread.so.

ISTR that this will cause every thread to become a bound thread, effectively eliminating the [theoretical] benefits of a partial userspace thread implementation. -aaron
Re: remaining CPU bottlenecks in 2.0
From: dean gaudet [EMAIL PROTECTED]
Sent: Wednesday, September 05, 2001 9:17 PM

On Tue, 4 Sep 2001, Brian Pane wrote:

* memset() is called mostly from apr_pcalloc(), which in turn is used in too many places to yield any easy optimization opportunities.

sometimes folks are lazy and ask for zeroed memory out of habit, when they could easily deal with garbage at less cost.

True. Or folks are being pedantic and summarizing a dozen NULL assignments immediately after an alloc - I'd suggest that group take a leap into a commented block just for clarity, and that we skip initializing long lists of NULL values. Use pcalloc where the majority of the allocation is zeroed, and palloc everywhere else.
mod_include.c WAS: RE: remaining CPU bottlenecks in 2.0
* find_start_sequence() is the main scanning function within mod_include. There's some research in progress to try to speed this up significantly.

Based on the patches you submitted (and my quasi-errant formatting patch), I had to read most of the code in mod_include, so I'm more familiar with mod_include now. I do think there are some obvious ways to optimize find_start_sequence. I wonder if we could apply a KMP string-matching algorithm here. I dunno. I'll take a look at it though. Something bugs me about the restarts. I bet that we spend even more time in find_start_sequence when an HTML file has lots of comments. =-)

I suggested to Ian yesterday night that I'd look into a Boyer-Moore matching algorithm. I'll work on that too. Sander
Re: remaining CPU bottlenecks in 2.0
On Tue, Sep 04, 2001 at 11:56:48PM -0500, William A. Rowe, Jr. wrote:

... You were discussing the possibility of parsing for <!--# as a skip by 5. Consider jumping to a 4-byte alignment, truncating to char, and skipping by dwords. E.g., you only have to test for three values, not four, and you can use the machine's most optimal path. But I'd ask: if strstr() isn't optimized on the platform, why are we reinventing it?

strstr() can't find strings that span a bucket boundary.

Cheers, -g -- Greg Stein, http://www.lyra.org/
Re: remaining CPU bottlenecks in 2.0
Greg Stein wrote:

On Tue, Sep 04, 2001 at 11:56:48PM -0500, William A. Rowe, Jr. wrote:

... You were discussing the possibility of parsing for <!--# as a skip by 5. Consider jumping to a 4-byte alignment, truncating to char, and skipping by dwords. E.g., you only have to test for three values, not four, and you can use the machine's most optimal path. But I'd ask: if strstr() isn't optimized on the platform, why are we reinventing it?

strstr() can't find strings that span a bucket boundary.

In addition, the most promising search algorithms for this problem require a lookup table that contains metadata about the search pattern. We can save time by precomputing the lookup table for <!--# at server startup (or even at compile time), which is something that strstr() can't do. --Brian
Re: remaining CPU bottlenecks in 2.0
Justin Erenkrantz wrote: [...]

* The discussion here covers only CPU utilization. There are other aspects of performance, like multiprocessor scalability, that are independent of this data.

Once we get the syscalls optimized (I'm reminded of Dean's attack on our number of syscalls in 1.3 - I believe he went through syscall by syscall, trying to eliminate all of the unnecessary ones), I think the next performance point will be MP scalability (see below for lock scalability on Solaris). But we do need to see what we can do about optimizing the syscalls first...

The syscall profile for 2.0 looks close to optimal (as long as you disable .htaccess files and symlink checking). We're doing a few too many gettimeofday calls per request, though; I'll investigate whether any of these can be optimized away. (Fortunately, gettimeofday is cheap compared to time(2).) [...]

* _lwp_mutex_unlock() gets called from pthread_mutex_unlock(), but only from a small fraction of pthread_mutex_unlock calls. (Can someone familiar with Solaris threading internals explain this one?)

The LWP scheduler may also call _lwp_mutex_unlock() implicitly - the LWP scheduler is a user-space library, so I bet it gets thrown in with our numbers. Here's some background on Solaris's implementation that I think may provide some useful information as to how the locks will perform overall. (If you spot any inconsistencies, it is probably my fault... I'm going to try to explain this as best as I can...)

First off, Solaris has adaptive locks. If the owner of the lock is currently running, a waiter will spin. If the system sees that the owner of the held lock is not currently running, the waiter will sleep (they call this an adaptive lock - it now enters a turnstile).

Ah, I guess that explains why only a small fraction of pthread_mutex_lock calls on Solaris seem to result in calls to lwp_mutex_lock: in the fast case, where the lock is available, it just stays in user-mode code? --Brian
Re: remaining CPU bottlenecks in 2.0
On Wed, Sep 05, 2001 at 08:19:50PM -0700, Brian Pane wrote:

Ah, I guess that explains why only a small fraction of pthread_mutex_lock calls on Solaris seem to result in calls to lwp_mutex_lock: in the fast case, where the lock is available, it just stays in user-mode code?

Yes. -- justin
Re: remaining CPU bottlenecks in 2.0
I've got one more question about the Solaris implementation of the threaded/worker MPM: should we be calling setconcurrency on startup? I know Solaris figures it out along the way, but a bit of gentle prodding never hurt. ..Ian
Re: remaining CPU bottlenecks in 2.0
On Wed, Sep 05, 2001 at 08:34:30PM -0700, Ian Holsman wrote:

I've got one more question about the Solaris implementation of the threaded/worker MPM: should we be calling setconcurrency on startup? I know Solaris figures it out along the way, but a bit of gentle prodding never hurt.

Aaron and I went back and forth on this. He was of the mind that the user scheduler should figure it out after a period of time (and that we shouldn't muck with it).

The badness happens in testthread.c - Solaris executes that code serially, since the user scheduler never has a chance to interrupt and balance the threads. But in the real code, Solaris balances it, since we hit syscalls and mutexes quite often.

An equivalent to calling pthread_setconcurrency is to use /usr/lib/lwp/libthread.so instead of /usr/lib/libthread.so. -- justin
Re: remaining CPU bottlenecks in 2.0
On Tue, 4 Sep 2001, Brian Pane wrote:

* Collectively, stat and open comprise 5% of the total CPU time. It would be faster to do open+fstat rather than stat+open (as long as the server is delivering mostly 200s rather than 304s), but that might be too radical a change. Anybody have thoughts on this?

linux' dcache makes either method about the same cost, but on other kernels, and especially when NFS is involved (because of how NFS path resolution occurs), the open/fstat is much better. the trick would be to cache open filehandles so that fstat can be used.

* memset() is called mostly from apr_pcalloc(), which in turn is used in too many places to yield any easy optimization opportunities.

sometimes folks are lazy and ask for zeroed memory out of habit, when they could easily deal with garbage at less cost. -dean
Re: remaining CPU bottlenecks in 2.0
dean gaudet wrote: [...]

* memset() is called mostly from apr_pcalloc(), which in turn is used in too many places to yield any easy optimization opportunities.

sometimes folks are lazy and ask for zeroed memory out of habit, when they could easily deal with garbage at less cost.

Some good news here: I dug through the profile data and found that over half of the httpd's calls to apr_pcalloc come from one function, find_entry() in the apr_hash_set implementation. It uses apr_pcalloc to allocate a struct with 5 fields, and then immediately overwrites 4 of those fields. I'll submit a patch to the APR dev list to fix this. --Brian
Re: remaining CPU bottlenecks in 2.0
On Tue, Sep 04, 2001 at 08:00:35PM -0700, Brian Pane wrote:

I'm currently studying profiling data from an httpd built from a CVS snapshot earlier today. In general, the performance of 2.0 is starting to look good.

Cool. This probably means the code is starting to look good, too.

* The discussion here covers only CPU utilization. There are other aspects of performance, like multiprocessor scalability, that are independent of this data.

Once we get the syscalls optimized (I'm reminded of Dean's attack on our number of syscalls in 1.3 - I believe he went through syscall by syscall, trying to eliminate all of the unnecessary ones), I think the next performance point will be MP scalability (see below for lock scalability on Solaris). But we do need to see what we can do about optimizing the syscalls first...

* find_start_sequence() is the main scanning function within mod_include. There's some research in progress to try to speed this up significantly.

Based on the patches you submitted (and my quasi-errant formatting patch), I had to read most of the code in mod_include, so I'm more familiar with mod_include now. I do think there are some obvious ways to optimize find_start_sequence. I wonder if we could apply a KMP string-matching algorithm here. I dunno. I'll take a look at it though. Something bugs me about the restarts. I bet that we spend even more time in find_start_sequence when an HTML file has lots of comments. =-)

* strlen() is called from lots of places throughout the code, with the most frequent calls being from apr_pstrdup, apr_pstrcat, and the time-formatting functions used in apr_rfc822_date.

I think someone has brought up that apr_pstrdup does an extra strlen. I'll have to review that code.

* _lwp_mutex_unlock() gets called from pthread_mutex_unlock(), but only from a small fraction of pthread_mutex_unlock calls. (Can someone familiar with Solaris threading internals explain this one?)
The LWP scheduler may also call _lwp_mutex_unlock() implicitly - the LWP scheduler is a user-space library, so I bet it gets thrown in with our numbers. Here's some background on Solaris's implementation that I think may provide some useful information as to how the locks will perform overall. (If you spot any inconsistencies, it is probably my fault... I'm going to try to explain this as best as I can...)

First off, Solaris has adaptive locks. If the owner of the lock is currently running, a waiter will spin. If the system sees that the owner of the held lock is not currently running, the waiter will sleep (they call this an adaptive lock - it now enters a turnstile).

Okay, so what happens when a mutex unlocks? This depends on whether you are in a spin or an adaptive lock. With spin locks, the waiters immediately see the freed lock, and the first one on the CPU grabs it (excluding priority inversion here). But the more interesting issue is what happens when we are in an adaptive lock. According to Mauro, Solaris 7+ has a thundering-herd condition for adaptive kernel locks: it wakes up *all* waiters and lets them fight it out. This was a change from Solaris 2.5.1 and 2.6. Since you should never have a lot of threads sitting in a mutex (according to Sun, it is typical in practice to have only one kernel thread waiting), this thundering herd is okay and actually performs better than freeing the lock and waking up only one waiter. They say it makes the code much cleaner. I'm betting we're overusing the mutexes, which changes the equation considerably. =-)

Okay, that tells you how it is done in kernel space. For user space (see below for how to remove user-space threading), it is slightly different. Remember that in Solaris we have a two-tier thread model - user-space threads and kernel threads. In user space, when we call pthread_mutex_unlock, we hit _mutex_unlock in liblwp, which calls mutex_unlock_adaptive, as we aren't a special-case lock. So, what does mutex_unlock_adaptive do in lwp?
In pseudocode (as best as I can explain it): it first checks whether there is a waiter in the current LWP; if so, it clears the lock bit and the other thread in the same LWP takes the lock. If no thread in the same LWP is waiting, it will then spin for ~500 while-loop iterations to let another thread in the same LWP take the lock. (On a UP box, it will exit here before doing the while loop, as it knows that spinning is pointless there. I think you were testing on an MP box.) If the while loop concludes without anyone acquiring the lock, it hands the lock to the kernel, as no one in this LWP cares about it (and we hit the semantics described above). I'm wondering if this while loop (essentially a spin lock) may be the root cause of the lwp_mutex_unlock utilization you are seeing.

Anyway, in Solaris 9, IIRC, they have removed the userspace scheduler and all threads are now bound (all threads map directly to kernel threads). You may acheive the same result in
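[The unlock path described above, restated as a rough sketch. This is purely illustrative, paraphrasing the explanation in this mail - it is not Sun's actual liblwp code.]

```
mutex_unlock_adaptive(m):
    clear the lock bit
    if a thread on this LWP is waiting for m:
        return                        /* handoff stays in user space */
    spin for ~500 loop iterations     /* skipped entirely on a UP box */
        if someone took m:
            return
    _lwp_mutex_unlock(m)              /* escalate to the kernel, which
                                         may wake all waiters
                                         (the thundering herd above) */
```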
Re: remaining CPU bottlenecks in 2.0
From: Justin Erenkrantz [EMAIL PROTECTED]
Sent: Tuesday, September 04, 2001 11:46 PM

Based on the patches you submitted (and my quasi-errant formatting patch), I had to read most of the code in mod_include, so I'm more familiar with mod_include now. I do think there are some obvious ways to optimize find_start_sequence. I wonder if we could apply a KMP string-matching algorithm here. I dunno. I'll take a look at it though. Something bugs me about the restarts. I bet that we spend even more time in find_start_sequence when an HTML file has lots of comments. =-)

You were discussing the possibility of parsing for <!--# as a skip by 5. Consider jumping to a 4-byte alignment, truncating to char, and skipping by dwords. E.g., you only have to test for three values, not four, and you can use the machine's most optimal path. But I'd ask: if strstr() isn't optimized on the platform, why are we reinventing it?

This is DSS for little endian (that char bugger comes from the first byte, so skipping in 0-3 is not an issue), but for big-endian architectures you need to backspace to the dword alignment so you don't miss the tag by skipping up to 6 (wrong by 3, and then reading the fourth byte of the dword). That has got to be your most optimal search pattern. Bill