I'm currently studying profiling data from an httpd built from
a CVS snapshot earlier today.
In general, the performance of 2.0 is starting to look good.
Here are the top 30 functions, ranked according to their CPU utilization.
(This summary is derived from a Quantify profile of a threaded-MPM httpd
handling mod_include requests on Solaris; thanks to Blaise Tarr at CNET
for the profile data.)
CPU time
function (% of total)
-------- ------------
1. find_start_sequence 23.9
2. mmap 14.1
3. munmap 10.2
4. _writev 9.4
5. _so_accept 4.7
6. _lwp_mutex_unlock 4.3
7. stat 3.2
8. _close 2.4
9. _write 2.4
10. __door_return 2.3
cumulative total: 76.9%
11. __open 2.0
12. _so_getsockname 1.6
13. find_entry 1.5
14. strlen 1.2
15. strcasecmp 1.2
16. _so_shutdown 1.2
17. _read 0.84
18. fstat64 0.82
19. __lwp_sema_post 0.80
20. memset 0.78
cumulative total: 88.8%
21. apr_hash_next 0.53
22. apr_file_read 0.46
23. _lrw_unlock 0.39
24. __fork1 0.39
25. apr_palloc 0.39
26. strncpy 0.28
27. memchr 0.26
28. _ltzset_u 0.24
29. overlay_extension_mappings 0.20
30. apr_hash_set 0.20
cumulative total: 92.18%
Some notes on the data:
* The discussion here covers only CPU utilization. There are other
aspects of performance, like multiprocessor scalability, that
are independent of this data.
* 7 of the top 10 functions are syscalls that I'd expect to see in
the top 10 for an httpd. :-)
* find_start_sequence() is the main scanning function within
mod_include. There's some research in progress to try to speed
this up significantly.
* With the exception of find_start_sequence(), all the opportunities
for optimization are individually small. I expect that the only
way to get an additional 5% reduction in CPU usage, for example,
will be to fix ten things that each account for 0.5%.
* _lwp_mutex_unlock() gets called from pthread_mutex_unlock(),
  but only from a small fraction of pthread_mutex_unlock() calls.
  (Can someone familiar with Solaris threading internals explain
  this one?)
* Collectively, stat and open account for 5% of the total CPU time.
It would be faster to do open+fstat rather than stat+open (as
long as the server is delivering mostly 200s rather than 304s),
but that might be too radical a change. Anybody have thoughts
on this?
* I think the __door_return() call is an anomaly of the profiling
instrumentation itself.
* find_entry() is part of the apr_hash_t implementation. Most of
  the calls to find_entry() and apr_hash_next() are coming from the
  mod_mime dir-merge function.
* strlen() is called from lots of places throughout the code, with
the most frequent calls being from apr_pstrdup, apr_pstrcat, and
time-formatting functions used in apr_rfc822_date.
* strcasecmp() is used heavily by the apr_table_t implementation
  (sort_overlap, apr_table_get, apr_table_setn). Converting tables
  to hashes is a potential speedup. However, most of the table
  API calls are on request_rec->headers_(in|out); changing those
  would impact a *lot* of code.
* fstat64() happens during fcntl() on sockets (it appears to be a
  Solaris implementation detail).
* memset() is called mostly from apr_pcalloc(), which in turn is
used in too many places to yield any easy optimization opportunities.
* memchr() is called from apr_pstrndup(), used mostly in apr_uri_parse()
and mod_mime's analyze_ct(). It may be possible to optimize away
some of the calls in the latter.
--Brian