Hi folks.
I've taken a first, awkward stab at the performance documentation.
> The performance tuning documentation that we currently include in the
> docs is simply awful. What with the comments about Apache 1.2 and the
> suggestions of how to deal with the new Linux 2.0 kernel, I think it's
> beyond fixing. It needs to be tossed and rewritten - although perhaps
> there are parts that are salvageable.
That's basically what I've attempted to do.
I've thrown out everything that's older than a decade, or otherwise
obsolete, or just confusing.
> I was wondering if someone has a performance doc that they could
> contribute as a starting place? Perhaps Sander's talk from AC? Or if
> someone would be willing to give some attention to the docs list for a
> while to assist in writing something that would be useful to actual
> admins in the real world.
Unfortunately, I have never attended any of those talks...
I did, however, borrow a line from colmmacc's blog:
http://www.stdlib.net/~colmmacc/2006/03/23/niagara-vs-ftpheanetie-showdown/
This patch is not trying to be complete -- or accurate in its spelling,
for that matter.
It's just supposed to throw away the crusty old bits, which in turn I
hope will get people motivated enough to start doing something about it.
> --
> Rich Bowen
> [email protected]
So long,
--
Igor Galić
Tel: +43 (0) 699 122 96 338
Fax: +43 (0) 1 90 89 226
Mail: [email protected]
URL: http://brainsware.org/
Index: perf-tuning.xml
===================================================================
--- perf-tuning.xml (revision 899345)
+++ perf-tuning.xml (working copy)
@@ -58,7 +58,7 @@
should, control the <directive module="mpm_common"
>MaxClients</directive> setting so that your server
does not spawn so many children it starts swapping. This procedure
- for doing this is simple: determine the size of your average Apache
+ for doing this is simple: determine the size of your average httpd
process, by looking at your process list via a tool such as
<code>top</code>, and divide this into your total available memory,
leaving some room for other processes.</p>
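+
+    <p>For example, assuming an average httpd process size of roughly
+    20 MB and about 1.5 GB of memory left over for httpd (both numbers
+    are only placeholders -- measure your own with <code>top</code>),
+    a first guess might be:</p>
+
+    <example>
+      # 1500 MB / 20 MB per child is roughly 75<br />
+      MaxClients 75
+    </example>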
@@ -121,13 +121,10 @@
<title>HostnameLookups and other DNS considerations</title>
- <p>Prior to Apache 1.3, <directive module="core"
- >HostnameLookups</directive> defaulted to <code>On</code>.
- This adds latency to every request because it requires a
- DNS lookup to complete before the request is finished. In
- Apache 1.3 this setting defaults to <code>Off</code>. If you need
- to have addresses in your log files resolved to hostnames, use the
- <program>logresolve</program>
+ <p>Starting with Apache 1.3, <directive module="core"
+ >HostnameLookups</directive> defaults to <code>Off</code>.
+ If you need to have addresses in your log files resolved to
+ hostnames, use the <program>logresolve</program>
program that comes with Apache, or one of the numerous log
reporting packages which are available.</p>
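+
+    <p>For example, to resolve an existing log file offline (the file
+    names here are just placeholders):</p>
+
+    <example>
+      logresolve < access_log > access_log.resolved
+    </example>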
@@ -285,7 +282,7 @@
</section>
- <section>
+ <section id="mmap">
<title>Memory-mapping</title>
@@ -300,13 +297,6 @@
<ul>
<li>
- <p>On some operating systems, <code>mmap</code> does not scale
- as well as <code>read(2)</code> when the number of CPUs increases.
- On multiprocessor Solaris servers, for example, Apache 2.x sometimes
- delivers server-parsed files faster when <code>mmap</code> is disabled.</p>
- </li>
-
- <li>
<p>If you memory-map a file located on an NFS-mounted filesystem
and a process on another NFS client machine deletes or truncates
the file, your process may get a bus error the next time it tries
@@ -321,7 +311,7 @@
</section>
- <section>
+ <section id="sendfile">
<title>Sendfile</title>
@@ -344,6 +334,13 @@
<p>With an NFS-mounted files, the kernel may be unable
to reliably serve the network file through it's own cache.</p>
</li>
+ <li>
+ <p>Although <code>sendfile(2)</code> works reliably on Solaris
+ (unlike on Linux), using it seems to hurt performance there:
+ while it does reduce the amount of memory used by httpd, it
+ delivers files more slowly than plain <code>read(2)</code> and
+ <code>write(2)</code> -- perhaps its blocking characteristics are slightly different.</p>
+ </li>
</ul>
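+
+    <p>In such cases sendfile delivery can be disabled with the
+    <directive module="core">EnableSendfile</directive> directive;
+    a minimal example:</p>
+
+    <example>
+      EnableSendfile Off
+    </example>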
<p>For installations where either of these factors applies, you
@@ -357,30 +354,7 @@
<title>Process Creation</title>
- <p>Prior to Apache 1.3 the <directive module="prefork"
- >MinSpareServers</directive>, <directive module="prefork"
- >MaxSpareServers</directive>, and <directive module="mpm_common"
- >StartServers</directive> settings all had drastic effects on
- benchmark results. In particular, Apache required a "ramp-up"
- period in order to reach a number of children sufficient to serve
- the load being applied. After the initial spawning of
- <directive module="mpm_common">StartServers</directive> children,
- only one child per second would be created to satisfy the
- <directive module="prefork">MinSpareServers</directive>
- setting. So a server being accessed by 100 simultaneous
- clients, using the default <directive module="mpm_common"
- >StartServers</directive> of <code>5</code> would take on
- the order 95 seconds to spawn enough children to handle
- the load. This works fine in practice on real-life servers,
- because they aren't restarted frequently. But does really
- poorly on benchmarks which might only run for ten minutes.</p>
-
- <p>The one-per-second rule was implemented in an effort to
- avoid swamping the machine with the startup of new children. If
- the machine is busy spawning children it can't service
- requests. But it has such a drastic effect on the perceived
- performance of Apache that it had to be replaced. As of Apache
- 1.3, the code will relax the one-per-second rule. It will spawn
+ <p>When spawning children, a <module>prefork</module>ing MPM will spawn
one, wait a second, then spawn two, wait a second, then spawn
four, and it will continue exponentially until it is spawning
32 children per second. It will stop whenever it satisfies the
@@ -402,9 +376,10 @@
setting. By default this is <code>0</code>,
which means that there is no limit to the number of requests
handled per child. If your configuration currently has this set
- to some very low number, such as <code>30</code>, you may want to bump this
- up significantly. If you are running SunOS or an old version of
- Solaris, limit this to <code>10000</code> or so because of memory leaks.</p>
+ to some very low number, such as <code>30</code>, you may want to
+ bump this up significantly. If you are running third-party modules
+ or applications that you suspect are leaking memory, limit this to
+ <code>10000</code> or so.</p>
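+
+    <p>For example (the exact value is only a rough starting point,
+    not a recommendation):</p>
+
+    <example>
+      MaxRequestsPerChild 10000
+    </example>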
<p>When keep-alives are in use, children will be kept busy
doing nothing waiting for more requests on the already open
@@ -412,9 +387,8 @@
>KeepAliveTimeout</directive> of <code>5</code>
seconds attempts to minimize this effect. The tradeoff here is
between network bandwidth and server resources. In no event
- should you raise this above about <code>60</code> seconds, as <a
- href="http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-4.html">
- most of the benefits are lost</a>.</p>
+ should you raise this above about <code>60</code> seconds, as
+ most of the benefits are lost.</p>
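+
+    <p>For example, to state the usual defaults explicitly:</p>
+
+    <example>
+      KeepAlive On<br />
+      KeepAliveTimeout 5
+    </example>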
</section>
@@ -455,6 +429,14 @@
third-party modules, and it is easier to debug on platforms
with poor thread debugging support.</li>
+ <li>The <module>event</module> MPM is based on the
+ <module>worker</module> MPM and likewise uses multiple child
+ processes with multiple threads each. Each worker thread handles
+ one connection at a time and hands it off to a thread dedicated
+ to controlling the listening sockets as well as sockets in
+ keep-alive state, so threads are only tied up by connections
+ that are being actively processed (see the build example below).</li>
+
</ul>
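+
+    <p>Which MPM is used is decided when httpd is compiled. For
+    example, to build with the <module>event</module> MPM instead of
+    the platform default (a sketch only -- add whatever other
+    configure options your environment needs):</p>
+
+    <example>
+      ./configure --with-mpm=event
+    </example>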
<p>For more information on these and other MPMs, please
@@ -492,62 +474,6 @@
<section>
- <title>Atomic Operations</title>
-
- <p>Some modules, such as <module>mod_cache</module> and
- recent development builds of the worker MPM, use APR's
- atomic API. This API provides atomic operations that can
- be used for lightweight thread synchronization.</p>
-
- <p>By default, APR implements these operations using the
- most efficient mechanism available on each target
- OS/CPU platform. Many modern CPUs, for example, have
- an instruction that does an atomic compare-and-swap (CAS)
- operation in hardware. On some platforms, however, APR
- defaults to a slower, mutex-based implementation of the
- atomic API in order to ensure compatibility with older
- CPU models that lack such instructions. If you are
- building Apache for one of these platforms, and you plan
- to run only on newer CPUs, you can select a faster atomic
- implementation at build time by configuring Apache with
- the <code>--enable-nonportable-atomics</code> option:</p>
-
- <example>
- ./buildconf<br />
- ./configure --with-mpm=worker --enable-nonportable-atomics=yes
- </example>
-
- <p>The <code>--enable-nonportable-atomics</code> option is
- relevant for the following platforms:</p>
-
- <ul>
-
- <li>Solaris on SPARC<br />
- By default, APR uses mutex-based atomics on Solaris/SPARC.
- If you configure with <code>--enable-nonportable-atomics</code>,
- however, APR generates code that uses a SPARC v8plus opcode for
- fast hardware compare-and-swap. If you configure Apache with
- this option, the atomic operations will be more efficient
- (allowing for lower CPU utilization and higher concurrency),
- but the resulting executable will run only on UltraSPARC
- chips.
- </li>
-
- <li>Linux on x86<br />
- By default, APR uses mutex-based atomics on Linux. If you
- configure with <code>--enable-nonportable-atomics</code>,
- however, APR generates code that uses a 486 opcode for fast
- hardware compare-and-swap. This will result in more efficient
- atomic operations, but the resulting executable will run only
- on 486 and later chips (and not on 386).
- </li>
-
- </ul>
-
- </section>
-
- <section>
-
<title>mod_status and ExtendedStatus On</title>
<p>If you include <module>mod_status</module> and you also set
@@ -564,321 +490,6 @@
<section>
- <title>accept Serialization - multiple sockets</title>
-
- <note type="warning"><title>Warning:</title>
- <p>This section has not been fully updated
- to take into account changes made in the 2.x version of the
- Apache HTTP Server. Some of the information may still be
- relevant, but please use it with care.</p>
- </note>
-
- <p>This discusses a shortcoming in the Unix socket API. Suppose
- your web server uses multiple <directive module="mpm_common"
- >Listen</directive> statements to listen on either multiple
- ports or multiple addresses. In order to test each socket
- to see if a connection is ready Apache uses
- <code>select(2)</code>. <code>select(2)</code> indicates that a
- socket has <em>zero</em> or <em>at least one</em> connection
- waiting on it. Apache's model includes multiple children, and
- all the idle ones test for new connections at the same time. A
- naive implementation looks something like this (these examples
- do not match the code, they're contrived for pedagogical
- purposes):</p>
-
- <example>
- for (;;) {<br />
- <indent>
- for (;;) {<br />
- <indent>
- fd_set accept_fds;<br />
- <br />
- FD_ZERO (&accept_fds);<br />
- for (i = first_socket; i <= last_socket; ++i) {<br />
- <indent>
- FD_SET (i, &accept_fds);<br />
- </indent>
- }<br />
- rc = select (last_socket+1, &accept_fds, NULL, NULL, NULL);<br />
- if (rc < 1) continue;<br />
- new_connection = -1;<br />
- for (i = first_socket; i <= last_socket; ++i) {<br />
- <indent>
- if (FD_ISSET (i, &accept_fds)) {<br />
- <indent>
- new_connection = accept (i, NULL, NULL);<br />
- if (new_connection != -1) break;<br />
- </indent>
- }<br />
- </indent>
- }<br />
- if (new_connection != -1) break;<br />
- </indent>
- }<br />
- process the new_connection;<br />
- </indent>
- }
- </example>
-
- <p>But this naive implementation has a serious starvation problem.
- Recall that multiple children execute this loop at the same
- time, and so multiple children will block at
- <code>select</code> when they are in between requests. All
- those blocked children will awaken and return from
- <code>select</code> when a single request appears on any socket
- (the number of children which awaken varies depending on the
- operating system and timing issues). They will all then fall
- down into the loop and try to <code>accept</code> the
- connection. But only one will succeed (assuming there's still
- only one connection ready), the rest will be <em>blocked</em>
- in <code>accept</code>. This effectively locks those children
- into serving requests from that one socket and no other
- sockets, and they'll be stuck there until enough new requests
- appear on that socket to wake them all up. This starvation
- problem was first documented in <a
- href="http://bugs.apache.org/index/full/467">PR#467</a>. There
- are at least two solutions.</p>
-
- <p>One solution is to make the sockets non-blocking. In this
- case the <code>accept</code> won't block the children, and they
- will be allowed to continue immediately. But this wastes CPU
- time. Suppose you have ten idle children in
- <code>select</code>, and one connection arrives. Then nine of
- those children will wake up, try to <code>accept</code> the
- connection, fail, and loop back into <code>select</code>,
- accomplishing nothing. Meanwhile none of those children are
- servicing requests that occurred on other sockets until they
- get back up to the <code>select</code> again. Overall this
- solution does not seem very fruitful unless you have as many
- idle CPUs (in a multiprocessor box) as you have idle children,
- not a very likely situation.</p>
-
- <p>Another solution, the one used by Apache, is to serialize
- entry into the inner loop. The loop looks like this
- (differences highlighted):</p>
-
- <example>
- for (;;) {<br />
- <indent>
- <strong>accept_mutex_on ();</strong><br />
- for (;;) {<br />
- <indent>
- fd_set accept_fds;<br />
- <br />
- FD_ZERO (&accept_fds);<br />
- for (i = first_socket; i <= last_socket; ++i) {<br />
- <indent>
- FD_SET (i, &accept_fds);<br />
- </indent>
- }<br />
- rc = select (last_socket+1, &accept_fds, NULL, NULL, NULL);<br />
- if (rc < 1) continue;<br />
- new_connection = -1;<br />
- for (i = first_socket; i <= last_socket; ++i) {<br />
- <indent>
- if (FD_ISSET (i, &accept_fds)) {<br />
- <indent>
- new_connection = accept (i, NULL, NULL);<br />
- if (new_connection != -1) break;<br />
- </indent>
- }<br />
- </indent>
- }<br />
- if (new_connection != -1) break;<br />
- </indent>
- }<br />
- <strong>accept_mutex_off ();</strong><br />
- process the new_connection;<br />
- </indent>
- }
- </example>
-
- <p><a id="serialize" name="serialize">The functions</a>
- <code>accept_mutex_on</code> and <code>accept_mutex_off</code>
- implement a mutual exclusion semaphore. Only one child can have
- the mutex at any time. There are several choices for
- implementing these mutexes. The choice is defined in
- <code>src/conf.h</code> (pre-1.3) or
- <code>src/include/ap_config.h</code> (1.3 or later). Some
- architectures do not have any locking choice made, on these
- architectures it is unsafe to use multiple
- <directive module="mpm_common">Listen</directive>
- directives.</p>
-
- <p>The <directive module="core">Mutex</directive> directive can
- be used to change the mutex implementation of the
- <code>mpm-accept</code> mutex at run-time. Special considerations
- for different mutex implementations are documented with that
- directive.</p>
-
- <p>Another solution that has been considered but never
- implemented is to partially serialize the loop -- that is, let
- in a certain number of processes. This would only be of
- interest on multiprocessor boxes where it's possible multiple
- children could run simultaneously, and the serialization
- actually doesn't take advantage of the full bandwidth. This is
- a possible area of future investigation, but priority remains
- low because highly parallel web servers are not the norm.</p>
-
- <p>Ideally you should run servers without multiple
- <directive module="mpm_common">Listen</directive>
- statements if you want the highest performance.
- But read on.</p>
-
- </section>
-
- <section>
-
- <title>accept Serialization - single socket</title>
-
- <p>The above is fine and dandy for multiple socket servers, but
- what about single socket servers? In theory they shouldn't
- experience any of these same problems because all children can
- just block in <code>accept(2)</code> until a connection
- arrives, and no starvation results. In practice this hides
- almost the same "spinning" behaviour discussed above in the
- non-blocking solution. The way that most TCP stacks are
- implemented, the kernel actually wakes up all processes blocked
- in <code>accept</code> when a single connection arrives. One of
- those processes gets the connection and returns to user-space,
- the rest spin in the kernel and go back to sleep when they
- discover there's no connection for them. This spinning is
- hidden from the user-land code, but it's there nonetheless.
- This can result in the same load-spiking wasteful behaviour
- that a non-blocking solution to the multiple sockets case
- can.</p>
-
- <p>For this reason we have found that many architectures behave
- more "nicely" if we serialize even the single socket case. So
- this is actually the default in almost all cases. Crude
- experiments under Linux (2.0.30 on a dual Pentium pro 166
- w/128Mb RAM) have shown that the serialization of the single
- socket case causes less than a 3% decrease in requests per
- second over unserialized single-socket. But unserialized
- single-socket showed an extra 100ms latency on each request.
- This latency is probably a wash on long haul lines, and only an
- issue on LANs. If you want to override the single socket
- serialization you can define
- <code>SINGLE_LISTEN_UNSERIALIZED_ACCEPT</code> and then
- single-socket servers will not serialize at all.</p>
-
- </section>
-
- <section>
-
- <title>Lingering Close</title>
-
- <p>As discussed in <a
- href="http://www.ics.uci.edu/pub/ietf/http/draft-ietf-http-connection-00.txt">
- draft-ietf-http-connection-00.txt</a> section 8, in order for
- an HTTP server to <strong>reliably</strong> implement the
- protocol it needs to shutdown each direction of the
- communication independently (recall that a TCP connection is
- bi-directional, each half is independent of the other).</p>
-
- <p>When this feature was added to Apache it caused a flurry of
- problems on various versions of Unix because of a
- shortsightedness. The TCP specification does not state that the
- <code>FIN_WAIT_2</code> state has a timeout, but it doesn't prohibit it.
- On systems without the timeout, Apache 1.2 induces many sockets
- stuck forever in the <code>FIN_WAIT_2</code> state. In many cases this
- can be avoided by simply upgrading to the latest TCP/IP patches
- supplied by the vendor. In cases where the vendor has never
- released patches (<em>i.e.</em>, SunOS4 -- although folks with
- a source license can patch it themselves) we have decided to
- disable this feature.</p>
-
- <p>There are two ways of accomplishing this. One is the socket
- option <code>SO_LINGER</code>. But as fate would have it, this
- has never been implemented properly in most TCP/IP stacks. Even
- on those stacks with a proper implementation (<em>i.e.</em>,
- Linux 2.0.31) this method proves to be more expensive (cputime)
- than the next solution.</p>
-
- <p>For the most part, Apache implements this in a function
- called <code>lingering_close</code> (in
- <code>http_main.c</code>). The function looks roughly like
- this:</p>
-
- <example>
- void lingering_close (int s)<br />
- {<br />
- <indent>
- char junk_buffer[2048];<br />
- <br />
- /* shutdown the sending side */<br />
- shutdown (s, 1);<br />
- <br />
- signal (SIGALRM, lingering_death);<br />
- alarm (30);<br />
- <br />
- for (;;) {<br />
- <indent>
- select (s for reading, 2 second timeout);<br />
- if (error) break;<br />
- if (s is ready for reading) {<br />
- <indent>
- if (read (s, junk_buffer, sizeof (junk_buffer)) <= 0) {<br />
- <indent>
- break;<br />
- </indent>
- }<br />
- /* just toss away whatever is here */<br />
- </indent>
- }<br />
- </indent>
- }<br />
- <br />
- close (s);<br />
- </indent>
- }
- </example>
-
- <p>This naturally adds some expense at the end of a connection,
- but it is required for a reliable implementation. As HTTP/1.1
- becomes more prevalent, and all connections are persistent,
- this expense will be amortized over more requests. If you want
- to play with fire and disable this feature you can define
- <code>NO_LINGCLOSE</code>, but this is not recommended at all.
- In particular, as HTTP/1.1 pipelined persistent connections
- come into use <code>lingering_close</code> is an absolute
- necessity (and <a
- href="http://www.w3.org/Protocols/HTTP/Performance/Pipeline.html">
- pipelined connections are faster</a>, so you want to support
- them).</p>
-
- </section>
-
- <section>
-
- <title>Scoreboard File</title>
-
- <p>Apache's parent and children communicate with each other
- through something called the scoreboard. Ideally this should be
- implemented in shared memory. For those operating systems that
- we either have access to, or have been given detailed ports
- for, it typically is implemented using shared memory. The rest
- default to using an on-disk file. The on-disk file is not only
- slow, but it is unreliable (and less featured). Peruse the
- <code>src/main/conf.h</code> file for your architecture and
- look for either <code>USE_MMAP_SCOREBOARD</code> or
- <code>USE_SHMGET_SCOREBOARD</code>. Defining one of those two
- (as well as their companions <code>HAVE_MMAP</code> and
- <code>HAVE_SHMGET</code> respectively) enables the supplied
- shared memory code. If your system has another type of shared
- memory, edit the file <code>src/main/http_main.c</code> and add
- the hooks necessary to use it in Apache. (Send us back a patch
- too please.)</p>
-
- <note>Historical note: The Linux port of Apache didn't start to
- use shared memory until version 1.2 of Apache. This oversight
- resulted in really poor and unreliable behaviour of earlier
- versions of Apache on Linux.</note>
-
- </section>
-
- <section>
-
<title>DYNAMIC_MODULE_LIMIT</title>
<p>If you have no intention of using dynamically loaded modules