Hi folks.
I've taken a first, awkward stab at the performance documentation.
> The performance tuning documentation that we currently include in the
> docs is simply awful. What with the comments about Apache 1.2 and the
> suggestions of how to deal with the new Linux 2.0 kernel, I think it's
> beyond fixing. It needs to be tossed and rewritten - although perhaps
> there are parts that are salvageable.
That's basically what I've attempted to do.
I've thrown out everything that's older than a decade, or otherwise
obsolete, or just confusing.
> I was wondering if someone has a performance doc that they could
> contribute as a starting place? Perhaps Sander's talk from AC? Or if
> someone would be willing to give some attention to the docs list for a
> while to assist in writing something that would be useful to actual
> admins in the real world.
Unfortunately, I have never attended any of those talks...
I did, however, borrow a line from colmmacc's blog:
http://www.stdlib.net/~colmmacc/2006/03/23/niagara-vs-ftpheanetie-showdown/
This patch is not trying to be complete -- or accurate in its spelling,
for that matter.
It's just supposed to throw away the crusty old bits, which in turn I
hope will get people motivated enough to start doing something about it.
> --
> Rich Bowen
> [email protected]
So long,
--
Igor Galić
Tel: +43 (0) 699 122 96 338
Fax: +43 (0) 1 90 89 226
Mail: [email protected]
URL: http://brainsware.org/
Index: perf-tuning.xml
===================================================================
--- perf-tuning.xml (revision 899345)
+++ perf-tuning.xml (working copy)
@@ -58,7 +58,7 @@
should, control the <directive module="mpm_common"
>MaxClients</directive> setting so that your server
does not spawn so many children it starts swapping. This procedure
- for doing this is simple: determine the size of your average Apache
+ for doing this is simple: determine the size of your average httpd
process, by looking at your process list via a tool such as
<code>top</code>, and divide this into your total available memory,
leaving some room for other processes.</p>
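+
+    <p>For example, assuming an average httpd process size of roughly
+    20 MB and about 1.5 GB of memory left over for httpd (both numbers
+    are only placeholders -- measure your own with <code>top</code>),
+    a first guess might be:</p>
+
+    <example>
+      # 1500 MB / 20 MB per child is roughly 75<br />
+      MaxClients 75
+    </example>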
@@ -121,13 +121,10 @@
<title>HostnameLookups and other DNS considerations</title>
- <p>Prior to Apache 1.3, <directive module="core"
- >HostnameLookups</directive> defaulted to <code>On</code>.
- This adds latency to every request because it requires a
- DNS lookup to complete before the request is finished. In
- Apache 1.3 this setting defaults to <code>Off</code>. If you need
- to have addresses in your log files resolved to hostnames, use the
- <program>logresolve</program>
+ <p>Starting with Apache 1.3, <directive module="core"
+ >HostnameLookups</directive> defaults to <code>Off</code>.
+ If you need to have addresses in your log files resolved to
+ hostnames, use the <program>logresolve</program>
program that comes with Apache, or one of the numerous log
reporting packages which are available.</p>
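+
+    <p>For example, to resolve an existing log file offline (the file
+    names here are just placeholders):</p>
+
+    <example>
+      logresolve < access_log > access_log.resolved
+    </example>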
@@ -285,7 +282,7 @@
</section>
- <section>
+ <section id="mmap">
<title>Memory-mapping</title>
@@ -300,13 +297,6 @@
<ul>
<li>
- <p>On some operating systems, <code>mmap</code> does not scale
- as well as <code>read(2)</code> when the number of CPUs increases.
- On multiprocessor Solaris servers, for example, Apache 2.x sometimes
- delivers server-parsed files faster when <code>mmap</code> is disabled.</p>
- </li>
-
- <li>
<p>If you memory-map a file located on an NFS-mounted filesystem
and a process on another NFS client machine deletes or truncates
the file, your process may get a bus error the next time it tries
@@ -321,7 +311,7 @@
</section>
- <section>
+ <section id="sendfile">
<title>Sendfile</title>
@@ -344,6 +334,13 @@
<p>With an NFS-mounted files, the kernel may be unable
to reliably serve the network file through it's own cache.</p>
</li>
+ <li>
+ <p>Although <code>sendfile(2)</code> works reliably on Solaris
+ (unlike on Linux), using it seems to hurt performance there:
+ while it does reduce the amount of memory used by httpd, it
+ delivers files more slowly than plain <code>read(2)</code> and
+ <code>write(2)</code> -- perhaps its blocking characteristics are slightly different.</p>
+ </li>
</ul>
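+
+    <p>In such cases sendfile delivery can be disabled with the
+    <directive module="core">EnableSendfile</directive> directive;
+    a minimal example:</p>
+
+    <example>
+      EnableSendfile Off
+    </example>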
<p>For installations where either of these factors applies, you
@@ -357,30 +354,7 @@
<title>Process Creation</title>
- <p>Prior to Apache 1.3 the <directive module="prefork"
- >MinSpareServers</directive>, <directive module="prefork"
- >MaxSpareServers</directive>, and <directive module="mpm_common"
- >StartServers</directive> settings all had drastic effects on
- benchmark results. In particular, Apache required a "ramp-up"
- period in order to reach a number of children sufficient to serve
- the load being applied. After the initial spawning of
- <directive module="mpm_common">StartServers</directive> children,
- only one child per second would be created to satisfy the
- <directive module="prefork">MinSpareServers</directive>
- setting. So a server being accessed by 100 simultaneous
- clients, using the default <directive module="mpm_common"
- >StartServers</directive> of <code>5</code> would take on
- the order 95 seconds to spawn enough children to handle
- the load. This works fine in practice on real-life servers,
- because they aren't restarted frequently. But does really
- poorly on benchmarks which might only run for ten minutes.</p>
-
- <p>The one-per-second rule was implemented in an effort to
- avoid swamping the machine with the startup of new children. If
- the machine is busy spawning children it can't service
- requests. But it has such a drastic effect on the perceived
- performance of Apache that it had to be replaced. As of Apache
- 1.3, the code will relax the one-per-second rule. It will spawn
+ <p>When spawning children, a <module>prefork</module>ing MPM will spawn
one, wait a second, then spawn two, wait a second, then spawn
four, and it will continue exponentially until it is spawning
32 children per second. It will stop whenever it satisfies the
@@ -402,9 +376,10 @@
setting. By default this is <code>0</code>,
which means that there is no limit to the number of requests
handled per child. If your configuration currently has this set
- to some very low number, such as <code>30</code>, you may want to bump this
- up significantly. If you are running SunOS or an old version of
- Solaris, limit this to <code>10000</code> or so because of memory leaks.</p>
+ to some very low number, such as <code>30</code>, you may want to
+ bump this up significantly. If you are running third-party modules
+ or applications that you suspect are leaking memory, limit this to
+ <code>10000</code> or so.</p>
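+
+    <p>For example (the exact value is only a rough starting point,
+    not a recommendation):</p>
+
+    <example>
+      MaxRequestsPerChild 10000
+    </example>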
<p>When keep-alives are in use, children will be kept busy
doing nothing waiting for more requests on the already open
@@ -412,9 +387,8 @@
>KeepAliveTimeout</directive> of <code>5</code>
seconds attempts to minimize this effect. The tradeoff here is
between network bandwidth and server resources. In no event
- should you raise this above about <code>60</code> seconds, as <a
- href="http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-4.html">
- most of the benefits are lost</a>.</p>
+ should you raise this above about <code>60</code> seconds, as
+ most of the benefits are lost.</p>
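+
+    <p>For example, to state the usual defaults explicitly:</p>
+
+    <example>
+      KeepAlive On<br />
+      KeepAliveTimeout 5
+    </example>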
</section>
@@ -455,6 +429,14 @@
third-party modules, and it is easier to debug on platforms
with poor thread debugging support.</li>
+ <li>The <module>event</module> MPM is based on the
+ <module>worker</module> MPM and likewise uses multiple child
+ processes with multiple threads each. Each worker thread handles
+ one connection at a time and hands it off to a thread dedicated
+ to controlling the listening sockets as well as sockets in
+ keep-alive state, so threads are only tied up by connections
+ that are being actively processed (see the build example below).</li>
+
</ul>
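+
+    <p>Which MPM is used is decided when httpd is compiled. For
+    example, to build with the <module>event</module> MPM instead of
+    the platform default (a sketch only -- add whatever other
+    configure options your environment needs):</p>
+
+    <example>
+      ./configure --with-mpm=event
+    </example>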
<p>For more information on these and other MPMs, please
@@ -492,62 +474,6 @@
<section>
- <title>Atomic Operations</title>
-
- <p>Some modules, such as <module>mod_cache</module> and
- recent development builds of the worker MPM, use APR's
- atomic API. This API provides atomic operations that can
- be used for lightweight thread synchronization.</p>
-
- <p>By default, APR implements these operations using the
- most efficient mechanism available on each target
- OS/CPU platform. Many modern CPUs, for example, have
- an instruction that does an atomic compare-and-swap (CAS)
- operation in hardware. On some platforms, however, APR
- defaults to a slower, mutex-based implementation of the
- atomic API in order to ensure compatibility with older
- CPU models that lack such instructions. If you are
- building Apache for one of these platforms, and you plan
- to run only on newer CPUs, you can select a faster atomic
- implementation at build time by configuring Apache with
- the <code>--enable-nonportable-atomics</code> option:</p>
-
- <example>
- ./buildconf<br />
- ./configure --with-mpm=worker --enable-nonportable-atomics=yes
- </example>
-
- <p>The <code>--enable-nonportable-atomics</code> option is
- relevant for the following platforms:</p>
-
- <ul>
-
- <li>Solaris on SPARC<br />
- By default, APR uses mutex-based atomics on Solaris/SPARC.
- If you configure with <code>--enable-nonportable-atomics</code>,
- however, APR generates code that uses a SPARC v8plus opcode for
- fast hardware compare-and-swap. If you configure Apache with
- this option, the atomic operations will be more efficient
- (allowing for lower CPU utilization and higher concurrency),
- but the resulting executable will run only on UltraSPARC
- chips.
- </li>
-
- <li>Linux on x86<br />
- By default, APR uses mutex-based atomics on Linux. If you
- configure with <code>--enable-nonportable-atomics</code>,
- however, APR generates code that uses a 486 opcode for fast
- hardware compare-and-swap. This will result in more efficient
- atomic operations, but the resulting executable will run only
- on 486 and later chips (and not on 386).
- </li>
-
- </ul>
-
- </section>
-
- <section>
-
<title>mod_status and ExtendedStatus On</title>
<p>If you include <module>mod_status</module> and you also set
@@ -564,321 +490,6 @@
<section>
- <title>accept Serialization - multiple sockets</title>
-
- <note type="warning"><title>Warning:</title>
- <p>This section has not been fully updated
- to take into account changes made in the 2.x version of the
- Apache HTTP Server. Some of the information may still be
- relevant, but please use it with care.</p>
- </note>
-
- <p>This discusses a shortcoming in the Unix socket API. Suppose
- your web server uses multiple <directive module="mpm_common"
- >Listen</directive> statements to listen on either multiple
- ports or multiple addresses. In order to test each socket
- to see if a connection is ready Apache uses
- <code>select(2)</code>. <code>select(2)</code> indicates that a
- socket has <em>zero</em> or <em>at least one</em> connection
- waiting on it. Apache's model includes multiple children, and
- all the idle ones test for new connections at the same time. A
- naive implementation looks something like this (these examples
- do not match the code, they're contrived for pedagogical
- purposes):</p>
-
- <example>
- for (;;) {<br />
- <indent>
- for (;;) {<br />
- <indent>
- fd_set accept_fds;<br />
- <br />
- FD_ZERO (&accept_fds);<br />
- for (i = first_socket; i <= last_socket; ++i) {<br />
- <indent>
- FD_SET (i, &accept_fds);<br />
- </indent>
- }<br />
- rc = select (last_socket+1, &accept_fds, NULL, NULL, NULL);<br />
- if (rc < 1) continue;<br />
- new_connection = -1;<br />
- for (i = first_socket; i <= last_socket; ++i) {<br />
- <indent>
- if (FD_ISSET (i, &accept_fds)) {<br />
- <indent>
- new_connection = accept (i, NULL, NULL);<br />
- if (new_connection != -1) break;<br />
- </indent>
- }<br />
- </indent>
- }<br />
- if (new_connection != -1) break;<br />
- </indent>
- }<br />
- process the new_connection;<br />
- </indent>
- }
- </example>
-
- <p>But this naive implementation has a serious starvation problem.
- Recall that multiple children execute this loop at the same
- time, and so multiple children will block at
- <code>select</code> when they are in between requests. All
- those blocked children will awaken and return from
- <code>select</code> when a single request appears on any socket
- (the number of children which awaken varies depending on the
- operating system and timing issues). They will all then fall
- down into the loop and try to <code>accept</code> the
- connection. But only one will succeed (assuming there's still
- only one connection ready), the rest will be <em>blocked</em>
- in <code>accept</code>. This effectively locks those children
- into serving requests from that one socket and no other
- sockets, and they'll be stuck there until enough new requests
- appear on that socket to wake them all up. This starvation
- problem was first documented in <a
- href="http://bugs.apache.org/index/full/467">PR#467</a>. There
- are at least two solutions.</p>
-
- <p>One solution is to make the sockets non-blocking. In this
- case the <code>accept</code> won't block the children, and they
- will be allowed to continue immediately. But this wastes CPU
- time. Suppose you have ten idle children in
- <code>select</code>, and one connection arrives. Then nine of
- those children will wake up, try to <code>accept</code> the
- connection, fail, and loop back into <code>select</code>,
- accomplishing nothing. Meanwhile none of those children are
- servicing requests that occurred on other sockets until they
- get back up to the <code>select</code> again. Overall this
- solution does not seem very fruitful unless you have as many
- idle CPUs (in a multiprocessor box) as you have idle children,
- not a very likely situation.</p>
-
- <p>Another solution, the one used by Apache, is to serialize
- entry into the inner loop. The loop looks like this
- (differences highlighted):</p>
-
- <example>
- for (;;) {<br />
- <indent>
- <strong>accept_mutex_on ();</strong><br />
- for (;;) {<br />
- <indent>
- fd_set accept_fds;<br />
- <br />
- FD_ZERO (&accept_fds);<br />
- for (i = first_socket; i <= last_socket; ++i) {<br />
- <indent>
- FD_SET (i, &accept_fds);<br />
- </indent>
- }<br />
- rc = select (last_socket+1, &accept_fds, NULL, NULL, NULL);<br />
- if (rc < 1) continue;<br />
- new_connection = -1;<br />
- for (i = first_socket; i <= last_socket; ++i) {<br />
- <indent>
- if (FD_ISSET (i, &accept_fds)) {<br />
- <indent>
- new_connection = accept (i, NULL, NULL);<br />
- if (new_connection != -1) break;<br />
- </indent>
- }<br />
- </indent>
- }<br />
- if (new_connection != -1) break;<br />
- </indent>
- }<br />
- <strong>accept_mutex_off ();</strong><br />
- process the new_connection;<br />
- </indent>
- }
- </example>
-
- <p><a id="serialize" name="serialize">The functions</a>
- <code>accept_mutex_on</code> and <code>accept_mutex_off</code>
- implement a mutual exclusion semaphore. Only one child can have
- the mutex at any time. There are several choices for
- implementing these mutexes. The choice is defined in
- <code>src/conf.h</code> (pre-1.3) or
- <code>src/include/ap_config.h</code> (1.3 or later). Some
- architectures do not have any locking choice made, on these
- architectures it is unsafe to use multiple
- <directive module="mpm_common">Listen</directive>
- directives.</p>
-
- <p>The <directive module="core">Mutex</directive> directive can
- be used to change the mutex implementation of the
- <code>mpm-accept</code> mutex at run-time. Special considerations
- for different mutex implementations are documented with that
- directive.</p>
-
- <p>Another solution that has been considered but never
- implemented is to partially serialize the loop -- that is, let
- in a certain number of processes. This would only be of
- interest on multiprocessor boxes where it's possible multiple
- children could run simultaneously, and the serialization
- actually doesn't take advantage of the full bandwidth. This is
- a possible area of future investigation, but priority remains
- low because highly parallel web servers are not the norm.</p>
-
- <p>Ideally you should run servers without multiple
- <directive module="mpm_common">Listen</directive>
- statements if you want the highest performance.
- But read on.</p>
-
- </section>
-
- <section>
-
- <title>accept Serialization - single socket</title>
-
- <p>The above is fine and dandy for multiple socket servers, but
- what about single socket servers? In theory they shouldn't
- experience any of these same problems because all children can
- just block in <code>accept(2)</code> until a connection
- arrives, and no starvation results. In practice this hides
- almost the same "spinning" behaviour discussed above in the
- non-blocking solution. The way that most TCP stacks are
- implemented, the kernel actually wakes up all processes blocked
- in <code>accept</code> when a single connection arrives. One of
- those processes gets the connection and returns to user-space,
- the rest spin in the kernel and go back to sleep when they
- discover there's no connection for them. This spinning is
- hidden from the user-land code, but it's there nonetheless.
- This can result in the same load-spiking wasteful behaviour
- that a non-blocking solution to the multiple sockets case
- can.</p>
-
- <p>For this reason we have found that many architectures behave
- more "nicely" if we serialize even the single socket case. So
- this is actually the default in almost all cases. Crude
- experiments under Linux (2.0.30 on a dual Pentium pro 166
- w/128Mb RAM) have shown that the serialization of the single
- socket case causes less than a 3% decrease in requests per
- second over unserialized single-socket. But unserialized
- single-socket showed an extra 100ms latency on each request.
- This latency is probably a wash on long haul lines, and only an
- issue on LANs. If you want to override the single socket
- serialization you can define
- <code>SINGLE_LISTEN_UNSERIALIZED_ACCEPT</code> and then
- single-socket servers will not serialize at all.</p>
-
- </section>
-
- <section>
-
- <title>Lingering Close</title>
-
- <p>As discussed in <a
- href="http://www.ics.uci.edu/pub/ietf/http/draft-ietf-http-connection-00.txt">
- draft-ietf-http-connection-00.txt</a> section 8, in order for
- an HTTP server to <strong>reliably</strong> implement the
- protocol it needs to shutdown each direction of the
- communication independently (recall that a TCP connection is
- bi-directional, each half is independent of the other).</p>
-
- <p>When this feature was added to Apache it caused a flurry of
- problems on various versions of Unix because of a
- shortsightedness. The TCP specification does not state that the
- <code>FIN_WAIT_2</code> state has a timeout, but it doesn't prohibit it.
- On systems without the timeout, Apache 1.2 induces many sockets
- stuck forever in the <code>FIN_WAIT_2</code> state. In many cases this
- can be avoided by simply upgrading to the latest TCP/IP patches
- supplied by the vendor. In cases where the vendor has never
- released patches (<em>i.e.</em>, SunOS4 -- although folks with
- a source license can patch it themselves) we have decided to
- disable this feature.</p>
-
- <p>There are two ways of accomplishing this. One is the socket
- option <code>SO_LINGER</code>. But as fate would have it, this
- has never been implemented properly in most TCP/IP stacks. Even
- on those stacks with a proper implementation (<em>i.e.</em>,
- Linux 2.0.31) this method proves to be more expensive (cputime)
- than the next solution.</p>
-
- <p>For the most part, Apache implements this in a function
- called <code>lingering_close</code> (in
- <code>http_main.c</code>). The function looks roughly like
- this:</p>
-
- <example>
- void lingering_close (int s)<br />
- {<br />
- <indent>
- char junk_buffer[2048];<br />
- <br />
- /* shutdown the sending side */<br />
- shutdown (s, 1);<br />
- <br />
- signal (SIGALRM, lingering_death);<br />
- alarm (30);<br />
- <br />
- for (;;) {<br />
- <indent>
- select (s for reading, 2 second timeout);<br />
- if (error) break;<br />
- if (s is ready for reading) {<br />
- <indent>
- if (read (s, junk_buffer, sizeof (junk_buffer)) <= 0) {<br />
- <indent>
- break;<br />
- </indent>
- }<br />
- /* just toss away whatever is here */<br />
- </indent>
- }<br />
- </indent>
- }<br />
- <br />
- close (s);<br />
- </indent>
- }
- </example>
-
- <p>This naturally adds some expense at the end of a connection,
- but it is required for a reliable implementation. As HTTP/1.1
- becomes more prevalent, and all connections are persistent,
- this expense will be amortized over more requests. If you want
- to play with fire and disable this feature you can define
- <code>NO_LINGCLOSE</code>, but this is not recommended at all.
- In particular, as HTTP/1.1 pipelined persistent connections
- come into use <code>lingering_close</code> is an absolute
- necessity (and <a
- href="http://www.w3.org/Protocols/HTTP/Performance/Pipeline.html">
- pipelined connections are faster</a>, so you want to support
- them).</p>
-
- </section>
-
- <section>
-
- <title>Scoreboard File</title>
-
- <p>Apache's parent and children communicate with each other
- through something called the scoreboard. Ideally this should be
- implemented in shared memory. For those operating systems that
- we either have access to, or have been given detailed ports
- for, it typically is implemented using shared memory. The rest
- default to using an on-disk file. The on-disk file is not only
- slow, but it is unreliable (and less featured). Peruse the
- <code>src/main/conf.h</code> file for your architecture and
- look for either <code>USE_MMAP_SCOREBOARD</code> or
- <code>USE_SHMGET_SCOREBOARD</code>. Defining one of those two
- (as well as their companions <code>HAVE_MMAP</code> and
- <code>HAVE_SHMGET</code> respectively) enables the supplied
- shared memory code. If your system has another type of shared
- memory, edit the file <code>src/main/http_main.c</code> and add
- the hooks necessary to use it in Apache. (Send us back a patch
- too please.)</p>
-
- <note>Historical note: The Linux port of Apache didn't start to
- use shared memory until version 1.2 of Apache. This oversight
- resulted in really poor and unreliable behaviour of earlier
- versions of Apache on Linux.</note>
-
- </section>
-
- <section>
-
<title>DYNAMIC_MODULE_LIMIT</title>
<p>If you have no intention of using dynamically loaded modules