After a few face-to-face discussions with some of you (Stefano on the phone
ranting about setting up Tomcat, Jeremy over lunch at my place a few weeks
ago, and several others), and a few odd questions popping up on the list, I
feel the need to tell you why my vision is so narrow when someone touches
the "Apache" subject.

As I said several times in the past 6 years, I've learnt how to use Apache
(1.3 first, and 2.0 lately) to suit my needs and I would never envision an
HTTP server running without it.

Given my "pragmatic" vision, it's hard to explain "why" I am so biased, and
probably the best way out of the loop is to share the few things I have
learnt, which make my everyday life as an administrator easy...

So, here are a few tips for those of you who wonder about my rants. (I should
really post them to the Wiki, but dammit, I don't know how to use it :-)


Why Apache as a front end?
--------------------------

Probably the first and most important question to answer is WHY it is so
important to have Apache HTTPd as a front-end for a website.

I believe that for anyone, there's nothing more annoying than hitting a web
page, waiting for a few seconds, and then seeing our favorite browser come
up with "The connection was refused when attempting to contact
http://www.domain.tld/".

In my opinion (and my boss's) it is unacceptable to have "downtime" on a
website, and if it does happen, whoever connects needs to know what's going
on, or, at least, we need to tell them something: "We are sorry, but
currently http://www.domain.tld/ is unavailable because of essential system
upgrades. We expect to resume all our services in less than 10 minutes.
Please check back later" sounds so much better (maybe with our nice little
logo, and yada, yada, yada).

When I once asked Brian Behlendorf why Apache was doing some odd things in
the code, he responded "Call it defensive programming". This explains the
entire vision behind Apache: no matter what, Apache can _not_ "go down" and
stop responding to HTTP requests. Its design is centered around this idea,
so in my opinion (and experience) it is the one option that allows us to
achieve our goal of "zero port 80 downtime".

Apache's design enforces a multi-process model: there is always a minimal
parent process holding port 80 (as safe and minimal as possible), which
keeps a pool of child OS processes doing the actual work and replaces any
that die. This means that even in the worst case scenario (a segmentation
violation in the code that dumps an entire OS process), only that child is
lost, port 80 stays bound, and the server keeps answering requests.

A Java-based web server cannot achieve this. Java runs as a single OS
process, and if something happens to it, it will just exit, unbinding
port 80 and leaving our clients with "connection refused".

There is another issue, an important one, about security. Java does not
support switching user-ID after it has started, and under UNIX operating
systems, everyone knows that no one apart from "root" can bind to ports
below 1024.

In our case this is a problem: either I run my service as root (and that is
NOT a good idea), or I bind to some port above 1024 (usually 8080). But
then complexity arises when forwarding requests for port 80 (our usual HTTP
service) to that high port. Whether we use firewall packages or port
remappers, any of those solutions involves some degree of complexity.

Apache avoids all that. Being native, it can bind to ports below 1024 as
root and then serve requests as a non-privileged user, allowing us to run
our servlet container as a non-privileged user as well.
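
A minimal sketch of what this looks like in httpd.conf, assuming Apache is
started as root. The account name "www" is only a placeholder (it is not
from any real setup); use whatever unprivileged user exists on your system:

    # The parent process, started as root, binds the privileged port...
    Listen 80

    # ...while the children that actually serve requests switch to an
    # unprivileged account. "www" is a hypothetical user/group name.
    User  www
    Group www

The servlet container can then be started directly under its own
unprivileged account on port 8080, since it never needs to touch port 80.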

But those are not the only advantages. Apache helps us in many more ways,
and I hope to be able to show you what and how...


What Apache? How Apache?
------------------------

Which version of Apache you run is a very personal choice. In the examples
that follow I will assume you're going to use Apache 2.0, as it is now
_stable_ and performs much better than the "old" 1.3.

For several months now, most of the sites hosted by VNU (my employer) have
been running 2.0 (apart from our old legacy "rolaren" server), and in my
personal experience I have never had a single problem.

Apache 2.0, though, is somewhat more "difficult" to build and configure: the
hardest choice is the selection of the MPM (Multi-Processing Module) to
use. Read the manual to choose what suits you best, but in my case the
"worker" MPM (multi-process, multi-threaded) is the one giving me the best
balance of performance and stability.

The "www.apache.org" website, on the other hand, uses the "prefork" MPM
(multi-process, single-threaded, exactly as Apache 1.3 did), but I feel that
under certain operating systems it is slightly slower than "worker". Your
choice.
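
As a rough sketch of what choosing "worker" means in httpd.conf, the block
below sizes the pool of processes and the threads inside each of them. The
numbers are illustrative starting values only (an assumption, not tuned
recommendations and not taken from any of the sites mentioned above), so
check the MPM documentation before relying on them:

    <IfModule worker.c>
        # Child processes started at boot
        StartServers          2
        # Threads created inside each child process
        ThreadsPerChild      25
        # Keep the number of idle threads between these two bounds
        MinSpareThreads      25
        MaxSpareThreads      75
        # Upper limit on simultaneous requests (threads over all children)
        MaxClients          150
        # 0 means a child is never recycled after a number of requests
        MaxRequestsPerChild   0
    </IfModule>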

As a reference, I configure Apache 2.0 in the following way:

./configure \
    --with-mpm=worker \
    --enable-modules=all \
    --enable-mods-shared=all \
    --enable-proxy \
    --enable-proxy-http \
    --disable-ipv6

Basically, I use the "worker" MPM, all modules are compiled as DSOs
(dynamically loaded, so that I can disable the ones I don't use), including
the proxy and proxy-http modules, and I don't care for IPv6 support.
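
With everything built as a DSO, only the modules that httpd.conf actually
loads are active. A sketch of the LoadModule lines that the configurations
in this message rely on, assuming the default layout where the modules are
installed under "modules/" relative to the ServerRoot (adjust the paths if
your build puts them elsewhere):

    # Reverse proxying to the servlet container
    LoadModule proxy_module      modules/mod_proxy.so
    LoadModule proxy_http_module modules/mod_proxy_http.so

    # URL rewriting, for the more complex rules
    LoadModule rewrite_module    modules/mod_rewrite.so

    # Server-Side Includes, AddType/AddOutputFilter, and the
    # Order/Deny access controls
    LoadModule include_module    modules/mod_include.so
    LoadModule mime_module       modules/mod_mime.so
    LoadModule access_module     modules/mod_access.so

Commenting out a LoadModule line is all it takes to disable a module you
don't use.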


Connecting Cocoon
-----------------

Like Stefano, I had several headaches trying to connect Apache and [name
your Servlet container of choice]. Mod_JK (JK2) doesn't work for me.
Mod_webapp works for me, but just for me because I'm the author, and I was
sadly forced to abandon its development. The only solution I see (and the
one which currently works best for me) is mod_proxy.

Mod_proxy is a nice little module, especially in Apache 2.0 where its
caching part is completely decoupled into another module (mod_cache). It's
very small, lightweight, and does the job...

Plus, you have the advantage of choosing whatever servlet container you want
in the backend: Orion, WebSphere, Tomcat, Jetty, you name it, it supports
HTTP :-) (well, apart from ServletExec, but that's another story, and if
someone wants some hints, let me know).

Connecting Cocoon is _simple_: all you have to do is configure your servlet
container to run on a high port (8080 for example), make sure it runs as a
non-privileged user, make sure it knows that it sits behind an HTTP proxy
(Cocoon, Jetty, Resin, Orion, ... they all have this concept, check the
documentation), and configure Apache with these two lines:

    ProxyPass        / http://localhost:8080/
    ProxyPassReverse / http://localhost:8080/

The first one tells Apache that any request whatsoever (from / onwards) gets
"proxied" to localhost:8080, and the second one tells Apache to make sure
that any "Location" HTTP header coming back gets rewritten accordingly (just
in case your Servlet container doesn't let you set the "proxied"
configuration).

That's _IT_. It runs, and it runs smoothly.
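
Two optional additions are worth knowing about. This is a sketch rather
than a required setup: ProxyRequests and ProxyPreserveHost are standard
mod_proxy directives (the latter needs Apache 2.0.31 or later), and whether
you want the second one depends on how your container builds its URLs:

    # Make it explicit that Apache acts only as a reverse proxy and never
    # as an open forward proxy (Off is already the default).
    ProxyRequests Off

    # Pass the original "Host" header to the backend instead of
    # "localhost:8080". Useful when the container builds absolute URLs
    # from that header and cannot be told that it sits behind a proxy.
    ProxyPreserveHost On

    ProxyPass        / http://localhost:8080/
    ProxyPassReverse / http://localhost:8080/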


Trivially serving static files
------------------------------

Now, Apache is _definitely_ faster than any Java-based servlet container at
serving files straight to HTTP clients. This is because nowadays it uses
(where available) a kernel function called "sendfile", which makes its
performance far greater than anything Java can do.

Using mod_proxy and its set of ProxyPass configuration directives doesn't
allow us to set a "pattern" matching the resources to be served straight
off the filesystem; it only allows us to define exclusion lists and
processing lists.

In my example, then, I will rewrite my configuration to make Apache serve
everything beginning with "/static/" straight out of my web application,
without even touching the servlet container:

    # Make sure that my document root points to the root of the web
    # application (where the WEB-INF is located, for instance).
    DocumentRoot /export/webapps/cocoon

    # We don't proxy any request beginning with the keyword "/static/".
    # So, for example, "/static/logo.gif" will be served directly by
    # Apache from the "/export/webapps/cocoon/static/logo.gif" file
    ProxyPass        /static/ !

    # Another one for "favicon.ico", so that explorer and mozilla are happy
    ProxyPass        /favicon.ico !
    
    # And now we send back to the servlet engine everything else that does
    # not begin with "/static/" or "/favicon.ico"
    ProxyPass        / http://localhost:8080/
    ProxyPassReverse / http://localhost:8080/

Simple, the "!" keyword in ProxyPass means "don't" :-)


The holding page
----------------

If you used one of the configurations above, you'll see that if your servlet
container is not responding on port 8080 for any reason, you will get a nice
"Bad Gateway" error page (HTTP 502 Error).

As that page is quite ugly (I have to admit that the HTTPd freaks are not
good HTML artists), you might want to point your clients to a
better-designed page (or one containing some lame excuse for why your
servlet container is down).

You can do that easily (again) by using the ErrorDocument directive. Note,
though, that the page it points to has to be served by Apache itself (so it
must not be proxied). Either you get down and dirty with your mod_alias
configuration, or you simply use the second configuration (the one that
excludes "/static/") and include the page in your webapp as a static file.
Anyway, what you have to specify in that case is simply:

    # If mod_proxy cannot connect to the servlet container, we want
    # to display a nice static page saying the reason
    ErrorDocument 502 /static/unavailable.html
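
For completeness, a sketch of the mod_alias route, for when the holding
page should live outside the web application. The "/errors/" URL prefix and
the "/export/htdocs/errors" directory are made-up names for illustration:

    # Map a URL prefix onto a directory outside the webapp
    # (both names are hypothetical).
    Alias /errors/ /export/htdocs/errors/

    <Directory "/export/htdocs/errors">
        Order allow,deny
        Allow from all
    </Directory>

    # Keep that prefix away from mod_proxy. This line has to come before
    # the catch-all "ProxyPass / ..." line, as the first match wins.
    ProxyPass /errors/ !

    ErrorDocument 502 /errors/unavailable.html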

If (for example) you wanted to use Server-Side Includes to render your page
(it might be nice to display something like the host name, or the time when
the request was received), you can do so by using SHTML files. This is what
I use at home:

<html>
  <head>
    <title><!--#echo var="SERVER_NAME"-->: server off-line</title>
  </head>
  <body>
    <h3><!--#echo var="SERVER_NAME"-->: server off-line</h3>
    <p>
      We are sorry, but the server is temporarily unavailable due to
      maintenance. Our team is working to restore service as soon as
      possible.<br />
      In case of troubles, please feel free to contact our webmaster
      sending an email to
      <a href="mailto:<!--#echo var="SERVER_ADMIN"-->">
        &lt;<!--#echo var="SERVER_ADMIN"-->&gt;
      </a>.
    </p>
    <hr/>
    <p>
      <small>
        <!--#echo var="SERVER_SOFTWARE"--> running on
        <!--#echo var="SERVER_NAME"-->:<!--#echo var="SERVER_PORT"-->
        at <!--#echo var="DATE_LOCAL"-->.
      </small>
    </p>
  </body>
</html>

And to make it work properly, this is what your httpd.conf will have to
look like:

    # Make sure that Server Side Includes are processed and sent
    # to the client with mime-type as text/html
    AddType text/html .shtml
    AddOutputFilter Includes .shtml

    # Make sure that our SHTMLs are processed in the static
    # directory
    <Directory "/export/webapps/cocoon">
        Options +IncludesNoExec
    </Directory>

    # If mod_proxy cannot connect to the servlet container, we want
    # to display a nice static page saying the reason. This is a
    # SHTML page (using the Server-Side-Includes filter)
    ErrorDocument 502 /static/unavailable.shtml


Putting it all together with mod_proxy
--------------------------------------

Ok, now that we have seen how each piece fits, let's put them all
together, adding also that any request to "/WEB-INF/" should be forbidden
straight away (there's no point in proxying them when we know that the
servlet container will block them all):

    # Make sure that my document root points to the root of the web
    # application (where the WEB-INF is located, for instance).
    DocumentRoot /export/webapps/cocoon

    # Make sure that Server Side Includes are processed and sent
    # to the client with mime-type as text/html
    AddType text/html .shtml
    AddOutputFilter Includes .shtml

    # Make sure that our SHTMLs are processed in the static
    # directory
    <Directory "/export/webapps/cocoon">
        Options +IncludesNoExec
    </Directory>

    # Block the stupid "WEB-INF" pseudo-url (god I wish web-applications
    # were designed with some intelligence... Ok, my fault as well)
    <Location /WEB-INF>
        Order deny,allow
        Deny from all
    </Location>

    # If mod_proxy cannot connect to the servlet container, we want
    # to display a nice static page saying the reason. This is a
    # SHTML page (using the Server-Side-Includes filter)
    ErrorDocument 502 /static/unavailable.shtml

    # We don't proxy any request beginning with the keyword "/static/".
    # So, for example, "/static/logo.gif" will be served directly by
    # Apache from the "/export/webapps/cocoon/static/logo.gif" file
    ProxyPass        /static/ !

    # Another one for "favicon.ico", so that explorer and mozilla are happy
    ProxyPass        /favicon.ico !
    
    # And now we send back to the servlet engine everything else that does
    # not begin with "/static/" or "/favicon.ico"
    ProxyPass        / http://localhost:8080/
    ProxyPassReverse / http://localhost:8080/

Simple, easy, beautiful...


A more complex example: mod_rewrite
-----------------------------------

This is all nice and clean, but if we want to be really nasty and start
serving (for example) all our GIF and JPG files straight via Apache, we
would need to use mod_rewrite.

I know, mod_rewrite is ugly, it uses Perl-style regular expressions (so,
well, it's even slightly slower), but mod_proxy is way too crummy: it's
either "in" or "out", and it takes over the whole world (you can't really do
much else after you have said you're going to forward a URL).

So, mod_rewrite, even if it's ugly, even if it's slower, _is_ our solution.
With a couple of rules, we can take the configuration written above to the
extreme, and basically do WHATEVER we want with a URL _before_ it even knows
about a possible servlet container in the backend.

I suggest you read the mod_rewrite documentation _carefully_ but, as a
start, I'm going to rewrite what's written above using rewrite rules and
their flags. From here on, you're on your own :-) :-)

    # Make sure that my document root points to the root of the web
    # application (where the WEB-INF is located, for instance).
    DocumentRoot /export/webapps/cocoon

    # Make sure that Server Side Includes are processed and sent
    # to the client with mime-type as text/html
    AddType text/html .shtml
    AddOutputFilter Includes .shtml

    # Make sure that our SHTMLs are processed in the static
    # directory
    <Directory "/export/webapps/cocoon">
        Options +IncludesNoExec
    </Directory>

    # If mod_proxy cannot connect to the servlet container, we want
    # to display a nice static page saying the reason. This is a
    # SHTML page (using the Server-Side-Includes filter)
    ErrorDocument 502 /static/unavailable.shtml

    # The nastiness begins, let's fire up the "rewrite engine"
    RewriteEngine On

    # Everything that starts with "/static" or "/static/" is served straight
    # through: no redirection, no proxying, no nothing, and the [L] flag
    # implies that if this rule is matched, no other matching must be
    # performed
    RewriteRule "^/static/?(.*)" "$0" [L]

    # Everything that starts with a NON-CASE-SENSITIVE match (the NC flag)
    # of "/WEB-INF" or "/WEB-INF/" is forbidden (the F flag). And again,
    # this is the last rule (the L flag), nothing will be processed by the
    # rewrite engine if this rule is matched
    RewriteRule "^/WEB-INF/?(.*)" "$0" [L,F,NC]

    # Everything ending in ".gif", ".jpg" or ".jpeg" will be served again
    # directly by Apache, no need to bother the servlet container. As above
    # this is the last rule as specified by the [L] flag at the end
    RewriteRule "^/(.*)\.gif$" "$0" [L]
    RewriteRule "^/(.*)\.(jpg|jpeg)$" "$0" [L]

    # Everything else not matched above needs to go to the servlet container
    # via HTTP listening on port 8080. The [P] flag (which is required)
    # implies that our requests will be handled by mod_proxy.
    RewriteRule "^/(.*)" "http://localhost:8080/$1" [P]

    # Make sure that if the servlet container sends a redirect whose
    # "Location" HTTP header starts with "http://localhost:8080/", we
    # rewrite it and return to our client the public (not the internal)
    # location we want to redirect them to. This is _essential_.
    ProxyPassReverse / http://localhost:8080/

As I mentioned before, ugly, but _really_ effective. In a few lines we
connect the HTTP-based servlet container running Cocoon to Apache, we make
sure that if the servlet container falls over we direct people to an
appropriate holding page, we serve all that is under /static, all GIF and
all JPEG files straight off without touching Cocoon, and all the rest
through our sitemap. As a free bonus, everything that ends in ".shtml"
(from disk or from the sitemap) will be passed through the Apache
"Server-Side-Includes" filter (mod_include, which is ugly, but sometimes
_really_ effective)...
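
When a rule does not behave as expected, mod_rewrite in Apache 2.0 can log
every decision it makes. A small sketch for debugging only: the log file
name is an arbitrary choice (relative to the ServerRoot), and the level is
just a reasonable middle value between 0 (off) and 9 (everything):

    # Trace how each request is matched against the rules above.
    # Remove these two lines once things work: rewrite logging slows
    # the server down noticeably.
    RewriteLog      logs/rewrite.log
    RewriteLogLevel 3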


Conclusions
-----------

I hope to have cleared up some of the doubts about Apache, and why I love it
so much... It is a hub, a hub embracing your website and making it work
better, faster, more reliably, and fine-tuned exactly as you (or your boss)
like it.

And you can trust Apache. I believe that our spirit, the spirit of the
entire Cocoon community, is built on top of the original HTTPd vision:
let's make things work so nicely that the world won't have to look for
another solution...

HTTPd does it in its little piece of being an HTTP hub, Jetty does it in its
little piece of being a servlet container, Cocoon does it in its little
piece of being the best "web-application" framework available on the planet
right now. Together, those three little pieces _will_ conquer the world.

Have fun...

    Pier

(BTW, where the hell is Tomcat in this picture? :-)

