Dan McCormick wrote:
> 
> Hi,
> 
> After struggling with trying to figure out mod_proxy's caching algorithm
> and noting from the list archives that others had, too -- and due to
> the dearth of existing documentation on the subject -- I came up with
> some documentation below by sifting through the source code.  Most of it
> isn't explicitly mod_perl-related, but I hope those trying to set it up

Thanks for the read.  Very enlightening.  I'm guessing
the dir levels matter because they let the files be
spread over that many more directories, so there isn't a 
large directory hashing penalty on a HUGE number of files.
5 is probably a bit much though if it really creates 4-5
directories for each file it stores, and if you are using
this only for a proxy in reverse mode for mod_perl, it's likely
you could get away with 2-3 levels.

I think it would be interesting if you chronicled the capacity 
improvements to your site using the mod_proxy server like this.  
I don't know how well mod_proxy does this caching from a performance
perspective, and it might be nice to see some numbers that
one could later compare with some of the commercial caching
products.

--Joshua

> will find it useful.  Included at the end is a Perl script to determine
> the filename that mod_proxy uses to cache files, which is helpful in
> manually cleaning up the cache.  If anyone has comments or
> suggestions, please let me know.
> 
> Thanks,
> Dan
> 
> ------------------------------------
> 
> Setting up Apache with mod_proxy to cache content from a mod_perl server
> 
> The documentation for mod_proxy can be found at
> http://httpd.apache.org/docs/mod/mod_proxy.html.  Unfortunately, aside
> from the configuration parameters, not much detail is provided on how to
> set up mod_proxy to cache pages from a downstream server.  This
> explanation hopes to fill that void.  Most of its content was derived by
> going through the mod_proxy.c, proxy_cache.c, and proxy_util.c source
> files and comments in the src/modules/proxy directory of the Apache
> 1.3.12 distribution.
> 
> * The Short Story
> 
> In short, mod_proxy will cache all responses that contain a Last-Modified
> header and an Expires header.  You can add these headers in your mod_perl
> scripts with something like this:
> 
> use Apache::File ();
> use HTTP::Date;
> 
> # see Eagle book p. 493 for explanation
> $r->set_last_modified((stat $r->finfo)[9]);
> # expires in one day
> $r->header_out('Expires', HTTP::Date::time2str(time + 24*60*60));
> 
> The page will live in the cache until the current time passes the time
> defined by the Expires header or the time since the file was cached
> exceeds the CacheMaxExpire parameter as set in the server config file.
> 
> * The Long Story
> 
> To understand how the caching proxy server works, let's trace the flow
> of two simple HTTP exchanges for the same file, from the browser request
> to the returned page.
> 
> - The browser makes a request to the proxy server like this:
> 
> GET /index.html HTTP/1.0
> 
> - The proxy server takes the URL and converts it to a filename on your
> filesystem.  This filename has no resemblance to the actual URL.
> Instead, it is an MD5 hash of the fully qualified URL (e.g.
> http://www.myserver.com:80/mypage.html) to the document and is broken up
> in a number of directory levels, as defined by the CacheDirLevels
> parameter in the config file.  (WHY DOES IT MATTER HOW MANY DIR LEVELS
> ARE IN THE CACHE?)  Each of these directories will have a certain number
> of characters in its name, as defined by the CacheDirLength parameter in
> the config file.  The directories will live under CacheRoot, also
> defined in the config file.  For example, /index.html might be converted
> to /proxy_cache/m/EYRopVKBHMrHd2VF6WXOQ (with CacheDirLevels and
> CacheDirLength set to 1 and CacheRoot set to /proxy_cache).
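
A minimal sketch of the directory split described above, assuming the
22-character encoded hash has already been computed (the helper name
split_cache_name is made up for illustration and is not part of
mod_proxy).  It peels CacheDirLength characters off the front of the
name once per CacheDirLevels:

```perl
use strict;
use warnings;

# Hypothetical helper: split an already-encoded hash string into
# $levels directory components of $length characters each, followed
# by whatever is left of the name.
sub split_cache_name {
    my ($encoded, $levels, $length) = @_;
    my @parts;
    for (1 .. $levels) {
        # 4-argument substr removes the leading characters from our copy
        push @parts, substr($encoded, 0, $length, '');
    }
    return join '/', @parts, $encoded;
}

# The encoded name from the example above, with one level of length 1
print split_cache_name('mEYRopVKBHMrHd2VF6WXOQ', 1, 1), "\n";
```

With more levels, the same name is spread over more (and therefore
smaller) directories, at the cost of extra directories per cached file.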
> 
> - For this example, we'll assume that at this point the cached file does
> not exist.  The proxy server then consequently forwards the request to
> the mod_perl server and gets a response back.  The response will then be
> cached UNLESS any of the following conditions are true
> (ap_proxy_cache_update):
>  - The HTTP status returned by the mod_perl server is not one of OK,
> HTTP_MOVED_PERMANENTLY, or HTTP_NOT_MODIFIED
>  - The response does not contain an Expires header
>  - The response contains an Expires header that Apache can't parse
>  - The HTTP status is OK but there's not a Last-Modified header
>  - The mod_perl server sent only an HTTP header
>  - The mod_perl server sent an Authorization field in the header
> (Furthermore, if any of the above conditions are met, any existing
> cached file will be deleted.)
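
The list above can be read as a single yes/no check.  Here is a rough
sketch of it in Perl, not the actual Apache code: the function name and
argument names are invented, and the Expires and Last-Modified values
are assumed to be already parsed into epoch seconds (undef when the
header is missing or could not be parsed).

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the conditions listed above.
# Arguments (all names are assumptions for this sketch):
#   status        - numeric HTTP status from the mod_perl server
#   expires       - Expires as epoch seconds, undef if absent/unparseable
#   last_modified - Last-Modified as epoch seconds, undef if absent
#   header_only   - true if only an HTTP header was sent
#   authorization - true if an Authorization field was present
sub is_cacheable {
    my (%resp) = @_;
    my %ok_status = map { $_ => 1 } (200, 301, 304);

    return 0 unless $ok_status{ $resp{status} };
    return 0 unless defined $resp{expires};
    return 0 if $resp{status} == 200 && !defined $resp{last_modified};
    return 0 if $resp{header_only};
    return 0 if $resp{authorization};
    return 1;
}

my $now = time;
print is_cacheable(
    status        => 200,
    expires       => $now + 24*60*60,
    last_modified => $now - 60,
) ? "would be cached\n" : "would not be cached\n";
```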
> 
> - If the server decides to cache the file, it will store the file
> exactly as it was received from the mod_perl server, with the addition
> of a one-line header at the start of the file.  This header contains the
> following information in the following format:
> <current time> <last modified time> <expiration time> <"version">
> <content length>
> 
> All times are stored as hex seconds since 1970 and are taken from the
> HTTP header sent by the mod_perl server.  Values that are missing or
> cannot be parsed are filled in as follows:
>  - If the current time cannot be parsed from this header, the proxy
> server determines the current time itself and uses that.
>  - If the Last-Modified time cannot be parsed, it is set to the
> Last-Modified time of the existing cached file, if it exists.
>  - If the Last-Modified time is in the future, it is set to the current
> time as determined previously.
>  - If the Expires time cannot be parsed and a Last-Modified time exists
> from the previous step, the Expires time is set to
> "now + min((date - lastmod) * factor, maxexpire)" (as noted in the
> source code comments), where factor and maxexpire are the
> CacheLastModifiedFactor and CacheMaxExpire parameters in the config
> file.
>  - If the Expires time cannot be parsed and there is no Last-Modified
> time, the Expires time is set to "now + defaultexpire", where
> 'defaultexpire' is the CacheDefaultExpire parameter in the config file.
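
To make the two fallback formulas for the Expires time concrete, here
is a small sketch.  The function and argument names are invented, and
it assumes every value has already been converted to seconds (the
config file may well use other units), so treat it as an illustration
of the arithmetic only:

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper: compute the fallback expiry when the Expires
# header could not be parsed.  All values are assumed to be in seconds.
#   now, date       - current time and the Date of the response
#   last_modified   - undef if no Last-Modified time is available
#   factor          - stands in for CacheLastModifiedFactor
#   max_expire      - stands in for CacheMaxExpire
#   default_expire  - stands in for CacheDefaultExpire
sub fallback_expiry {
    my (%a) = @_;
    if (defined $a{last_modified}) {
        my $age = ($a{date} - $a{last_modified}) * $a{factor};
        return $a{now} + min(int($age), $a{max_expire});
    }
    return $a{now} + $a{default_expire};
}

my $now = time;
my $expiry = fallback_expiry(
    now            => $now,
    date           => $now,
    last_modified  => $now - 10*60*60,   # changed ten hours ago
    factor         => 0.5,
    max_expire     => 24*60*60,
    default_expire => 60*60,
);
printf "expires %d seconds from now\n", $expiry - $now;
```

The effect is that a document that has not changed for a long time is
kept longer, up to the maxexpire cap.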
> 
> The "version" number stored in this file is an integer that is
> incremented each time the file is overwritten by a fresh response from
> the mod_perl server.
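
For poking at a cache file by hand, the first line can be decoded with
something like the sketch below.  It assumes the five fields are
separated by whitespace in the order given above; only the three time
fields are converted from hex, since the text above does not say how
the version and content length are encoded, so those are returned as
raw strings.  The function name is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: decode the one-line header at the top of a
# cached file.  Returns a hash reference, or undef if the line does
# not look like five whitespace-separated fields with hex times.
sub parse_cache_header {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    my @f = split ' ', $line;
    return unless @f == 5;
    for my $t (@f[0 .. 2]) {
        return unless $t =~ /\A[0-9A-Fa-f]{1,8}\z/;
    }
    return {
        date           => hex($f[0]),
        last_modified  => hex($f[1]),
        expires        => hex($f[2]),
        version        => $f[3],   # left undecoded (encoding assumed unknown)
        content_length => $f[4],   # left undecoded (encoding assumed unknown)
    };
}

# Made-up sample line, not taken from a real cache file
my $h = parse_cache_header("3A000000 39F00000 3A100000 00000001 00001000\n");
printf "expires at %s GMT\n", scalar gmtime $h->{expires} if $h;
```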
> 
> The permissions on the cached files are quite strict: they can be read
> and written only by the web server user.  Furthermore, the directories
> created in the cache filesystem can only be viewed by the web server
> user.
> 
> - If the status sent by the mod_perl server was a "304 Not Modified"
> header and the "Last Modified" time, as determined in the steps above,
> is before the "If-Modified-Since" time sent by the browser, then the
> proxy server sends a "304 Not Modified" response to the browser.
> Otherwise, the full file, as returned by the mod_perl server, is sent to
> the browser.
> 
> - Time passes.
> 
> - The browser makes another request for the file.  The URL is again
> converted to a filename and this time the file is found in the cache.
> At this point, the following checks are performed (ap_proxy_cache_check)
> and, if all are true, the server proceeds to the next step.  If any are
> false, the server does not use the cached file:
>  - The request is a GET request
>  - There is no 'Pragma: No-Cache' in the HTTP header sent by the browser
>  - There is no 'Authorization' field in the HTTP header sent by the
> browser
> 
> (NOTE this should mean that all HEAD requests are passed through to the
> mod_perl server.  However, in practice, this does not seem to be the
> case.  Instead, HEAD requests are passed through unless there is an
> unexpired file in the cache (retrieved via a previous GET request), in
> which case that is used.  I may be misreading the code -- the check for
> the GET request is on line 714 of proxy_cache.c, if you're interested.)
> 
> - If the above conditions are true, the proxy server opens the cached
> file, examines the first line of data, and follows this logic:
> 
>         If the "Expires" time listed in the first line of the cached
> file has not been reached then it will use the cached file.  It must
> then decide whether to send the file or just send a "304 Not Modified"
> header.  If the "If-Modified-Since" time sent by the browser is greater
> than or equal to the "Last-Modified" time in the cached file then the
> proxy server sends a "304 Not Modified" response back to the browser,
> telling it to use its locally cached copy of the file; otherwise, it
> sends the cached file.
> 
>         If the "Expires" time *has* been reached, the proxy server then
> re-requests the file from the mod_perl server, sends that back to the
> client, and writes the new response to the cache file.
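
The logic for a file found in the cache boils down to two comparisons.
A sketch, with invented names and all times as epoch seconds
(if_modified_since is undef when the browser did not send that header):

```perl
use strict;
use warnings;

# Hypothetical helper: decide what to do with a cached file.
# Returns one of 'refetch', 'send_304' or 'send_cached'.
sub cache_decision {
    my (%a) = @_;

    # Expires time has been reached: go back to the mod_perl server
    return 'refetch' if $a{now} >= $a{expires};

    # Still fresh: the browser's copy is current if its
    # If-Modified-Since time is >= the cached Last-Modified time
    return 'send_304'
        if defined $a{if_modified_since}
        && $a{if_modified_since} >= $a{last_modified};

    return 'send_cached';
}

print cache_decision(
    now               => 1_000,
    expires           => 2_000,
    last_modified     => 500,
    if_modified_since => 600,
), "\n";
```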
> 
> Various Questions:
> 
> * Is / cached separately from /index.html?
> 
> Yes.  The cache filenames are based on the URL before any aliasing takes
> place.
> 
> * How can I tell if mod_proxy is caching requests?
> 
> Open two terminal windows and tail the output of the access logs (i.e.,
> 'tail -f access_log') on both the proxy and the mod_perl server.  Then,
> use your browser to make a request to the proxy server and watch both
> logs.  If you see your request in the mod_perl server access log, the
> file's not being cached; if you don't, it is.
> 
> * Can I store the cache on an NFS server used by two or more httpd
> binaries serving the same document root?
> 
> Yes.  The servers will all use the same names for the cache files.
> 
> * How are HEAD requests handled?
> 
> HEAD requests are passed to the mod_perl server UNLESS the URL has been
> cached previously with a GET request, in which case they are served from
> the cache.
> 
> * Is there a quick hack I can use to include the Expires and
> Last-Modified headers in my Apache::ASP scripts?
> 
> Yes.  Throw this into your global.asa file:
> 
> use HTTP::Date ();
>
> sub Script_OnStart {
>         $Response->{Expires} = 60*60*24; # expires in a day
>         my $last_modified = (stat $0)[9];
>         $main::Response->AddHeader('Last-Modified',
>                 HTTP::Date::time2str($last_modified));
> }
> 
> This will add an Expires field (of one day in the future) and a
> Last-Modified field (of the file modification time) to all your pages.
> 
> * How does the garbage collection system work?
> 
> I don't know; I didn't investigate that.  Sorry.  Presumably, it combs
> the cache every CacheGcInterval hours, as defined in the config file,
> and deletes files if the cache is greater than CacheSize, also defined
> in the config file.  Exactly *which* files are deleted is still a
> mystery.
> 
> * How do I clear the cache?
> 
> The proxy server will re-request a file when its expiration date, as
> stored in the first line of the cached file, has been reached.  So you
> could write a routine to change that expiration date.  Or you could just
> delete the file.  But finding the file to delete is tricky.  Here's a
> script that was ported from the Apache C code that should work (NOTE
> this works only for case-sensitive filesystems; a slightly different
> algorithm is used for case-insensitive filesystems -- see
> proxy_util.c in the Apache sources):
> 
> #!/usr/bin/perl
> # Convert $URL to a mod_proxy cache filename
> # Ported blindly from src/modules/proxy/proxy_util.c in the Apache
> # 1.3.12 distribution
> 
> use strict;
> use Digest::MD5 qw(md5);
> 
> # this should be the URL that the proxy server is fetching from the
> # mod_perl server
> my $URL = 'http://www.myserver.com:80/myfile.html';
> 
> my $ndepth = 1; # set to CacheDirLevels in your proxy conf file
> my $nlength = 1; # set to CacheDirLength in your proxy conf file
> 
> my @digest = split //, md5($URL);
> my @enc_table = split //,
> "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_@";
> 
> my $x = ''; my @tmp = ();
> my ($i, $k, $d);
> for ($i = 0, $k = 0; $i < 15; $i += 3) {
>     $x = (ord($digest[$i]) << 16) | (ord($digest[$i + 1]) << 8) |
> ord($digest[$i + 2]);
>     $tmp[$k++] = $enc_table[$x >> 18];
>     $tmp[$k++] = $enc_table[($x >> 12) & 0x3f];
>     $tmp[$k++] = $enc_table[($x >> 6) & 0x3f];
>     $tmp[$k++] = $enc_table[$x & 0x3f];
> }
> 
> # one byte left
> $x = ord($digest[15]);
> $tmp[$k++] = $enc_table[$x >> 2];   # use up 6 bits
> $tmp[$k++] = $enc_table[($x << 4) & 0x3f];
> 
> # now split into directory levels
> 
> my @val = ();
> for ($i = $k = $d = 0; $d < $ndepth; ++$d) {
> #   memcpy(&val[i], &tmp[k], nlength);
>     @val[$i..($i+$nlength-1)] = @tmp[$k..($k+$nlength-1)];
> 
>     $k += $nlength;
>     $val[$i + $nlength] = '/';
>     $i += $nlength + 1;
> }
> 
> #memcpy(&val[i], &tmp[k], 22 - k);
> @val[$i..($i+21-$k)] = @tmp[$k..21];
> 
> print join ('', @val), "\n";
