scale_etoys app_servers.png code_structure.png etoys.pod machine_layout.png proxy_architecture.png proxy_servers.png search_servers.png session_tracking.png

stas 29 May 2002 04:47:01 -0000

stas        02/05/28 21:46:59

  Modified:    src/docs/tutorials Changes.pod config.cfg
  Added:       src/docs/tutorials/mod_perl_tricks mod_perl_tricks.pod
               src/docs/tutorials/scale_etoys app_servers.png
                        code_structure.png etoys.pod machine_layout.png
                        proxy_architecture.png proxy_servers.png
                        search_servers.png session_tracking.png
  Log:
  * Added mod_perl_tricks::mod_perl_tricks by Lincoln Stein. Changed it
    to meet our POD style, and some references to the CPAN. [Per Einar]
  
  * Added scale_etoys::etoys tutorial by Perrin Harkins. [Per Einar]
  Submitted by: per einar
  
  Revision  Changes    Path
  1.2       +10 -7     modperl-docs/src/docs/tutorials/Changes.pod
  
  Index: Changes.pod
  ===================================================================
  RCS file: /home/cvs/modperl-docs/src/docs/tutorials/Changes.pod,v
  retrieving revision 1.1
  retrieving revision 1.2
  diff -u -r1.1 -r1.2
  --- Changes.pod       10 May 2002 07:43:22 -0000      1.1
  +++ Changes.pod       29 May 2002 04:46:59 -0000      1.2
  @@ -9,17 +9,20 @@
   
   The most recent changes are listed first.
   
  -=head1 ???
  +=head1 ...
   
  -* started the Tutorials docset [Thomas Klausner]
  +* 
   
  -* browserbugs::browserbugs moved here
  +=head1 Wed May 29 12:41:24 SGT 2002
  +
  +* Added mod_perl_tricks::mod_perl_tricks by Lincoln Stein. Changed it
  +  to meet our POD style, and some references to the CPAN. [Per Einar]
   
  -=head1 Thu Apr 18 09:04:00 CET 2002
  +* Added scale_etoys::etoys tutorial by Perrin Harkins. [Per Einar]
   
  -* templates::choosing: fixed the table "Matrix", which wasn't
  -  displayed because of bug in Pod::POM. (Per Einar Ellefsen
  -  E<lt>per.einar (at) skynet.beE<gt>)
  +* browserbugs::browserbugs moved here
  +
  +* started the Tutorials docset [Thomas Klausner]
   
   =head1 Sat Sep 15 19:45:41 SGT 2001
   
  
  
  
  1.3       +5 -1      modperl-docs/src/docs/tutorials/config.cfg
  
  Index: config.cfg
  ===================================================================
  RCS file: /home/cvs/modperl-docs/src/docs/tutorials/config.cfg,v
  retrieving revision 1.2
  retrieving revision 1.3
  diff -u -r1.2 -r1.3
  --- config.cfg        10 May 2002 07:55:43 -0000      1.2
  +++ config.cfg        29 May 2002 04:46:59 -0000      1.3
  @@ -6,11 +6,15 @@
       title => "Tutorials",
   
       abstract => <<EOB,
  -mod_perl related tutorials.
  +mod_perl related tutorials, teaching you things not only about
  +mod_perl, but also about any related topics of great interest to
  +mod_perl programmers.
   EOB
   
       chapters => [qw(
           templates/comparison.pod
  +        tricks/cool_tricks.pod
  +        scale_etoys/etoys.pod
           browserbugs/browserbugs.pod
           Changes.pod
       )],
  
  
  
  1.1                  
modperl-docs/src/docs/tutorials/mod_perl_tricks/mod_perl_tricks.pod
  
  Index: mod_perl_tricks.pod
  ===================================================================
  =head1 NAME
  
  Cute Tricks With Perl and Apache
  
  =head1 Description
  
  Perl and Apache play very well together, both for administration and
  coding. However, adding mod_perl to the mix creates a heaven for an
  administrator/programmer wanting to do cool things in no time!
  
  =head1 Part I: Web Site Care and Feeding
  
  These scripts are designed to make your life as a Webmaster easier,
  leaving you time for more exciting things, like tango lessons.
  
  =head2 Logs! Logs! Logs!
  
  Left to their own devices, the log files will grow without limit,
  eventually filling up your server's partition and bringing things to a
  grinding halt.  But wait!  Don't turn off logging or throw them away.
  Log files are your friends.
  
  =head3 Log rotation
  
  Script I.1.1 shows the basic script for rotating log files.  It
  renames the current "access_log" to "access_log.0", "access_log.0" to
  "access_log.1", and so on.  The oldest log gets deleted.  Run it from
  a cron job to keep your log files from taking over.  The faster your
  log files grow, the more frequently you should run the script.
  
  
    Script I.1.1: Basic Log File Rotation
    -------------------------------------
    #!/usr/local/bin/perl
    $LOGPATH='/usr/local/apache/logs';
    @LOGNAMES=('access_log','error_log','referer_log','agent_log');
    $PIDFILE = 'httpd.pid';
    $MAXCYCLE = 4;
    
    chdir $LOGPATH;  # Change to the log directory
    foreach $filename (@LOGNAMES) {
       for (my $s=$MAXCYCLE; $s >= 0; $s-- ) {
           $oldname = $s ? "$filename.$s" : $filename;
           $newname = join(".",$filename,$s+1);
           rename $oldname,$newname if -e $oldname;
       }
    }
    kill 'HUP',`cat $PIDFILE`;
  
  =head3 Log rotation and archiving
  
  But some people don't want to delete the old logs.  Wow, maybe some
  day you could sell them for a lot of money to a marketing and
  merchandising company!  Script I.1.2 appends the oldest to a gzip
  archive.  Log files compress extremely well and make great bedtime
  reading.
  
  
    Script I.1.2: Log File Rotation and Archiving
    ---------------------------------------------
    #!/usr/local/bin/perl
    $LOGPATH    = '/usr/local/apache/logs';
    $PIDFILE    = 'httpd.pid';
    $MAXCYCLE   = 4;
    $GZIP       = '/bin/gzip';
    
    @LOGNAMES=('access_log','error_log','referer_log','agent_log');
    %ARCHIVE=('access_log'=>1,'error_log'=>1);
    
    chdir $LOGPATH;  # Change to the log directory
    foreach $filename (@LOGNAMES) {
      system "$GZIP -c $filename.$MAXCYCLE >> $filename.gz" 
           if -e "$filename.$MAXCYCLE" and $ARCHIVE{$filename};
       for (my $s=$MAXCYCLE; $s >= 0; $s-- ) {
           $oldname = $s ? "$filename.$s" : $filename;
           $newname = join(".",$filename,$s+1);
           rename $oldname,$newname if -e $oldname;
       }
    }
    kill 'HUP',`cat $PIDFILE`;
  
  
  =head3 Log rotation, compression and archiving
  
  What's that?  Someone broke into your computer, stole your log files
  and now B<he's> selling it to a Web marketing and merchandising
  company?  Shame on them.  And on you for letting it happen.  Script
  I.1.3 uses I<idea> (part of the SSLEay package) to encrypt the log
  before compressing it.  You need GNU tar to run this one.  The log
  files are individually compressed and encrypted, and stamped with the
  current date.
  
    Script I.1.3: Log File Rotation and Encryption
    ----------------------------------------------
    #!/usr/local/bin/perl
    use POSIX 'strftime';
    
    $LOGPATH     = '/home/www/logs';
    $PIDFILE     = 'httpd.pid';
    $MAXCYCLE    = 4;
    $IDEA        = '/usr/local/ssl/bin/idea';
    $GZIP        = '/bin/gzip';
    $TAR         = '/bin/tar';
    $PASSWDFILE  = '/home/www/logs/secret.passwd';
    
    @LOGNAMES=('access_log','error_log','referer_log','agent_log');
    %ARCHIVE=('access_log'=>1,'error_log'=>1);
    
    chdir $LOGPATH;  # Change to the log directory
    foreach $filename (@LOGNAMES) {
        my $oldest = "$filename.$MAXCYCLE";
        archive($oldest) if -e $oldest and $ARCHIVE{$filename};
        for (my $s=$MAXCYCLE; $s >= 0; $s-- ) {
            $oldname = $s ? "$filename.$s" : $filename;
            $newname = join(".",$filename,$s+1);
            rename $oldname,$newname if -e $oldname;
        }
    }
    kill 'HUP',`cat $PIDFILE`;
    
    sub archive {
        my $f = shift;
        my $base = $f;
        $base =~ s/\.\d+$//;
        my $fn = strftime("$base.%Y-%m-%d_%H:%M.gz.idea",localtime);
        system "$GZIP -9 -c $f | $IDEA -kfile $PASSWDFILE > $fn";
        system "$TAR rvf $base.tar --remove-files $fn";
    }
  
  =head3 Log Parsing
  
  There's a lot you can learn from log files.  Script I.1.4 does the
  basic access log regular expression match.  What you do with the
  split-out fields is limited by your imagination.  Here's a typical log
  entry so that you can follow along (wrapped for readability):
  
    portio.cshl.org - - [03/Feb/1998:17:42:15 -0500]  
         "GET /pictures/small_logo.gif HTTP/1.0" 200 2172
  
    Script I.1.4: Basic Log Parsing
    -------------------------------
    #!/usr/local/bin/perl
    
    $REGEX=/^(\S+) (\S+) (\S+) \[([^]]+)\] "(\w+) (\S+).*" (\d+) (\S+)/;
    while (<>) {
       ($host,$rfc931,$user,$date,$request,$URL,$status,$bytes) = m/$REGEX/o;
        &collect_some_statistics;
    }
    &print_some_statistics;
    
    sub collect_some_statistics {
      # for you to fill in
    }
   
    sub print_some_statistics {
      # for you to fill in
    }
  
  
  Script I.1.5 scans the log for certain status codes and prints out the
  top URLs or hosts that triggered them.  It can be used to get
  quick-and-dirty usage statistics, to find broken links, or to detect
  certain types of breakin attempts.  Use it like this:
  
   % find_status.pl -t10 200 ~www/logs/access_log
  
    TOP 10 URLS/HOSTS WITH STATUS CODE 200:
  
      REQUESTS  URL/HOST
      --------  --------
        1845    /www/wilogo.gif
        1597    /cgi-bin/contig/sts_by_name?database=release
        1582    /WWW/faqs/www-security-faq.html
        1263    /icons/caution.xbm
         930    /
         886    /ftp/pub/software/WWW/cgi_docs.html
         773    /cgi-bin/contig/phys_map
         713    /icons/dna.gif
         686    /WWW/pics/small_awlogo.gif
  
  
    Script I.1.5: Find frequent status codes
    ----------------------------------------
    #!/usr/local/bin/perl
    # File: find_status.pl
    
    require "getopts.pl";
    &Getopts('L:t:h') || die <<USAGE;
    Usage: find_status.pl [-Lth] <code1> <code2> <code3> ...
           Scan Web server log files and list a summary
           of URLs whose requests had the one of the
           indicated status codes.
    Options:
           -L <domain>  Ignore local hosts matching this domain
           -t <integer> Print top integer URLS/HOSTS [10]
           -h           Sort by host rather than URL
    USAGE
        ;
    if ($opt_L) {
        $opt_L=~s/\./\\./g;
        $IGNORE = "(^[^.]+|$opt_L)\$";
    }
    $TOP=$opt_t || 10;
    
    while (@ARGV) {
        last unless $ARGV[0]=~/^\d+$/;
        $CODES{shift @ARGV}++;
    }
    
    while (<>) {
        ($host,$rfc931,$user,$date,$request,$URL,$status,$bytes) =
            /^(\S+) (\S+) (\S+) \[([^]]+)\] "(\w+) (\S+).*" (\d+) (\S+)/;
        next unless $CODES{$status};
        next if $IGNORE && $host=~/$IGNORE/io;
        $info = $opt_h ? $host : $URL;
        $found{$status}->{$info}++;
    }
    
    foreach $status (sort {$a<=>$b;} sort keys %CODES) {
        $info = $found{$status};
        $count = $TOP;
        foreach $i (sort {$info->{$b} <=> $info->{$a};} keys %{$info}) {
            write;
            last unless --$count;
        }
        $- = 0;  # force a new top-of-report
    }
    
    format STDOUT_TOP=
    
    TOP @## URLS/HOSTS WITH STATUS CODE @##:
        $TOP,                      $status
    
        REQUESTS  URL/HOST
        --------  --------
    .
    format STDOUT=
        @#####    @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
        $info->{$i},$i
    .
  
  =head3 Offline Reverse DNS Resolution
  
  Many sites turn off reverse name look-ups in order to improve server
  performance.  The log files will contain the IP addresses of remote
  hosts, but not their DNS names.  Script I.1.6 will do the reverse name
  resolution off-line.  You can run it before the log rotation and
  archiving scripts, preferably on a machine that isn't busy serving Web
  requests at the same time.
  
  This script maintains a cache of resolved names.  Because performance
  is more important than completeness, if an address doesn't resolve
  after two seconds, it moves on to the next one and never tries that
  name again.
  
    Script I.1.6: Reverse DNS Resolution
    ------------------------------------
    #!/usr/local/bin/perl
    
    use constant TIMEOUT => 2;
    $SIG{ALRM} = sub {die "timeout"};
    
    while (<>) {
        s/^(\S+)/lookup($1)/e;
    } continue { 
        print;
    }
    
    sub lookup {
        my $ip = shift;
        return $ip unless $ip=~/\d+\.\d+\.\d+\.\d+/;
        return $CACHE{$ip} if exists $CACHE{$ip};
        my @h = eval <<'END';
        alarm(TIMEOUT);
        my @i = gethostbyaddr(pack('C4',split('\.',$ip)),2);
        alarm(0);
        @i;
    END
        $CACHE{$ip} = $h[0];
        return $CACHE{$ip} || $ip;
    }
  
  =head3 Detecting Robots
  
  I was very upset a few months ago when I did some log analysis and
  discovered that 90% of my hits were coming from 10% of users, and that
  those 10% were all robots!  Script I.1.7 is the script I used to
  crunch the log and perform the analysis.  The script works like this:
  
  =over
  
  =item 1
  
  we assume that anyone coming from the same IP address with the same
  user agent within 30 minutes is the same person/robot (not quite
  right, but close enough).
  
  =item 2
  
  anything that fetches /robots.txt is probably a robot, and a "polite"
  one, to boot.
  
  =item 3
  
  we count the total number of accesses a user agent makes.
  
  =item 4
  
  we average the interval between successive fetches.
  
  =item 5
  
  we calculate an "index" which is the number of hits over the interval.
  Robots have higher indexes than people.
  
  =item 6
  
  we print everything out in a big tab-delimited table for graphing.
  
  =back
  
  By comparing the distribution of "polite" robots to the total
  distribution, we can make a good guess as to who the impolite robots
  are.
  
    Script I.1.7: Robo-Cop
    ----------------------
    #!/usr/local/bin/perl
    
    use Time::ParseDate;
    use strict 'vars';
    
    # after 30 minutes, we consider this a new session
    use constant MAX_INTERVAL => 60*30;  
    my (%HITS,%INT_NUMERATOR,%INT_DENOMINATOR,%POLITE,%LAST,$HITS);
    
    # This uses a non-standard agent log with lines formatted like this:
    # [08/Feb/1998:12:28:35 -0500] phila249-pri.voicenet.com "Mozilla/3.01 
(Win95; U)" /cgi-bin/fortune
    
    my $file = shift;
    open (IN,$file=~/\.gz$/ ? "zcat $file |" : $file ) || die "Can't open 
file/pipe: $!"; 
    
    while (<IN>) {
        my($date,$host,$agent,$URL) = /^\[(.+)\] (\S+) "(.*)" (\S+)$/;
        next unless $URL=~/\.(html|htm|txt)$/;
    
        $HITS++;
        $host = "$host:$agent"; # concatenate host and agent
        $HITS{$host}++;
        my $seconds = parsedate($date);
        if ($LAST{$host}) {
           my $interval = $seconds - $LAST{$host};
         if ($interval < MAX_INTERVAL) {
             $INT_NUMERATOR{$host} += $interval;
             $INT_DENOMINATOR{$host}++;
         }
        }
        $LAST{$host} = $seconds;
        $POLITE{$host}++ if $URL eq '/robots.txt';
        print STDERR $HITS,"\n" if ($HITS % 1000) == 0;
    }
   
    # print out, sorted by hits
    print join("\t",qw/Client Robot Hits Interval Hit_Percent Index/),"\n";
    foreach (sort {$HITS{$b}<=>$HITS{$a}} keys %HITS) {
        next unless $HITS{$_} >= 5;             # not enough total hits to mean 
much
        next unless $INT_DENOMINATOR{$_} >= 5;  # not enough consecutive hits 
to mean much
   
        my $mean_interval = $INT_NUMERATOR{$_}/$INT_DENOMINATOR{$_};
        my $percent_hits = 100*($HITS{$_}/$HITS);
        my $index = $percent_hits/$mean_interval;
    
        print join("\t",
                $_,
                $POLITE{$_} ? 'yes' : 'no',
                $HITS{$_},
                $mean_interval,
                $percent_hits,
                $index
                  ),"\n";
    }
  
  =head3 Logging to syslog
  
  If you run a large site with many independent servers, you might be
  annoyed that they all log into their own file systems rather than into
  a central location.  Apache offers a little-known feature that allows
  it to send its log entries to a process rather than a file.  The
  process (a Perl script, natch) can do whatever it likes with the logs.
  For instance, using Tom Christiansen's C<Syslog> module to send the
  info to a remote syslog daemon.
  
  Here's what you add to the Apache httpd.conf file:
  
    <VirtualHost www.company1.com>
       CustomLog "| /usr/local/apache/bin/logger company1" common
       # blah blah
    </VirtualHost>
    
    <VirtualHost www.company2.com>
       CustomLog "| /usr/local/apache/bin/logger company2" common
       # blah blah
    </VirtualHost>
  
  Do the same for each server on the local network.  
  
  Here's what you add to each Web server's syslog.conf (this assumes
  that the central logging host has the alias hostname "loghost":
  
    local0.info                 @loghost
  
  Here's what you add to the central log host's syslog.conf:
  
    local0.info                 /var/log/web/access_log
  
  Script I.1.8 shows the code for the "logger" program:
  
    Script I.1.8 "logger"
    ---------------------
    #!/usr/local/bin/perl
    # script: logger
    
    use Sys::Syslog;
    
    $SERVER_NAME = shift || 'www';
    $FACILITY = 'local0';
    $PRIORITY = 'info';
    
    Sys::Syslog::setlogsock('unix');
    openlog ($SERVER_NAME,'ndelay',$FACILITY);
    while (<>) {
        chomp;
        syslog($PRIORITY,$_);
    }
    closelog;
  
  =head3 Logging to a relational database
  
  One of the selling points of the big commercial Web servers is that
  they can log to relational databases via ODBC.  Big whoop.  With a
  little help from Perl, Apache can do that too.  Once you've got the
  log in a relational database, you can data mine to your heart's
  content.
  
  This example uses the freeware mySQL DBMS.  To prepare, create an
  appropriate database containing a table named "access_log".  It should
  have a structure like this one.  Add whatever indexes you think you
  need.  Also notice that we truncate URLs at 255 characters.  You might
  want to use TEXT columns instead.
  
    CREATE TABLE access_log (
          when    datetime     not null,
          host  varchar(255) not null,
          method  char(4)      not null,
          url   varchar(255) not null,
          auth  varchar(50),
          browser varchar(50),
          referer varchar(255),
          status  smallint(3)  not null,
          bytes int(8)       default 0
    );
  
  Now create the following entries in httpd.conf:
  
    LogFormat "\"%{%Y-%m-%d %H:%M:%S}t\" %h \"%r\" %u \"%{User-agent}i\" 
%{Referer}i %s %b" mysql
    CustomLog "| /usr/local/apache/bin/mysqllog" mysql
  
  Script I.1.9 is the source code for mysqllog.
  
    Script I.1.9 "mysqllog"
    -----------------------
    #!/usr/local/bin/perl
    # script: mysqllog
    use DBI;
    
    use constant DSN       => 'dbi:mysql:www';
    use constant DB_TABLE  => 'access_log';
    use constant DB_USER   => 'nobody';
    use constant DB_PASSWD => '';
    
    $PATTERN = '"([^"]+)" (\S+) "(\S+) (\S+) [^"]+" (\S+) "([^"]+)" (\S+) (\d+) 
(\S+)';
   
    $db = DBI->connect(DSN,DB_USER,DB_PASSWD) || die DBI->errstr;
    $sth = $db->prepare("INSERT INTO ${\DB_TABLE} VALUES(?,?,?,?,?,?,?,?,?)") 
        || die $db->errstr;
    while (<>) {
        chomp;
        my($date,$host,$method,$url,$user,$browser,$referer,$status,$bytes) = 
/$PATTERN/o;
        $user    = undef if $user    eq '-';
        $referer = undef if $referer eq '-';
        $browser = undef if $browser eq '-';
        $bytes   = undef if $bytes   eq '-';
        
$sth->execute($date,$host,$method,$url,$user,$browser,$referer,$status,$bytes);
    }
    $sth->finish;
    $db->disconnect;
  
  NOTE: Your database will grow very quickly.  Make sure that you have a
  plan for truncating or archiving the oldest entries.  Or have a lot of
  storage space handy!  Also be aware that this will cause a lot of
  traffic on your LAN.  Better start shopping around for 100BT hubs.
  
  =head2 My server fell down and it can't get up!
  
  Web servers are very stable and will stay up for long periods of time
  if you don't mess with them.  However, human error can bring them
  down, particularly if you have a lot of developers and authors
  involved in running the site.  The scripts in this section watch the
  server and send you an email message when there's a problem.
  
  =head3 Monitoring a local server
  
  The simplest script just tries to signal the Web server process.  If
  the process has gone away, it sends out an S.O.S.  See script I.2.1
  shows the technique.  Notice that the script has to run as I<root> in
  order to successfully signal the server.
  
    Script I.2.1 "localSOS"
    -----------------------
    #!/usr/local/bin/perl
    # script: localSOS
    
    use constant PIDFILE  => '/usr/local/apache/var/run/httpd.pid';
    $MAIL                 =  '/usr/sbin/sendmail';
    $MAIL_FLAGS           =  '-t -oi';
    $WEBMASTER            =  'webmaster';
    
    open (PID,PIDFILE) || die PIDFILE,": $!\n";
    $pid = <PID>;  close PID;
    kill 0,$pid || sos();
    
    sub sos {
      open (MAIL,"| $MAIL $MAIL_FLAGS") || die "mail: $!";
      my $date = localtime();
      print MAIL <<END;
    To: $WEBMASTER
    From: The Watchful Web Server Monitor <nobody>
    Subject: Web server is down
    
    I tried to call the Web server at $date but there was
    no answer.
    
    Respectfully yours,
    
    The Watchful Web Server Monitor   
    END
      close MAIL;
    }
  
  
  =head3 Monitoring a remote server
  
  Local monitoring won't catch problems with remote machines, and
  they'll miss subtle problems that can happen when the Web server hangs
  but doesn't actually crash.  A functional test is better.  Script
  I.2.2 uses the LWP library to send a HEAD request to a bunch of
  servers.  If any of them fails to respond, it sends out an SOS.  This
  script does B<not> have to run as a privileged user.
  
    Script I.2.2 "remoteSOS"
    ------------------------
    #!/usr/local/bin/perl
    # script: remoteSOS
    
    use LWP::Simple;
    %SERVERS = (
         "Fred's server"   => 'http://www.fred.com',
         "Martha's server" => 'http://www.stewart-living.com',
         "Bill's server"   => 'http://www.whitehouse.gov'
         );
    $MAIL                 =  '/usr/sbin/sendmail';
    $MAIL_FLAGS           =  '-t -oi';
    $WEBMASTER            =  'webmaster';
    
    foreach (sort keys %SERVERS) {
       sos($_) unless head($SERVERS{$_});
    }
    
    sub sos {
      my $server = shift;
      open (MAIL,"| $MAIL $MAIL_FLAGS") || die "mail: $!";
      my $date = localtime();
      print MAIL <<END;
    To: $WEBMASTER
    From: The Watchful Web Server Monitor <nobody>
    Subject: $server is down
    
    I tried to call $server at $date but there was
    no one at home.
    
    Respectfully yours,
    
    The Watchful Web Server Monitor   
    END
      close MAIL;
    }
  
  =head3 Resurrecting Dead Servers
  
  So it's not enough to get e-mail that the server's down, you want to
  relaunch it as well?  Script I.2.3 is a hybrid of localSOS and
  remoteSOS that tries to relaunch the local server after sending out
  the SOS.  It has to be run as B<root>, unless you've made I<apachectl>
  suid to root.
  
    Script I.2.2 "webLazarus"
    ------------------------- 
    #!/usr/local/bin/perl
    # script: webLazarus
    
    use LWP::Simple;
    use constant URL       => 'http://presto.capricorn.com/';
    use constant APACHECTL => '/usr/local/apache/bin/apachectl';
    $MAIL                  =  '/usr/sbin/sendmail';
    $MAIL_FLAGS            =  '-t -oi';
    $WEBMASTER             =  '[EMAIL PROTECTED]';
    
    head(URL) || resurrect();
    
    sub resurrect {
        open (STDOUT,"| $MAIL $MAIL_FLAGS") || die "mail: $!";
        select STDOUT; $| = 1;
        open (STDERR,">&STDOUT");
    
        my $date = localtime();
        print <<END;
    To: $WEBMASTER
    From: The Watchful Web Server Monitor <nobody>
    Subject: Web server is down
    
    I tried to call the Web server at $date but there was
    no answer.  I am going to try to resurrect it now:
    
    Mumble, mumble, mumble, shazzzzammmm!
    
    END
        ;
    
        system APACHECTL,'restart';
        
        print <<END;
   
    That's the best I could do.  Hope it helped.
    
    Worshipfully yours,
    
    The Web Monitor
    END
        close STDERR;
        close STDOUT;
    }
  
  Here's the message you get when the script is successful:
  
    Date: Sat, 4 Jul 1998 14:55:38 -0400
    To: [EMAIL PROTECTED]
    Subject: Web server is down
    
    I tried to call the Web server at Sat Jul  4 14:55:37 1998 but there was
    no answer.  I am going to try to resurrect it now:
    
    Mumble, mumble, mumble, shazzzzammmm!
    
    /usr/local/apache/bin/apachectl restart: httpd not running, trying to start
    [Sat Jul  4 14:55:38 1998] [debug] mod_so.c(258): loaded module 
setenvif_module
    [Sat Jul  4 14:55:38 1998] [debug] mod_so.c(258): loaded module 
unique_id_module
    /usr/local/apache/bin/apachectl restart: httpd started
    
    That's the best I could do.  Hope it helped.
    
    Worshipfully yours,
    
    The Web Monitor
  
  =head2 Site Replication and Mirroring
  
  Often you will want to mirror a page or set of pages from another
  server, for example, to distribute the load amongst several replicate
  servers, or to keep a set of reference pages handy.  The LWP library
  makes this easy.
  
  =head3 Mirroring Single Pages
  
    % ./MirrorOne.pl
    cats.html: Not Modified
    dogs.html: OK
    gillie_fish.html: Not Modified
  
    Script I.3.1 mirrorOne.pl
    -------------------------
    #!/usr/local/bin/perl
    # mirrorOne.pl
    
    use LWP::Simple;
    use HTTP::Status;
    
    use constant DIRECTORY => '/local/web/price_lists';
    %DOCUMENTS = (
         'dogs.html'  => 'http://www.pets.com/dogs/price_list.html',
         'cats.html'  => 'http://www.pets.com/cats/price_list.html',
         'gillie_fish.html' => 'http://aquaria.com/prices.html'
         );
    chdir DIRECTORY;
    foreach (sort keys %DOCUMENTS) {
       my $status = mirror($DOCUMENTS{$_},$_);
       warn "$_: ",status_message($status),"\n";
    }
  
  =head3 Mirroring a Document Tree
  
  With a little more work, you can recursively mirror an entire set of
  linked pages.  Script I.3.2 mirrors the requested document and all
  subdocuments, using the LWP C<HTML::LinkExtor> module to extract all
  the HTML links.
  
    Script I.3.2 mirrorTree.pl
    --------------------------
    #!/usr/local/bin/perl
    
    # File: mirrorTree.pl
    
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI::URL;
    use File::Path;
    use File::Basename;
    %DONE    = ();
    
    my $URL = shift;
    
    $UA     = new LWP::UserAgent;
    $PARSER = HTML::LinkExtor->new();
    $TOP    = $UA->request(HTTP::Request->new(HEAD => $URL));
    $BASE   = $TOP->base;
    
    mirror(URI::URL->new($TOP->request->url));
    
    sub mirror {
        my $url = shift;
    
        # get rid of query string "?" and fragments "#"
        my $path = $url->path;
        my $fixed_url = URI::URL->new ($url->scheme . '://' . $url->netloc . 
$path);
    
        # make the URL relative
        my $rel = $fixed_url->rel($BASE);
        $rel .= 'index.html' if $rel=~m!/$! || length($rel) == 0;
    
        # skip it if we've already done it
        return if $DONE{$rel}++;
    
        # create the directory if it doesn't exist already
        my $dir = dirname($rel);
        mkpath([$dir]) unless -d $dir;
    
        # mirror the document
        my $doc = $UA->mirror($fixed_url,$rel);
        print STDERR "$rel: ",$doc->message,"\n";
        return if $doc->is_error;
    
        # Follow HTML documents
        return unless $rel=~/\.html?$/i;
        my $base = $doc->base;
       
        # pull out the links and call us recursively
        my @links = $PARSER->parse_file("$rel")->links;
        my @hrefs = map { url($_->[2],$base)->abs } @links;
    
        foreach (@hrefs) {
            next unless is_child($BASE,$_);
          mirror($_);
        }
    
    }
    
    sub is_child {
        my ($base,$url) = @_;
        my $rel = $url->rel($base);
        return ($rel ne $url) && ($rel !~ m!^[/.]!);
    }
  
  =head3 Checking for Bad Links
  
  A slight modification of this last script allows you to check an
  entire document hierarchy (your own or someone else's) for bad links.
  The script shown in I.3.3 traverses a document, and checks each of the
  http:, ftp: and gopher: links to see if there's a response at the
  other end.  Links that point to sub-documents are fetched and
  traversed as before, so you can check your whole site in this way.
  
    % find_bad_links http://prego/apache-1.2/
    checking http://prego/apache-1.2/...
    checking http://prego/apache-1.2/manual/...
    checking http://prego/apache-1.2/manual/misc/footer.html...
    checking http://prego/apache-1.2/manual/misc/header.html...
    checking http://prego/apache-1.2/manual/misc/nopgp.html...
    checking 
http://www.yahoo.com/Science/Mathematics/Security_and_Encryption/...
    checking http://www.eff.org/pub/EFF/Policy/Crypto/...
    checking http://www.quadralay.com/www/Crypt/Crypt.html...
    checking http://www.law.indiana.edu/law/iclu.html...
    checking http://bong.com/~brian...
    checking http://prego/apache-1.2/manual/cgi_path.html...
    checking http://www.ics.uci.edu/pub/ietf/http/...
      . 
      . 
      .
    BAD LINKS:
    manual/misc/known_bugs.html : 
http://www.apache.org/dist/patches/apply_to_1.2b6/
    manual/misc/fin_wait_2.html : http://www.freebsd.org/
    manual/misc/fin_wait_2.html : http://www.ncr.com/
    manual/misc/compat_notes.html : http://www.eit.com/
    manual/misc/howto.html : http://www.zyzzyva.com/robots/alert/
    manual/misc/perf.html : http://www.software.hp.com/internet/perf/tuning.html
    manual/misc/perf.html : http://www.qosina.com/~awm/apache/linux-tcp.html
    manual/misc/perf.html : 
http://www.sun.com/sun-on-net/Sun.Internet.Solutions/performance/
    manual/misc/perf.html : http://www.sun.com/solaris/products/siss/
    manual/misc/nopgp.html : 
http://www.yahoo.com/Science/Mathematics/Security_and_Encryption/
    
    152 documents checked
    11 bad links
  
    Script I.3.2 find_bad_links.pl
    ------------------------------
    #!/usr/local/bin/perl
    
    # File: find_bad_links.pl
     
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI::URL;
    use WWW::RobotRules;
    
    %CAN_HANDLE = ('http'=>1,
                'gopher'=>1,
                # 'ftp'=>1,   # timeout problems?
                );
    %OUTCOME = ();
    $CHECKED = $BAD = 0;
    @BAD = ();
    
    my $URL = shift;
    
    $UA     = new LWP::UserAgent;
    $PARSER = HTML::LinkExtor->new();
    $TOP    = $UA->request(HTTP::Request->new(HEAD => $URL));
    $BASE   = $TOP->base;
    
    # handle robot rules
    my $robots = URI::URL->new('robots.txt',$BASE->scheme.'://'.$BASE->netloc);
    my $robots_text = $UA->request(HTTP::Request->new(GET=>$robots))->content;
    $ROBOTRULES = WWW::RobotRules->new;
    $ROBOTRULES->parse($robots->abs,$robots_text);
    
    check_links(URI::URL->new($TOP->request->url));
    if (@BAD) {
        print "\nBAD LINKS:\n";
        print join("\n",@BAD),"\n\n";
    }
    print "$CHECKED documents checked\n",scalar(@BAD)," bad links\n";
    
    sub check_links {
        my $url = shift;
        my $fixed_url = $url;
        $fixed_url =~ s/\#.+$//;
        
        return 1 unless $CAN_HANDLE{$url->scheme};
        
        # check cached outcomes
        return $OUTCOME{$fixed_url} if exists $OUTCOME{$fixed_url};
        
        print STDERR "checking $fixed_url...\n";
        $CHECKED++;
        
        my $rel = $url->rel($BASE) || 'index.html';
        my $child = is_child($BASE,$url);
        $UA->timeout(5);
        my $doc = $d = $UA->request(HTTP::Request->new(($child ? 'GET' : 'HEAD' 
)=>$url));
        $OUTCOME{$fixed_url} = $doc->is_success;
        
        return $OUTCOME{$fixed_url} 
        unless $ROBOTRULES->allowed($fixed_url) 
          && $child && $doc->header('Content-type') eq 'text/html';
        
        # Follow HTML documents
        my $base = $doc->base;
        
        # pull out the links and call us recursively
        my @links = $PARSER->parse($doc->content)->links;
        my @hrefs = map { url($_->[2],$base)->abs } @links;
        
        foreach (@hrefs) {
         next if check_links($_);
         push (@BAD,"$rel : $_");
        }
        1;
    }
    
    sub is_child {
        my ($base,$url) = @_;
        my $rel = $url->rel($base);
        return ($rel ne $url) && ($rel !~ m!^[/.]!);
    }
  
  =head2 Load balancing
  
  You've hit the big time, and your site is getting more hits than you
  ever dreamed of.  Millions, zillions of hits.  What's that?  System
  load just passed 50 and response time is getting kinda' s-l-o-w-w-w?
  
  Perl to the rescue.  Set up several replica Web servers with different
  hostnames and IP addresses.  Run this script on the "main" site and
  watch it round-robin the requests to the replica servers.  It uses
  C<IO::Socket> to listen for incoming requests on port 80.  It then
  changes its privileges to run as nobody.nogroup, just like a real Web
  server.  Next it preforks itself a few times (and you always thought
  preforking was something fancy, didn't you?), and goes into an
  C<accept()> loop.  Each time an incoming session comes in, it forks
  off another child to handle the request.  The child reads the HTTP
  request and issues the an HTTP redirection to send the browser to a
  randomly selected server.
  
  NOTE: Another way to do this is to have multiple "A" records defined
  for your server's hostname and let DNS caching distribute the load.
  
    Script I.4.1: A Load Balancing "Web Server"
    -------------------------------------------
    #!/usr/local/bin/perl
    
    # list of hosts to balance between
    @HOSTS = qw/www1.web.org www2.web.org www3.web.org www4.web.org/;
    
    use IO::Socket;
    $SIG{CHLD} = sub { wait() };
    $ENV{'PATH'}='/bin:/usr/bin';
    chomp($hostname = `/bin/hostname`);
    
    # Listen on port 80
    $sock = IO::Socket::INET->new(Listen  => 5,
                                LocalPort => 80,
                                  LocalAddr => $hostname,
                                  Reuse     => 1,
                                  Proto    => 'tcp');
    
    # become "nobody"
    $nobody  = (getpwnam('nobody'))[2]  || die "nobody is nobody";
    $nogroup = (getgrnam('nogroup'))[2] || die "can't grok nogroup";
    ($<,$() = ($>,$)) = ($nobody,$nogroup); # get rid of root privileges!
    ($\,$/) = ("\r\n","\r\n\r\n");          # CR/LF on output/input
    
    # Go into server mode
    close STDIN; close STDOUT; close STDERR;
    
    # prefork -- gee is that all there is to it?
    fork() && fork() && fork() && fork() && exit 0;
    
    # start accepting connections
    while (my $s = $sock->accept()) {
        do { $s->close; next; } if fork();
        my $request = <$s>;
        redirect($1,$s) if $request=~/(?:GET|POST|HEAD|PUT)\s+(\S+)/;
        $s->flush;
        undef $s;
        exit 0;
    }
    
    sub redirect {
        my ($url,$s) = @_;
        my $host = $HOSTS[rand(@HOSTS)];
        print $s "HTTP/1.0 301 Moved Temporarily";
        print $s "Server: Lincoln's Redirector/1.0";
        print $s "Location: http://${host}${url}";;
        print $s "";
    }
  
  =head2 Torture Testing a Server
  
  Any server written in C suffers the risk of static buffer overflow
  bugs.  In the past, these bugs have led to security compromises and
  Web server breakins.  Script I.2.3 torture tests servers and CGI
  scripts by sending large amounts of random date to them.  If the
  server crashes, it probably contains a buffer overflow bug.
  
  Here's what you see when a server crashes:
  
    % torture.pl -t 1000 -l 5000 http://www.capricorn.com
    torture.pl version 1.0 starting
    Base URL:               http://www.capricorn.com/cgi-bin/search
    Max random data length: 5000
    Repetitions:            1000
    Post:                   0
    Append to path:         0
    Escape URLs:            0
    
    200 OK
    200 OK
    200 OK
    200 OK
    200 OK
    500 Internal Server Error
    500 Could not connect to www.capricorn.com:80
    500 Could not connect to www.capricorn.com:80
    500 Could not connect to www.capricorn.com:80
  
    Script I.5.1: torture tester
    ----------------------------
    #!/usr/local/bin/perl
    
    # file: torture.pl
    # Torture test Web servers and scripts by sending them large arbitrary URLs
    # and record the outcome.
    
    use LWP::UserAgent;
    use URI::Escape 'uri_escape';
    require "getopts.pl";
    
    $USAGE = <<USAGE;
    Usage: $0 -[options] URL
    Torture-test Web servers and CGI scripts
    
    Options:
     -l <integer>  Max length of random URL to send [1024 bytes]
     -t <integer>  Number of times to run the test [1]
     -P            Use POST method rather than GET method
     -p            Attach random data to path rather than query string
     -e            Escape the query string before sending it
    USAGE
    
    $VERSION = '1.0';
    
    # process command line
    &Getopts('l:t:Ppe') || die $USAGE;
    
    # get parameters
    $URL    = shift || die $USAGE;
    $MAXLEN = $opt_l ne '' ? $opt_l : 1024;
    $TIMES  = $opt_t || 1;
    $POST   = $opt_P || 0;
    $PATH   = $opt_p || 0;
    $ESCAPE = $opt_e || 0;
    
    # cannot do both a post and a path at the same time
    $POST = 0 if $PATH;
    
    # create an LWP agent
    my $agent = new LWP::UserAgent;
    
    print <<EOF;
    torture.pl version $VERSION starting
    Base URL:               $URL
    Max random data length: $MAXLEN
    Repetitions:            $TIMES
    Post:                   $POST
    Append to path:         $PATH
    Escape URLs:            $ESCAPE
    
    EOF
    
    
    # Do the test $TIMES times
    while ($TIMES) {
        # create a string of random stuff
        my $garbage = random_string(rand($MAXLEN));
        $garbage = uri_escape($garbage) if $ESCAPE;
        my $url = $URL;
        my $request;
    
        if (length($garbage) == 0) { # if no garbage to add, just fetch URL
           $request = new HTTP::Request ('GET',$url);
        }
    
        elsif ($POST) {         # handle POST request
           my $header = new HTTP::Headers (
                                          Content_Type => 
'application/x-www-form-urlencoded',
                                          Content_Length => length($garbage)
                                          );
          # garbage becomes the POST content
          $request = new HTTP::Request ('POST',$url,$header,$garbage);
          
        } else {                        # handle GET request
          
           if ($PATH) {         # append garbage to the base URL
               chop($url) if substr($url,-1,1) eq '/'; 
               $url .= "/$garbage";
           } else {             # append garbage to the query string
               $url .= "?$garbage";
           }
           
           $request = new HTTP::Request ('GET',$url);
        }
        
        # do the request and fetch the response
        my $response = $agent->request($request);
        
        # print the numeric response code and the message
        print $response->code,' ',$response->message,"\n";
    
    } continue { $TIMES-- }
    
    # return some random data of the requested length
    sub random_string {
        my $length = shift;
        return undef unless $length >= 1;
        return join('',map chr(rand(255)),0..$length-1);
    }
  
  For other load testing tools, have a look at our L<Benchmarking
  section|guide::performance/Essential_Tools>.
  
  =head1 Part II: mod_perl -- Faster Than a Speeding Bullet
  
  I<mod_perl> is Doug MacEachern's embedded Perl for Apache.  With a
  mod_perl-enabled server, there's no tedious waiting around while the
  Perl interpreter fires up, reads and compiles your script.  It's right
  there, ready and waiting.  What's more, once compiled your script
  remains in memory, all charged and raring to go.  Suddenly those
  sluggish Perl CGI scripts race along at compiled C speeds...or so it
  seems.
  
  Most CGI scripts will run unmodified under mod_perl using the
  C<Apache::Registry>x CGI compatability layer.  But that's not the whole
  story.  The exciting part is that mod_perl gives you access to the
  Apache API, letting you get at the innards of the Apache server and
  change its behavior in powerful and interesting ways.  This section
  will give you a feel for the many things that you can do with
  mod_perl.
  
  =head2 Creating Dynamic Pages
  
  This is a ho-hum because you can do it with CGI and with
  C<Apache::Registry>.  Still, it's worth seeing a simple script written
  using the strict mod_perl API so you see what it looks like.  Script
  II.1.1 prints out a little hello world message.
  
  Install it by adding a section like this one to one of the
  configuration files:
  
    <Location /hello/world>
      SetHandler  perl-script
      PerlHandler Apache::Hello
    </Location>
  
    Script II.1.1 Apache::Hello
    ---------------------------
    package Apache::Hello;
    # file: Apache/Hello.pm
    
    use strict vars;
    use Apache::Constants ':common';
    
    sub handler {
        my $r = shift;
        $r->content_type('text/html');
        $r->send_http_header;
        my $host = $r->get_remote_host;
        $r->print(<<END);
    <html>
    <head>
    <title>Hello There</title>
    </head>
    <body>
    <h1>Hello $host</h1>
    Hello to all the nice people at the Perl conference.  Lincoln is
    trying really hard.  Be kind.
    </body>
    </html>
    END
        return OK;
    }
    1;
  
  You can do all the standard CGI stuff, such as reading the query
  string, creating fill-out forms, and so on.  In fact, C<CGI.pm> works
  with mod_perl, giving you the benefit of sticky forms, cookie
  handling, and elegant HTML generation.
  
  =head2 File Filters
  
  This is where the going gets fun.  With mod_perl, you can install a
  I<content handler> that works a lot like a four-letter word
  starrer-outer, but a lot faster.
  
  =head3 Adding a Canned Footer to Every Page
  
  Script II.2.1 adds a canned footer to every HTML file.  The footer
  contains a copyright statement, plus the modification date of the
  file.  You could easily extend this to add other information, such as
  a page hit counter, or the username of the page's owner.
  
  This can be installed as the default handler for all files in a
  particular subdirectory like this:
  
    <Location /footer>
      SetHandler perl-script
      PerlHandler Apache::Footer
    </Location>
  
  
  Or you can declare a new ".footer" extension and arrange for all files
  with this extension to be passed through the footer module:
  
    AddType text/html .footer
    <Files ~ "\.footer$">
       SetHandler  perl-script
       PerlHandler Apache::Footer
    </Files>
  
  
    Script II.2.1 Apache::Footer
    ----------------------------
    package Apache::Footer;
    # file Apache::Footer.pm
    
    use strict vars;
    use Apache::Constants ':common';
    use IO::File;
    
    sub handler {
        my $r = shift;
        return DECLINED unless $r->content_type() eq 'text/html';
        my $file = $r->filename;
        return DECLINED unless $fh=IO::File->new($file);
        my $modtime = localtime((stat($file))[9]);
        my $footer=<<END;
    <hr>
    &copy; 1998 <a href="http://www.ora.com/";>O\'Reilly &amp; Associates</a><br>
    <em>Last Modified: $modtime</em>
    END
    
        $r->send_http_header;
       
        while (<$fh>) {
            s!(</BODY>)!$footer$1!oi;
        } continue {
         $r->print($_);
        }
    
        return OK;
    }
    
    1;
  
  For more customized footer/header handling, you might want to look at
  the C<Apache::Sandwich> module on CPAN.
  
  =head3 Dynamic Navigation Bar
  
  Sick of hand-coding navigation bars in every HTML page?  Less than
  enthused by the Java & JavaScript hacks?  Here's a dynamic navigation
  bar implemented as a server side include.
  
  First create a global configuration file for your site.  The first
  column is the top of each major section.  The second column is the
  label to print in the navigation bar
  
    # Configuration file for the navigation bar
    /index.html             Home
    /new/                   What's New
    /tech/                  Tech Support
    /download/              Download
    /dev/zero               Customer support              
    /dev/null               Complaints
  
  Then, at the top (or bottom) of each HTML page that you want the
  navigation bar to appear on, add this comment:
  
    <!--#NAVBAR-->
  
  Now add C<Apache::NavBar> to your system (Script II.2.2).  This module
  parses the configuration file to create a "navigation bar object".  We
  then call the navigation bar object's C<to_html()> method in order to
  generate the HTML for the navigation bar to display on the current
  page (it will be different for each page, depending on what major
  section the page is in).
  
  The next section does some checking to avoid transmitting the page
  again if it is already cached on the browser. The effective last
  modified time for the page is either the modification time of its HTML
  source code, or the navbar's configuration file modification date,
  whichever is more recent.
  
  The remainder is just looping through the file a section at a time,
  searching for the C<E<lt>!--NAVBAR--E<gt>> comment, and substituting
  the navigation bar HTML.
  
    Script II.2.2 Apache::NavBar
    ----------------------------
    
    package Apache::NavBar;
    # file Apache/NavBar.pm
    
    use strict;
    use Apache::Constants qw(:common);
    use Apache::File ();
    
    my %BARS = ();
    my $TABLEATTS   = 'WIDTH="100%" BORDER=1';
    my $TABLECOLOR  = '#C8FFFF';
    my $ACTIVECOLOR = '#FF0000';
    
    sub handler {
        my $r = shift;
    
        my $bar = read_configuration($r)         || return DECLINED;
        $r->content_type eq 'text/html'          || return DECLINED;
        my $fh = Apache::File->new($r->filename) || return DECLINED;
        my $navbar = $bar->to_html($r->uri);
        
        $r->update_mtime($bar->modified);
        $r->set_last_modified;
        my $rc = $r->meets_conditions;
        return $rc unless $rc == OK;
    
        $r->send_http_header;
        return OK if $r->header_only;
    
        local $/ = "";
        while (<$fh>) {
           s:<!--NAVBAR-->:$navbar:oi;
        } continue { 
           $r->print($_); 
        }
    
        return OK;
    }
    
    # read the navigation bar configuration file and return it as a
    # hash.
    sub read_configuration {
        my $r = shift;
        my $conf_file;
        return unless $conf_file = $r->dir_config('NavConf');
        return unless -e ($conf_file = $r->server_root_relative($conf_file));
        my $mod_time = (stat _)[9];
        return $BARS{$conf_file} if $BARS{$conf_file} 
          && $BARS{$conf_file}->modified >= $mod_time;
        return $BARS{$conf_file} = NavBar->new($conf_file);
    }
    
    package NavBar;
    
    # create a new NavBar object
    sub new {
        my ($class,$conf_file) = @_;
        my (@c,%c);
        my $fh = Apache::File->new($conf_file) || return;
        while (<$fh>) {
           chomp;
           s/^\s+//; s/\s+$//;   #fold leading and trailing whitespace
           next if /^#/ || /^$/; # skip comments and empty lines
           next unless my($url, $label) = /^(\S+)\s+(.+)/;
           push @c, $url;     # keep the url in an ordered array
           $c{$url} = $label; # keep its label in a hash
        }
        return bless {'urls' => [EMAIL PROTECTED],
                     'labels' => \%c,
                     'modified' => (stat $conf_file)[9]}, $class;
    }
    
    # return ordered list of all the URIs in the navigation bar
    sub urls  { return @{shift->{'urls'}}; }
    
    # return the label for a particular URI in the navigation bar
    sub label { return $_[0]->{'labels'}->{$_[1]} || $_[1]; }
    
    # return the modification date of the configuration file
    sub modified { return $_[0]->{'modified'}; }
    
    sub to_html {
        my $self = shift;
        my $current_url = shift;
        my @cells;
        for my $url ($self->urls) {
           my $label = $self->label($url);
           my $is_current = $current_url =~ /^$url/;
           my $cell = $is_current ?
               qq(<FONT COLOR="$ACTIVECOLOR">$label</FONT>)
                   : qq(<A HREF="$url">$label</A>);
           push @cells, 
           qq(<TD CLASS="navbar" ALIGN=CENTER 
BGCOLOR="$TABLECOLOR">$cell</TD>\n);
        }
        return qq(<TABLE $TABLEATTS><TR>@cells</TR></TABLE>\n);
    }
    
    
    1;
    __END__
  
   <Location />
     SetHandler  perl-script
     PerlHandler Apache::NavBar
     PerlSetVar  NavConf etc/navigation.conf
   </Location>
  
  C<Apache::NavBar> is available on the CPAN, with further improvements.
  
  =head3 On-the-Fly Compression
  
  WU-FTP has a great feature that automatically gzips a file if you
  fetch it by name with a I<.gz> extension added.  Why can't Web servers
  do that trick?  With Apache and mod_perl, you can.
  
  Script II.2.4 is a content filter that automatically gzips everything
  retrieved from a particular directory and adds the "gzip"
  C<Content-Encoding> header to it.  Unix versions of Netscape Navigator
  will automatically recognize this encoding type and decompress the
  file on the fly.  Windows and Mac versions don't.  You'll have to save
  to disk and decompress, or install the WinZip plug-in.  Bummer.
  
  The code uses the C<Compress::Zlib> module, and has to do a little
  fancy footwork (but not too much) to create the correct gzip header.
  You can extend this idea to do on-the-fly encryption, or whatever you
  like.
  
  Here's the configuration entry you'll need.  Everything in the
  I</compressed> directory will be compressed automagically.
  
    <Location /compressed>
       SetHandler  perl-script
       PerlHandler Apache::GZip
    </Location>
  
    Script II.2.3: Apache::GZip
    ---------------------------
    package Apache::GZip;
    #File: Apache::GZip.pm
    
    use strict vars;
    use Apache::Constants ':common';
    use Compress::Zlib;
    use IO::File;
    use constant GZIP_MAGIC => 0x1f8b;
    use constant OS_MAGIC => 0x03;
    
    sub handler {
        my $r = shift;
        my ($fh,$gz);
        my $file = $r->filename;
        return DECLINED unless $fh=IO::File->new($file);
        $r->header_out('Content-Encoding'=>'gzip');
        $r->send_http_header;
        return OK if $r->header_only;
    
        tie *STDOUT,'Apache::GZip',$r;
        print($_) while <$fh>;
        untie *STDOUT;
        return OK;
    }
    
    sub TIEHANDLE {
        my($class,$r) = @_;
        # initialize a deflation stream
        my $d = deflateInit(-WindowBits=>-MAX_WBITS()) || return undef;
    
        # gzip header -- don't ask how I found out
        $r->print(pack("nccVcc",GZIP_MAGIC,Z_DEFLATED,0,time(),0,OS_MAGIC));
    
        return bless { r   => $r,
                       crc =>  crc32(undef),
                       d   => $d,
                       l   =>  0 
                     },$class;
    }
    
    sub PRINT {
        my $self = shift;
        foreach (@_) {
          # deflate the data
          my $data = $self->{d}->deflate($_);
          $self->{r}->print($data);
          # keep track of its length and crc
          $self->{l} += length($_);
          $self->{crc} = crc32($_,$self->{crc});
        }
    }
    
    sub DESTROY {
       my $self = shift;
       
       # flush the output buffers
       my $data = $self->{d}->flush;
       $self->{r}->print($data);
       
       # print the CRC and the total length (uncompressed)
       $self->{r}->print(pack("LL",@{$self}{qw/crc l/}));
    }
     
    1;
  
  For some alternatives that are being maintained, you might want to
  look at the C<Apache::Compress> and C<Apache::GzipChain> modules on
  CPAN, which can handle the output of any handler in a chain.
  
  By adding a URI translation handler, you can set things up so that a
  remote user can append a I<.gz> to the end of any URL and the file we
  be delivered in compressed form.  Script II.2.4 shows the translation
  handler you need.  It is called during the initial phases of the
  request to make any modifications to the URL that it wishes.  In this
  case, it removes the I<.gz> ending from the filename and arranges for
  C<Apache:GZip> to be called as the content handler.  The
  C<lookup_uri()> call is used to exclude anything that has a special
  handler already defined (such as CGI scripts), and actual gzip files.
  The module replaces the information in the request object with
  information about the real file (without the C<.gz>), and arranges for
  C<Apache::GZip> to be the content handler for this file.
  
  You just need this one directive to activate handling for all URLs at
  your site:
  
    PerlTransHandler Apache::AutoGZip
  
    Script II.2.4: Apache::AutoGZip
    -------------------------------
    package Apache::AutoGZip;
    
    use strict 'vars';
    use Apache::Constants qw/:common/;
    
    sub handler {
        my $r = shift;
    
        # don't allow ourselves to be called recursively
        return DECLINED unless $r->is_initial_req;
    
        # don't do anything for files not ending with .gz
        my $uri = $r->uri;
        return DECLINED unless $uri=~/\.gz$/; 
        my $basename = $`;
    
        # don't do anything special if the file actually exists
        return DECLINED if -e $r->lookup_uri($uri)->filename;
    
        # look up information about the file
        my $subr = $r->lookup_uri($basename);
        $r->uri($basename);
        $r->path_info($subr->path_info);
        $r->filename($subr->filename);
    
        # fix the handler to point to Apache::GZip;
        my $handler = $subr->handler;
        unless ($handler) {
          $r->handler('perl-script');
          $r->push_handlers('PerlHandler','Apache::GZip');
        } else {
          $r->handler($handler);
        }
        return OK;
    }
    
    1;
  
  =head2 Access Control
  
  Access control, as opposed to authentication and authorization, is
  based on something the user "is" rather than something he "knows".
  The "is" is usually something about his browser, such as its IP
  address, hostname, or user agent.  Script II.3.1 blocks access to the
  Web server for certain User Agents (you might use this to block
  impolite robots).
  
  C<Apache::BlockAgent> reads its blocking information from a "bad
  agents" file, which contains a series of pattern matches.  Most of the
  complexity of the code comes from watching this file and recompiling
  it when it changes.  If the file doesn't change, it's is only read
  once and its patterns compiled in memory, making this module fast.
  
  Here's an example bad agents file:
  
     ^teleport pro\/1\.28
     ^nicerspro
     ^mozilla\/3\.0 \(http engine\)
     ^netattache
     ^crescent internet toolpak http ole control v\.1\.0
     ^go-ahead-got-it
     ^wget
     ^devsoft's http component v1\.0
     ^www\.pl
     ^digout4uagent
  
  A configuration entry to activate this blocker looks like this.  In
  this case we're blocking access to the entire site.  You could also
  block access to a portion of the site, or have different bad agents
  files associated with different portions of the document tree.
  
    <Location />
      PerlAccessHandler Apache::BlockAgent
      PerlSetVar BlockAgentFile /home/www/conf/bad_agents.txt
    </Location>
  
  
    Script II.3.1: Apache::BlockAgent
    ---------------------------------
    package Apache::BlockAgent;
    # block browsers that we don't like
    
    use strict 'vars';
    use Apache::Constants ':common';
    use IO::File;
    my %MATCH_CACHE;
    my $DEBUG = 0;
    
    sub handler {
        my $r = shift;
         
        return DECLINED unless my $patfile = $r->dir_config('BlockAgentFile');
        return FORBIDDEN unless my $agent = $r->header_in('User-Agent');
        return SERVER_ERROR unless my $sub = get_match_sub($r,$patfile);
        return OK if $sub->($agent);
        $r->log_reason("Access forbidden to agent $agent",$r->filename);
        return FORBIDDEN;
    }
     
    # This routine creates a pattern matching subroutine from a
    # list of pattern matches stored in a file.
    sub get_match_sub {
        my ($r,$filename) = @_;
        my $mtime = -M $filename;
      
        # try to return the sub from cache
        return $MATCH_CACHE{$filename}->{'sub'} if
            $MATCH_CACHE{$filename} && 
               $MATCH_CACHE{$filename}->{'mod'} <= $mtime;
     
        # if we get here, then we need to create the sub
        return undef unless my $fh = new IO::File($filename);
        chomp(my @pats = <$fh>); # get the patterns into an array
        my $code = "sub { \$_ = shift;\n";
        foreach (@pats) {
            next if /^#/
            $code .= "return undef if /$_/i;\n";
        }
        $code .= "1; }\n";     
        warn $code if $DEBUG;
     
        # create the sub, cache and return it
        my $sub = eval $code;
        unless ($sub) {
            $r->log_error($r->uri,": ",$@);
            return undef;
        }
        @{$MATCH_CACHE{$filename}}{'sub','mod'}=($sub,$modtime);
        return $MATCH_CACHE{$filename}->{'sub'};
    }
     
    1;
  
  =head2 Authentication and Authorization
  
  Thought you were stuck with authentication using text, DBI and DBM
  files?  mod_perl opens the authentication/authorization API wide.  The
  two phases are authentication, in which the user has to prove who he
  or she is (usually by providing a username and password), and
  authorization, in which the system decides whether this user has
  sufficient privileges to view the requested URL.  A scheme can
  incorporate authentication and authorization either together or
  singly.
  
  =head3 Authentication with NIS
  
  If you keep Unix system passwords in I</etc/passwd> or distribute them
  by NIS (not NIS+) you can authenticate Web users against the system
  password database.  (It's not a good idea to do this if the system is
  connected to the Internet because passwords travel in the clear, but
  it's OK for trusted intranets.)
  
  Script II.4.1 shows how the C<Apache::AuthSystem> module fetches the
  user's name and password, compares it to the system password, and
  takes appropriate action.  The C<getpwnam()> function operates either
  on local files or on the NIS database, depending on how the server
  host is configured.  WARNING: the module will fail if you use a shadow
  password system, since the Web server doesn't have root privileges.
  
  In order to activate this system, put a configuration directive like
  this one in access.conf:
  
    <Location /protected>
      AuthName Test
      AuthType Basic
      PerlAuthenHandler Apache::AuthSystem
      require valid-user
    </Location>
  
    Script II.4.1: Apache::AuthSystem
    ---------------------------------
    package Apache::AuthSystem;
    # authenticate users on system password database
    
    use strict;
    use Apache::Constants ':common';
    
    sub handler {
       my $r = shift;
    
       my($res, $sent_pwd) = $r->get_basic_auth_pw;
       return $res if $res != OK;
       
       my $user = $r->connection->user;
       my $reason = "";
    
       my($name,$passwd) = getpwnam($user);
       if (!$name) {
           $reason = "user does not have an account on this system";
       } else {
           $reason = "user did not provide correct password"
             unless $passwd eq crypt($sent_pwd,$passwd);
       }
    
       if($reason) {
          $r->note_basic_auth_failure;
          $r->log_reason($reason,$r->filename);
          return AUTH_REQUIRED;
       }
    
       return OK;
    }
    
    1;
  
  There are modules doing equivalent things on CPAN:
  C<Apache::AuthenPasswd> and C<Apache::AuthxPasswd>.
  
  =head3 Anonymous Authentication
  
  Here's a system that authenticates users the way anonymous FTP does.
  They have to enter a name like "Anonymous" (configurable) and a
  password that looks like a valid e-mail address.  The system rejects
  the username and password unless they are formatted correctly.
  
  In a real application, you'd probably want to log the password
  somewhere for posterity.  Script II.4.2 shows the code for
  C<Apache::AuthAnon>.  To activate it, create a I<httpd.conf> section
  like this one:
  
    <Location /protected>
          AuthName Anonymous
          AuthType Basic
          PerlAuthenHandler Apache::AuthAnon
          require valid-user
    
          PerlSetVar Anonymous anonymous|anybody
    </Location>
  
  
    Script II.4.2: Anonymous Authentication
    ---------------------------------------
    package Apache::AuthAnon;
    
    use strict;
    use Apache::Constants ':common';
    
    my $email_pat = '[EMAIL PROTECTED]';
    my $anon_id  = "anonymous";
    
    sub handler {
        my $r = shift;
    
        my($res, $sent_pwd) = $r->get_basic_auth_pw;
        return $res if $res != OK;
    
        my $user = lc $r->connection->user;
        my $reason = "";
    
        my $check_id = $r->dir_config("Anonymous") || $anon_id;
    
        unless($user =~ /^$check_id$/i) {
          $reason = "user did not enter a valid anonymous username";
        }
    
        unless($sent_pwd =~ /$email_pat/o) {
          $reason = "user did not enter an email address password";
        } 
    
        if($reason) {
          $r->note_basic_auth_failure;
          $r->log_reason($reason,$r->filename);
          return AUTH_REQUIRED;
        }
    
        $r->notes(AuthAnonPassword => $sent_pwd);
    
        return OK;
    }
    
    1;
  
  
  =head3 Gender-Based Authorization
  
  After authenticating, you can authorize.  The most familiar type of
  authorization checks a group database to see if the user belongs to
  one or more privileged groups.  But authorization can be anything you
  dream up.
  
  Script II.4.3 shows how you can authorize users by their gender (or at
  least their I<apparent> gender, by checking their names with Jon
  Orwant's C<Text::GenderFromName> module.  This must be used in
  conjunction with an authentication module, such as one of the standard
  Apache modules or a custom one.
  
  This configuration restricts access to users with feminine names,
  except for the users "Webmaster" and "Jeff", who are allowed access.
  
    <Location /ladies_only>
      AuthName "Ladies Only"
      AuthType Basic
      AuthUserFile /home/www/conf/users.passwd
      PerlAuthzHandler  Apache::AuthzGender
      require gender F            # allow females
      require user Webmaster Jeff # allow Webmaster or Jeff
    </Location>
  
  The script uses a custom error response to explain why the user was
  denied admittance.  This is better than the standard "Authorization
  Failed" message.
  
    Script II.4.3: Apache::AuthzGender
    ----------------------------------
    package Apache::AuthzGender;
    
    use strict;
    use Text::GenderFromName;
    use Apache::Constants ":common";
    
    my %G=('M'=>"male",'F'=>"female");
    
    sub handler {
        my $r = shift;
       
        return DECLINED unless my $requires = $r->requires;
        my $user = lc($r->connection->user);
        substr($user,0,1)=~tr/a-z/A-Z/;
        my $guessed_gender = uc(gender($user)) || 'M';
    
        my $explanation = <<END;
    <html><head><title>Unauthorized</title></head><body>
    <h1>You Are Not Authorized to Access This Page</h1>
    Access to this page is limited to:
    <ol>
    END
    
        foreach (@$requires) {
          my ($requirement,@rest ) = split(/\s+/,$_->{requirement});
          if (lc $requirement eq 'user') {
               foreach (@rest) { return OK if $user eq $_; }
               $explanation .= "<LI>Users @rest.\n";
          } elsif (lc $requirement eq 'gender') {
               foreach (@rest) { return OK if $guessed_gender eq uc $_; }
               $explanation .= "<LI>People of the @[EMAIL PROTECTED] 
persuasion.\n";
          } elsif (lc $requirement eq 'valid-user') {
               return OK;
          }
        }
    
        $explanation .= "</OL></BODY></HTML>";
       
        $r->custom_response(AUTH_REQUIRED,$explanation);
        $r->note_basic_auth_failure;
        $r->log_reason("user $user: not authorized",$r->filename);
        return AUTH_REQUIRED;
    }
  
    1;
  
  C<Apache::AuthzGender> is available from the CPAN.
  
  =head2 Proxy Services
  
  mod_perl gives you access to Apache's ability to act as a Web proxy.
  You can intervene at any step in the proxy transaction to modify the
  outgoing request (for example, stripping off headers in order to
  create an anonymizing proxy) or to modify the returned page.
  
  =head3 A Banner Ad Blocker
  
  Script II.5.1 shows the code for a banner-ad blocker written by Doug
  MacEachern.  It intercepts all proxy requests, substituting its own
  content handler for the default.  The content handler uses the LWP
  library to fetch the requested document. If the retrieved document is
  an image, and its URL matches the pattern (ads?|advertisement|banner),
  then the content of the image is replaced with a dynamically-generated
  GIF that reads "Blocked Ad". The generated image is exactly the same
  size as the original, preserving the page layout.  Notice how the
  outgoing headers from the Apache request object are copied to the LWP
  request, and how the incoming LWP response headers are copied back to
  Apache.  This makes the transaction nearly transparent to Apache and
  to the remote server.
  
  In addition to LWP you'll need C<GD.pm> and C<Image::Size> to run this
  module.  To activate it, add the following line to the configuration
  file:
  
    PerlTransHandler Apache::AdBlocker
  
  Then configure your browser to use the server to proxy all its HTTP
  requests.  Works like a charm!  With a little more work, and some help
  from the C<ImageMagick> module, you could adapt this module to
  quiet-down animated GIFs by stripping them of all but the very first
  frame.
  
    Script II.5.1: Apache::AdBlocker
    --------------------------------
    
    package Apache::AdBlocker;
    
    use strict;
    use vars qw(@ISA $VERSION);
    use Apache::Constants qw(:common);
    use GD ();
    use Image::Size qw(imgsize);
    use LWP::UserAgent ();
    
    @ISA = qw(LWP::UserAgent);
    $VERSION = '1.00';
    
    my $UA = __PACKAGE__->new;
    $UA->agent(join "/", __PACKAGE__, $VERSION);
    
    my $Ad = join "|", qw{ads? advertisement banner};
    
    sub handler {
        my($r) = @_;
        return DECLINED unless $r->proxyreq;
        $r->handler("perl-script"); #ok, let's do it
        $r->push_handlers(PerlHandler => \&proxy_handler);
        return OK;
    }
    
    sub proxy_handler {
        my($r) = @_;
     
        my $request = HTTP::Request->new($r->method, $r->uri);
        
        $r->headers_in->do(sub { 
           $request->header(@_); 
        });
    
        # copy POST data, if any
        if($r->method eq 'POST') {
           my $len = $r->header_in('Content-length');
           my $buf;
           $r->read($buf, $len);
           $request->content($buf);
           $request->content_type($r->content_type);
        }
     
        my $response = $UA->request($request);
        $r->content_type($response->header('Content-type'));
    
        #feed response back into our request_rec*
        $r->status($response->code);
        $r->status_line(join " ", $response->code, $response->message);
        $response->scan(sub {
           $r->header_out(@_);
        });
    
        if ($r->header_only) {
           $r->send_http_header();
           return OK;
        }
    
        my $content = \$response->content;
        if($r->content_type =~ /^image/ and $r->uri =~ /\b($Ad)\b/i) {
           block_ad($content);
           $r->content_type("image/gif");
        }
     
        $r->content_type('text/html') unless $$content;
        $r->send_http_header;
        $r->print($$content || $response->error_as_HTML);
    
        return OK;
    }
    
    sub block_ad {
        my $data = shift;
        my($x, $y) = imgsize($data);
     
        my $im = GD::Image->new($x,$y);
     
        my $white = $im->colorAllocate(255,255,255);
        my $black = $im->colorAllocate(0,0,0);       
        my $red = $im->colorAllocate(255,0,0);      
    
        $im->transparent($white);
        $im->string(GD::gdLargeFont(),5,5,"Blocked Ad",$red);
        $im->rectangle(0,0,$x-1,$y-1,$black);
    
        $$data = $im->gif;
    }
    
    1;
  
  Another way of doing this module would be to scan all proxied HTML
  files for C<E<lt>imgE<gt>> tags containing one of the verboten URLs,
  then replacing the C<src> attribute with a transparent GIF of our own.
  However, unless the C<E<lt>imgE<gt>> tag contained C<width> and
  C<height> attributes, we wouldn't be able to return a GIF of the
  correct size -- unless we were to go hunting for the GIF with LWP, in
  which case we might as well do it this way.
  
  =head2 Customized Logging
  
  After Apache handles a transaction, it passes all the information
  about the transaction to the log handler.  The default log handler
  writes out lines to the log file.  With mod_perl, you can install your
  own log handler to do customized logging.
  
  =head3 Send E-Mail When a Particular Page Gets Hit
  
  Script II.6.1 installs a log handler which watches over a page or set
  of pages.  When someone fetches a watched page, the log handler sends
  off an e-mail to notify someone (probably the owner of the page) that
  the page has been read.
  
  To activate the module, just attach a C<PerlLogHandler> to the
  C<E<lt>LocationE<gt>> or C<E<lt>FilesE<gt>> you wish to watch.  For
  example:
  
     <Location /~lstein>
        PerlLogHandler Apache::LogMail
        PerlSetVar mailto [EMAIL PROTECTED]
     </Location>
  
  The "mailto" directive specifies the name of the recipient(s) to
  notify.
  
  
    Script II.6.1: Apache::LogMail
    ------------------------------
    package Apache::LogMail;
    use Apache::Constants ':common';
    
    sub handler {
        my $r = shift;
        my $mailto = $r->dir_config('mailto');
        return DECLINED unless $mailto
        my $request = $r->the_request;
        my $uri = $r->uri;
        my $agent = $r->header_in("User-agent");
        my $bytes = $r->bytes_sent;
        my $remote = $r->get_remote_host;
        my $status = $r->status_line;
        my $date = localtime;
        unless (open (MAIL,"|/usr/lib/sendmail -oi -t")) {
           $r->log_error("Couldn't open mail: $!");
           return DECLINED;
        }
        print MAIL <<END;
    To: $mailto
    From: Mod Perl <webmaster>
    Subject: Somebody looked at $uri
    
    At $date, a user at $remote looked at
    $uri using the $agent browser.  
    
    The request was $request, 
    which resulted returned a code of $status.
    
    $bytes bytes were transferred.
    END
        close MAIL;
        return OK;
    }
    1;
  
  =head3 Writing Log Information Into a Relational Database
  
  Coming full circle, Script II.6.2 shows a module that writes log
  information into a DBI database.  The idea is similar to Script I.1.9,
  but there's now no need to open a pipe to an external process.  It's
  also a little more efficient, because the log data fields can be
  recovered directly from the Apache request object, rather than parsed
  out of a line of text.  Another improvement is that we can set up the
  Apache configuration files so that only accesses to certain
  directories are logged in this way.
  
  To activate, add something like this to your configuration file:
   
    PerlLogHandler Apache::LogDBI
  
  Or, to restrict special logging to accesses of files in below the URL
  "/lincoln_logs" add this:
  
    <Location /lincoln_logs>
      PerlLogHandler Apache::LogDBI  
    </Location>
  
    Script II.6.2: Apache::LogDBI
    -----------------------------
    package Apache::LogDBI;
    use Apache::Constants ':common';
    
    use strict 'vars';
    use vars qw($DB $STH);
    use DBI;
    use POSIX 'strftime';
    
    use constant DSN       => 'dbi:mysql:www';
    use constant DB_TABLE  => 'access_log';
    use constant DB_USER   => 'nobody';
    use constant DB_PASSWD => '';
    
    $DB = DBI->connect(DSN,DB_USER,DB_PASSWD) || die DBI->errstr;
    $STH = $DB->prepare("INSERT INTO ${\DB_TABLE} VALUES(?,?,?,?,?,?,?,?,?)") 
         || die $DB->errstr;
    
    sub handler {
        my $r = shift;
        my $date    = strftime('%Y-%m-%d %H:%M:%S',localtime);
        my $host    = $r->get_remote_host;
        my $method  = $r->method;
        my $url     = $r->uri;
        my $user    = $r->connection->user;
        my $referer = $r->header_in('Referer');
        my $browser = $r->header_in("User-agent");
        my $status  = $r->status;
        my $bytes   = $r->bytes_sent;
        $STH->execute($date,$host,$method,$url,$user,
                      $browser,$referer,$status,$bytes);
        return OK;
    }
    
    1;
  
  There are other alternatives which are more actively maintained
  available from the CPAN: C<Apache::DBILogger> and
  C<Apache::DBILogConfig>.
  
  =head1 Conclusion
  
  These tricks illustrate the true power of mod_perl; not only were Perl
  and Apache good friends from the start, thanks to Perl's excellent
  text-handling capacity, but when mod_perl is used, your complete
  access to the Apache API gives you unprecendented power in dynamic web
  serving.
  
  To find more tips and tricks, look for modules on the CPAN, look
  through the mod_perl L<documentation|docs::index>, and also in the
  following books by Lincoln Stein:
  
  =over 4
  
  =item "How to Set Up and Maintain a Web Site"
  
  General introduction to Web site care and feeding, with an emphasis on
  Apache.  Addison-Wesley 1997.
  
  Companion Web site at http://www.genome.wi.mit.edu/WWW/
  
  =item "Web Security, a Step-by-Step Reference Guide"
  
  How to keep your Web site free from thieves, vandals, hooligans and
  other yahoos.  Addison-Wesley 1998.
  
  Companion Web site at http://www.w3.org/Security/Faq/
  
  =item "The Official Guide to Programming with CGI.pm"
  
  Everything I know about CGI.pm (and some things I don't!).  John Wiley
  & Sons, 1998.
  
  Companion Web site at http://www.wiley.com/compbooks/stein/
  
  =item "Writing Apache Modules in Perl and C"
  
  Co-authored with Doug MacEachern. O'Reilly & Associates.
  
  Companion Web site at http://www.modperl.com/
  
  =item WebTechniques Columns
  
  I write a monthly column for WebTechniques magazine.  You can find
  back-issues and reprints at http://www.web-techniques.com/
  
  =item The Perl Journal Columns
  
  I write a quarterly column for TPJ. Source code listings are available
  at http://www.tpj.com/
  
  =back
  
  
  =head1 Maintainers
  
  Maintainer is the person(s) you should contact with updates,
  corrections and patches.
  
  Per Einar Ellefsen E<lt>per.einar (at) skynet.beE<gt>
  
  =head1 Authors
  
  =over
  
  =item * Lincoln Stein E<lt>lstein (at) cshl.orgE<gt>
  
  =back
  
  Only the major authors are listed above. For contributors see the
  Changes file.
  
  =cut
  
  
  
  1.1                  
modperl-docs/src/docs/tutorials/scale_etoys/app_servers.png
  
        <<Binary file>>
  
  
  1.1                  
modperl-docs/src/docs/tutorials/scale_etoys/code_structure.png
  
        <<Binary file>>
  
  
  1.1                  modperl-docs/src/docs/tutorials/scale_etoys/etoys.pod
  
  Index: etoys.pod
  ===================================================================
  =head1 NAME
  
  Building a Large-Scale E-commerce site with Apache and mod_perl
  
  =head1 Description
  
  mod_perl's speed and Perl's flexibility make them very attractive for
  large-scale sites. Through careful planning from the start, powerful
  application servers can be created for sites requiring excellent
  response times for dynamic content, such as EToys, all by using
  mod_perl.
  
  This paper was first presented at ApacheCon 2001 in Santa Clara,
  California, and was later published by O'Reilly & Associates' Perl.com
  site: http://perl.com/pub/a/2001/10/17/etoys.html
  
  =head1 Common Myths
  
  When it comes to building a large e-commerce web site, everyone is
  full of advice. Developers will tell you that only a site built in C++
  or Java (depending on which they prefer) can scale up to handle heavy
  traffic. Application server vendors will insist that you need a
  packaged all-in-one solution for the software. Hardware vendors will
  tell you that you need the top-of-the-line mega-machines to run a
  large site. This is a story about how we built a large e-commerce site
  using mainly open source software and commodity hardware. We did it,
  and you can do it too.
  
  =head1 Perl Saves
  
  Perl has long been the preferred language for developing CGI scripts.
  It combines supreme flexibility with rapid development. I<Programming
  Perl> is still one of O'Reilly's top selling technical books, and
  community support abounds. Lately though, Perl has come under attack
  from certain quarters. Detractors claim that it's too slow for serious
  development work and that code written in Perl is too hard to
  maintain.
  
  The mod_perl Apache module changes the whole performance picture for
  Perl. Embedding a Perl interpreter inside of Apache provides
  performance equivalent to Java servlets, and makes it an excellent
  choice for building large sites. Through the use of Perl's
  object-oriented features and some basic coding rules, you can build a
  set of code that is a pleasure to maintain, or at least no worse than
  other languages.
  
  =head2 Roll Your Own Application Server
  
  When you combine Apache, mod_perl, and open source code available from
  CPAN (the Comprehensive Perl Archive Network), you get a set of
  features equivalent to a commercial application server:
  
  =over
  
  =item *
  
  Session handling
  
  =item *
  
  Load balancing
  
  =item *
  
  Persistent database connections
  
  =item *
  
  Advanced HTML templating
  
  =item *
  
  Security
  
  =back
  
  You also get some things you won't get from a commercial product, like
  a direct line to the core development team through the appropriate
  mailing list, and the ability to fix problems yourself instead of
  waiting for a patch. Moreover, every part of the system is under your
  control, making you limited only by your team's abilities.
  
  =head1 Case Study: eToys.com
  
  When we first arrived at eToys in 1999, we found a situation that is
  probably familiar to many who have joined a growing startup Internet
  company. The system was based on CGI scripts talking to a MySQL
  database. Static file serving and dynamic content generation were
  sharing resources on the same machines. The CGI code was largely
  written in a Perl4-ish style and not as modular as it could be, which
  was not surprising since most of it was built as quickly as possible
  by a very small team.
  
  Our major task was to figure out how to get this system to scale large
  enough to handle the expected Christmas traffic. The toy business is
  all about seasonality, and the difference between the peak selling
  season and the rest of the year is enormous. The site had barely
  survived the previous Christmas, and the MySQL database didn't look
  like it could scale much further.
  
  The call had already been made to switch to Oracle, and a DBA team was
  in place. We didn't have enough time to do a re-design of the
  software, so we had to scramble to put in place whatever performance
  improvements we could finish by Christmas.
  
  =head2 Apache::PerlRun to the Rescue
  
  C<Apache::PerlRun> is a module that exists to smooth the transition
  between basic CGI and mod_perl. It emulates a CGI environment, and
  provides some (but not all) of the performance benefits associated
  with code written for mod_perl. Using this module and the persistent
  database connections provided by C<Apache::DBI>, we were able to do a
  basic port to mod_perl and Oracle in time for Christmas, and combined
  with some new hardware we were ready to face the Christmas rush.
  
  The peak traffic lasted for eight weeks, most of which were spent
  frantically fixing things or nervously waiting for something else to
  break. Nevertheless, we made it through. During that time we collected
  the following statistics:
  
  =over
  
  =item *
  
  60 - 70,000 sessions/hour
  
  =item *
  
  800,000 page views/hour
  
  =item *
  
  7,000 orders/hour
  
  =back
  
  According to Media Metrix, we were the third most heavily trafficked
  e-commerce site, right behind eBay and Amazon.
  
  =head2 Planning the New Architecture
  
  It was clear that we would need to do a re-design for 2000. We had
  reached the limits of the current system and needed to tackle some of
  the harder problems that we had been holding off on.
  
  Goals for the new system included moving away from off-line page
  generation. The old system had been building HTML pages for every
  product and product category on the site in a batch job and dumping
  them out as static files. This was very effective when we had a small
  database of products since the static files gave such good
  performance, but we had recently added a children's bookstore to the
  site, which increased the size of our product database by an order of
  magnitude and made the time required to generate every page
  prohibitive. We needed a strategy that would only require us to build
  pages that customers were actually interested in and would still
  provide solid performance.
  
  We also wanted to re-do the database schema for more flexibility, and
  structure the code in a more modular way that would make it easier for
  a team to share the development work without stepping on each other. We
  knew that the new codebase would have to be flexible enough to support
  a continuously evolving set of features.
  
  Not all of the team had significant experience with object-oriented
  Perl, so we brought in Randal Schwartz and Damian Conway to do
  training sessions with us. We created a set of coding standards,
  drafted a design, and built our system.
  
  =head1 Surviving Christmas 2000
  
  Our capacity planning was for three times the traffic of the previous
  peak. That's what we tested to, and that's about what we got:
  
  =over
  
  =item *
  
  200,000+ sessions/hour
  
  =item *
  
  2.5 million+ page views/hour
  
  =item *
  
  20,000+ orders/hour
  
  =back
  
  The software survived, although one of the routers went up in smoke.
  Once again, we were rated the third most highly trafficked e-commerce
  site for the season.
  
  =head2 The Architecture
  
  The machine strategy for the system is a fairly common one: low-cost
  Intel-based servers with a load-balancer in front of them, and big
  iron for the database.
  
  =for html
  <p><b>Figure 1.</b> Server layout</p>
  <img src="machine_layout.png" alt="Machine Layout" width="344"
  height="472">
  
  Like many commercial packages, we have separate systems for the
  front-end web servers (which we call proxy servers) and the
  application servers that generate the dynamic content. Both the proxy
  servers and the application servers are load-balanced using dedicated
  hardware from f5 Networks.
  
  We chose to run Linux on our proxy and application servers, a common
  platform for mod_perl sites. The ease of remote administration under
  Linux made the clustered approach possible. Linux also provided solid
  security features and automated build capabilities to help with adding
  new servers.
  
  The database servers were IBM NUMA-Q machines, which ran on the
  DYNIX/ptx operating system..
  
  =head2 Proxy Servers
  
  The proxy servers ran a slim build of Apache, without mod_perl. They
  have several standard Apache modules installed, in addition to our own
  customized version of mod_session, which assigned session cookies.
  Because the processes were so small, we could run up to 400 Apache
  children per machine. These servers handled all image requests
  themselves, and passed page requests on to the application
  servers. They communicated with the app servers using standard HTTP
  requests, and cached the page results when appropriate headers are
  sent from the app servers. The cached pages were stored on a shared
  NFS partition of a Network Appliance filer. Serving pages from the
  cache was nearly as fast as serving static files.
  
  This kind of reverse-proxy setup is a commonly recommended approach
  when working with mod_perl, since it uses the lightweight proxy
  processes to send out the content to clients (who may be on slow
  connections) and frees the resource-intensive mod_perl processes to
  move on to the next request. For more information on why this
  configuration is helpful, see the L<strategy section in the users
  guide|guide::strategy>.
  
  =for html
  <p><b>Figure 2.</b> Proxy Server Setup</p>
  <img src="proxy_servers.png" alt="Proxy Server Setup" width="356"
  height="211">
  
  =head2 Application Servers
  
  The application servers ran mod_perl, and very little else.  They had
  a local cache for Perl objects, using Berkeley DB.  The web
  applications ran there, and shared resources like HTML templates were
  mounted over NFS from the NetApp filer.  Because they did the heavy
  lifting in this setup, these machines were somewhat beefy, with dual
  CPUs and 1GB of RAM each.
  
  =for html 
  <p><b>Figure 3.</b> Application servers layout</p>
  <img src="app_servers.png" alt="Application servers layout"
  width="444" height="387">
  
  =head2 Search servers
  
  There was a third set of machines dedicated to handling searches.
  Since searching was such a large percentage of overall traffic, it was
  worthwhile to dedicate resources to it and take the load off the
  application servers and database.
  
  The software on these boxes was a multi-threaded daemon which we
  developed in-house using C++.  The application servers talked to the
  search servers using a Perl module.  The search daemon accepted a set
  of search conditions and returned a sorted list of object IDs of the
  products whose data fits those conditions.  Then the application
  servers looked up the data to display these products from the
  database.  The search servers knew nothing about HTML or the web
  interface.
  
  This approach of finding the IDs with the search server and then
  retrieving the object data may sound like a performance hit, but in
  practice the object data usually came from the application server's
  cache rather than the database.  This design allowed us to minimize
  the duplicated data between the database and the search servers,
  making it easier and faster to refresh the index.  It also let us
  reuse the same Perl code for retrieving product objects from the
  database, regardless of how they were found.
  
  The daemon used a standard inverted word list approach to searching.
  The index was periodically built from the relevant data in
  Oracle. There are modules on CPAN which implement this approach,
  including C<Search::InvertedIndex> and C<DBIx::FullTextSearch>. We
  chose to write our own because of the very tight performance
  requirements on this part of the system, and because we had an
  unusually complex set of sorting rules for the returned IDs.
  
  =for html
  <p><b>Figure 4.</b> Search server layout</p>
  <img src="search_servers.png" alt="Search server layout"
  width="567" height="269">
  
  =head1 Load Balancing and Failover
  
  We took pains to make sure that we would be able to provide load
  balancing among nodes of the cluster and fault tolerance in case one
  or more nodes failed.  The proxy servers were balanced using a random
  selection algorithm.  A user could end up on a different one on every
  request.  These servers didn't hold any state information, so the goal
  was just to distribute the load evenly.
  
  The application servers used ``sticky'' load balancing. That means
  that once a user went to a particular app server, all of her
  subsequent requests during that session were also passed to the same
  app server. The f5 hardware accomplished this using browser cookies.
  Using sticky load balancing on the app servers allowed us to do some
  local caching of user data.
  
  The load balancers ran a periodic service check on every server and
  removed any servers that failed the check from rotation. When a server
  failed, all users that were ``stuck''; to that machine were moved to
  another one.
  
  In order to ensure that no data was lost if an app server died, all
  updates were written to the database.  As a result, user data like the
  contents of a shopping cart was preserved even in cases of
  catastrophic hardware failure on an app server.  This is essential for
  a large e-commerce site.
  
  The database had a separate failover system, which we will not go into
  here. It followed standard practices recommended by our vendors.
  
  =head1 Code Structure
  
  The code was structured around the classic Model-View-Controller
  pattern, originally from SmallTalk and now often applied to web
  applications.  The MVC pattern is a way of splitting an application's
  responsibilities into three distinct layers.
  
  Classes in the Model layer represented business concepts and data,
  like products or users.  These had an API but no end-user interface.
  They knew nothing about HTTP or HTML and could be used in non-web
  applications, like cron jobs.  They talked to the database and other
  data sources, and managed their own persistence.
  
  The Controller layer translated web requests into appropriate actions
  on the Model layer.  It handled parsing parameters, checking input,
  fetching the appropriate Model objects, and calling methods on them.
  Then it determined the appropriate View to use and sendt the resulting
  HTML to the user.
  
  View objects were really HTML templates.  The Controller passed data
  from the Model objects to them and they generated a web page.  These
  were implemented with the Template Toolkit, a powerful templating
  system written in Perl.  The templates had some basic conditional
  statements and looping in them, but only enough to express the
  formatting logic.  No application control flow was embedded in the
  templates.
  
  =for html 
  <p><b>Figure 5. </b> <a href="code_structure.png">Code
  structure and interaction between the layers</a></p>
  
  =head1 Caching
  
  The core of the performance strategy is a multi-tier caching
  system. On the application servers, data objects are cached in shared
  memory with a backing store on local disk. Applications specify how
  long a data object can be out of sync with the database, and all
  future accesses during that time are served from the high-speed
  cache. This type of cache control is known as "time-to-live." The
  local cache is implemented using a I<Berkeley DB> database. Objects
  are serialized with the standard C<Storable> module from CPAN.
  
  Data objects are divided into pieces when necessary to provide finer
  granularity for expiration times. For example, product inventory is
  updated more frequently than other product data. By splitting the
  product data up, we can use a short expiration for inventory that
  keeps it in tighter sync with the database, while still using a longer
  expiration for the less volatile parts of the product data.
  
  The application servers' object caches share product data between them
  using the IP Multicast protocol and custom daemons written in C. When
  a product is placed in the cache on one server, the data is replicated
  to the cache on all other servers. This technique is very successful
  because of the high locality of access in product data. During the
  2000 Christmas season this cache achieved a 99% hit ratio, thus taking
  a large amount of work off the database.
  
  In addition to caching the data objects, entire pages that are not
  user-specific, like product detail pages, can be cached. The
  application takes the shortest expiration time of the data objects
  used in the pages and specifies that to the proxy servers as a page
  expiration time, using standard I<Expires> headers. The proxy servers
  cache the generated page on a shared NFS partition. Pages served from
  this cache have performance close to that of static pages.
  
  To allow for emergency fixes, we added a hook to C<mod_proxy> that
  deletes the cached copy of a specified URL. This was used when a page
  needed to be changed immediately to fix incorrect information.
  
  An extra advantage of this C<mod_proxy> cache is the automatic
  handling of I<If-Modified-Since> requests. We did not need to
  implement this ourselves since C<mod_proxy> already provides it.
  
  =for html
  <p><b>Figure 6. </b> <a href="proxy_architecture.png">Proxy and Cache
  Interaction</a></p>
  
  =head1 Session Tracking
  
  Users are assigned session IDs using HTTP cookies. This is done at the
  proxy servers by our customized version of C<mod_session>. Doing it at
  the proxy ensures that users accessing cached pages will still get a
  session ID assigned. The session ID is simply a key into data stored
  on the server-side. User sessions are assigned to an application
  server and continue to use that server unless it becomes
  unavailable. This is called ``sticky&rdquo; load balancing. Session
  data and other data modified by the user -- such as shopping cart
  contents -- is written to both the object cache and the database. The
  double write carries a slight performance penalty, but it allows for
  fast read access on subsequent requests without going back to the
  database. If a server failure causes a user to be moved to a different
  application server, the data is simply fetched from the database
  again.
  
  =for html
  <p><b>Figure 7.</b> Session tracking and caches</p>
  <img src="session_tracking.png" alt="Session tracking and caches" width="331"
  height="155">
  
  =head1 Security
  
  A large e-commerce site is a popular target for all types of attacks.
  When designing such a system, you have to assume that you will be
  attacked and build with security in mind, at the application level as
  well as the machine level.
  
  The main rule of thumb is ``don't trust the client!''  User-specific
  data sent to the client is protected using multiple levels of
  encryption. SSL keeps sensitive data exchanges private from anyone
  snooping on network traffic. To prevent ``session hijacking'' (when
  someone tampers with their session ID in order to gain access to
  another user's session), we include a Message Authentication Code
  (MAC) as part of the session cookie. This is generated using the
  standard C<Digest::SHA1> module from CPAN, with a seed phrase known
  only to our servers. By running the ID from the session cookie through
  this MAC algorithm we can verify that the data being presented was
  generated by us and not tampered with.
  
  In situations where we need to include some state information in an
  HTML form or URL and don't want it to be obvious to the user, we use
  the CPAN C<Crypt::> modules to encrypt and decrypt it. The
  C<Crypt::CBC> module is a good place to start.
  
  To protect against simple overload attacks, when someone uses a
  program to send high volumes of requests at our servers hoping to make
  them unavailable to customers, access to the application servers is
  controlled by a throttling program. The code is based on some work by
  Randal Schwartz in his C<Stonehenge::Throttle> module. Accesses for
  each user are tracked in compact logs written to an NFS partition. The
  program enforces limits on how many requests a user can make within a
  certain period of time.
  
  For more information on web security concerns including the use of
  MAC, encryption, and overload prevention, we recommend looking at the
  books I<CGI Programming with Perl, 2nd Edition> and I<Writing Apache
  Modules with Perl and C>, both from O'Reilly.
  
  =head1 Exception Handling
  
  When planning this system, we considered using Java as the
  implementation language. We decided to go with Perl, but we really
  missed Java's nice exception handling features. Luckily, Graham Barr's
  Error module from CPAN supplies similar capabilities in Perl.
  
  Perl already has support for trapping runtime errors and passing
  exception objects, but the Error module adds some nice syntactic
  sugar.  The following code sample is typical of how we used the
  module:
  
      try {
        do_some_stuff();
      } catch My::Exception with {
        my $E = shift;
        handle_exception($E);
      };
  
  The module allows you to create your own exception classes and trap
  for specific types of exceptions.
  
  One nice benefit of this is the way it works with C<DBI>. If you turn
  on C<DBI>'s I<RaiseError> flag and use try blocks in places where you
  want to trap exceptions, the C<Error> module can turn C<DBI> errors
  into simple C<Error> objects.
  
      try {
          $sth->execute();
      } catch Error with {
          # roll back and recover
          $dbh->rollback();
          # etc.
      };
  
  This code shows a condition where an error would indicate that we
  should roll back a database transaction. In practice, most C<DBI>
  errors indicate something unexpected happened with the database and
  the current action can't continue. Those exceptions are allowed to
  propagate up to a top-level C<try{}> block that encloses the whole
  request. When errors are caught there, we log a stacktrace and send a
  friendly error page back to the user.
  
  =head1 Templates
  
  Both the HTML and the formatting logic for merging application data
  into it is stored in the templates. They use a CPAN module called
  I<Template Toolkit>, which provides a simple but powerful syntax for
  accessing the Perl data structures passed to them by the application.
  In addition to basics like looping and conditional statements, it
  provides extensive support for modularization, allowing the use of
  includes and macros to simplify template maintenance and avoid
  redundancy.
  
  We found I<Template Toolkit> to be an invaluable tool on this project.
  Our HTML coders picked it up very quickly and were able to do nearly
  all of the templating work without help from the Perl coders. We
  supplied them with documentation of what data would be passed to each
  template and they did the rest. If you have never experienced the joy
  of telling a project manager that the HTML team can handle his
  requested changes without any help from you, you are seriously missing
  out!
  
  I<Template Toolkit> compiles templates into Perl bytecode and caches
  them in memory to improve efficiency. When template files change on
  disk they are picked up and re-compiled. This is similar to how other
  C<mod_perl> systems like Mason and C<Apache::Registry> work.
  
  By varying the template search path, we made it possible to assign
  templates to particular sections of the site, allowing a customized
  look and feel for specific areas. For example, the page header
  template in the bookstore section of the site can be different from
  the one in the video game store section. It is even possible to serve
  the same data with a different appearance in different parts of the
  site, allowing for co-branding of content.
  
  This is a sample of what a basic loop looks like when coded in
  I<Template Toolkit>:
  
      [% FOREACH item = cart.items %]
      name: [% item.name %]
      price: [% item.price %]
      [% END %]
  
  =head1 Controller Example
  
  Let's walk through a simple Hello World example that illustrates how
  the Model-View-Controller pattern is used in our code. We'll start with
  the controller code.
  
      package ESF::Control::Hello;
      use strict;
      use ESF::Control;
      @ESF::Control::Hello::ISA = qw(ESF::Control);
      use ESF::Util;
      sub handler {
          ### do some setup work
          my $class = shift;
          my $apr = ESF::Util->get_request();
          
          ### instantiate the model
          my $name = $apr->param('name');
          
          # we create a new Model::Hello object.
          my $hello = ESF::Model::Hello-E<gt>new(NAME =E<gt> $name);
          
          ### send out the view
          my $view_data{'hello'} = $hello->view();
          
          # the process_template() method is inherited
          # from the ESF::Control base class
          $class->process_template(
                  TEMPLATE => 'hello.html',
                  DATA     => \%view_data);
      }
  
  In addition to the things you see here, there are a few interesting
  details about the C<ESF::Control> base class. All requests are
  dispatched to the C<ESF::Control-E<gt>run()> method first, which wraps
  them in a C<try{}> block before calling the appropriate C<handler()>
  method. It also provides the C<process_template()> method, which runs
  C<Template Toolkit> and then sends out the results with appropriate
  HTTP headers. If the Controller specifies it, the headers can include
  C<Last-Modified> and C<Expires>, for control of page caching by the
  proxy servers.
  
  Now let's look at the corresponding Model code.
  
      package ESF::Model::Hello;
      use strict;
      sub new {
          my $class = shift;
          my %args = @_;
          my $self = bless {}, $class;
          $self{'name'} = $args{'NAME'} || 'World';
          return $self;
      }
      
      sub view {
          # the object itself will work for the view
          return shift;
      }
  
  This is a very simple Model object. Most Model objects would have some
  database and cache interaction. They would include a C<load()> method
  which accepts an ID and loads the appropriate object state from the
  database. Model objects that can be modified by the application would
  also include a C<save()> method.
  
  Note that because of Perl's flexible OO style, it is not necessary to
  call C<new()> when loading an object from the database. The C<load()>
  and C<new()> methods can both be constructors for use in different
  circumstances, both returning a blessed reference.
  
  The C<load()> method typically handles cache management as well as
  database access. Here's some pseudo-code showing a typical C<load()>
  method:
  
      sub load {
          my $class = shift;
          my %args = @_;
          my $id = $args{'ID'};
          my $self;
          unless ($self = _fetch_from_cache($id)) {
              $self = _fetch_from_database($id);
              $self->_store_in_cache();
          }
          return $self;
      }
  
  The save method would use the same approach in reverse, saving first
  to the cache and then to the database.
  
  One final thing to notice about our Model class is the C<view()>
  method. This method exists to give the object an opportunity to
  shuffle it's data around or create a separate data structure that is
  easier for use with a template. This can be used to hide a complex
  implementation from the template coders. For example, remember the
  partitioning of the product inventory data that we did to allow for
  separate cache expiration times? The product Model object is really a
  faE<ccedil>ade for several underlying implementation objects, but the
  C<view()> method on that class consolidates the data for use by the
  templates.
  
  To finish off our Hello World example, we need a template to render
  the view. This one will do the job:
  
      <html>
      <title>Hello, My Oyster</title>
      <body>
          [% PROCESS header.html %]
          Hello [% hello.name %]!
          [% PROCESS footer.html %]
      </body>
      </html>
  
  =head1 Performance Tuning
  
  Since Perl code executes so quickly under C<mod_perl>, the performance
  bottleneck is usually at the database. We applied all the documented
  tricks for improving C<DBD::Oracle> performance. We used bind
  variables, C<prepare_cached()>, C<Apache::DBI>, and adjustments to the
  C<RowCache> buffer size.
  
  The big win of course is avoiding going to the database in the first
  place. The caching work we did had a huge impact on performance.
  Fetching product data from the I<Berkeley DB> cache was about ten
  times faster than fetching it from the database. Serving a product
  page from the proxy cache was about ten times faster than generating
  it on the application server from cached data. Clearly the site would
  never have survived under heavy load without the caching.
  
  Partitioning the data objects was also a big win. We identified
  several different subsets of product data that could be loaded and
  cached independently. When an application needed product data, it
  could specify which subset was required and skip loading the
  unnecessary data from the database.
  
  Another standard performance technique we followed was avoiding
  unnecessary object creation. The C<Template> object is created the
  first time it's used and then cached for the life of the Apache
  process. Socket connections to search servers are cached in a way
  similar to what C<Apache::DBI> does for database
  connections. Resources that are used frequently within the scope of a
  request, such as database handles and session objects, were cached in
  C<mod_perl>'s C<$r-E<gt>pnotes()> until the end of the request.
  
  =head1 Trap: Nested Exceptions
  
  When trying out a new technology like the C<Error> module, there are
  bound to be some things to watch out for. We found a certain code
  structure that causes a memory leak every time it is executed. It
  involves nested C<try{}> blocks, and looks like this:
  
      my $foo;
      try {
          # some stuff...
          try {
              $foo++;
              # more stuff...
          } catch Error with {
              # handle error
          };
      
      } catch Error with {
          # handle other error
      };
  
  It's not Graham Barr's fault that this leaks; it is simply a
  by-product of the fact that the C<try> and C<catch> keywords are
  implemented using anonymous subroutines. This code is equivalent to
  the following:
  
      my $foo;
      $subref1 = sub {
          $subref2 = sub {
              $foo++;
          };
      };
  
  This nested subroutine creates a closure for C<$foo> and will make a
  new copy of the variable every time it is executed. The situation is
  easy to avoid once you know to watch out for it.
  
  =head2 Berkeley DB
  
  One of the big wins in our architecture was the use of I<Berkeley DB>.
  Since most people are not familiar with it's more advanced features,
  we'll give a brief overview here.
  
  The C<DB_File> module is part of the standard Perl distribution.
  However, it only supports the interface of I<Berkeley DB> version
  1.85, and doesn't include the interesting features of later
  releases. To get those, you'll need the C<BerkeleyDB.pm> module,
  available from CPAN.  This module can be tricky to build, but
  comprehensive instructions are included.
  
  Newer versions of I<Berkeley DB> offer many features that help
  performance in a C<mod_perl> environment. To begin with, database
  files can be opened once at the start of the program and kept open,
  rather than opened and closed on every request. I<Berkeley DB> will
  use a shared memory buffer to improve data access speed for all
  processes using the database. Concurrent access is directly supported
  with locking handled for you by the database. This is a huge win over
  C<DB_File>, which requires you to do your own locking. Locks can be at
  a database level, or at a memory page level to allow multiple
  simultaneous writers. Transactions with rollback capability are also
  supported.
  
  This all sounds too good to be true, but there are some downsides. The
  documentation is somewhat sparse, and you will probably need to refer
  to the C API if you need to understand how to do anything complicated.
  
  A more serious problem is database corruption. When an Apache process
  using I<Berkeley DB> dies from a hard kill or a segfault, it can
  corrupt the database. A corrupted database will sometimes cause
  subsequent opening attempts to hang. According to the people we talked
  to at Sleepycat Software (which provides commercial support for
  I<Berkeley DB>), this can happen even with the transactional mode of
  operation. They are working on a way to fix the problem. In our case,
  none of the data stored in the cache was essential for operation so we
  were able to simply clear it out when restarting an application
  server.
  
  Another thing to watch out for is deadlocks. If you use the page-level
  locking option, you have to handle deadlocks. There is a daemon
  included in the distribution that will watch for deadlocks and fix
  them, or you can handle them yourself using the C API.
  
  After trying a few different things, we recommend that you use
  database-level locking. It's much simpler, and cured our problems. We
  didn't see any significant performance hit from switching to this mode
  of locking. The one thing you need to watch out for when using
  exclusive database level write locks are long operations with cursors
  that tie up the database. We split up some of our operations into
  multiple writes in order to avoid this problem.
  
  If you have a good C coder on your team, you may want to try the
  alternate approach that we finally ended up with. You can write your
  own daemon around I<Berkeley DB> and use it in a client/server style
  over Unix sockets. This allows you to catch signals and ensure a safe
  shutdown. You can also write your own deadlock handling code this way.
  
  =head1 Valuable Tools
  
  If you plan to do any serious Perl development, you should really take
  the time to become familiar with some of the available development
  tools. The debugger in particular is a lifesaver, and it works with
  C<mod_perl>. There is a profiler called C<Devel::DProf>, which also
  works with C<mod_perl>. It's definitely the place to start when
  performance tuning your application.
  
  We found the ability to run our complete system on individual's
  workstations to be extremely useful. Everyone could develop on his own
  machine, and coordinate changes using I<CVS> source control.
  
  For object modeling and design, we used the open source C<Dia> program
  and I<Rational Rose>. Both support working with UML and are great for
  generating pretty class diagrams for your cubicle walls.
  
  =head1 Do Try This at Home
  
  Since we started this project, a number of development frameworks that
  offer support for this kind of architecture have come out. We don't
  have direct experience using these, but they have a similar design and
  may prove useful to you if you want to take an MVC approach with your
  system.
  
  C<Apache::PageKit> is a C<mod_perl> module available from CPAN which
  provides a basic MVC structure for web applications. It uses the
  C<HTML::Template> module for building views.
  
  I<OpenInteract> is a recently released web application framework in
  Perl, which works together with the persistence layer C<SPOPS>. Both
  are available from CPAN.
  
  The I<Application Toolkit> from Extropia is a comprehensive set of
  Perl classes for building web apps. It has excellent documentation and
  takes good advantage of existing CPAN modules. You can find it on
  http://www.extropia.com/.
  
  If you want a ready-to-use cache module, take a look at the Perl-cache
  project on http://sourceforge.net/. This is the next generation of the
  popular C<File::Cache> module.
  
  The Java world has many options as well. The I<Struts> framework, part
  of the I<Jakarta> project, is a good open source choice. There are
  also commercial products from several vendors that follow this sort of
  design. Top contenders include I<ATG Dynamo>, I<BEA WebLogic>, and
  I<IBM WebSphere>.
  
  =head1 An Open Source Success Story
  
  By building on the open source software and community, we were able to
  create a top-tier web site with a minimum of cost and effort. The
  system we ended up with is scalable to huge amounts of traffic. It
  runs on mostly commodity hardware making it easy to grow when the need
  arises. Perhaps best of all, it provided tremendous learning
  opportunities for our developers, and made us a part of the larger
  development community.
  
  We've contributed patches from our work back to various open source
  projects, and provided help on mailing lists. We'd like to take this
  opportunity to officially thank the open source developers who
  contributed to projects mentioned here. Without them, this would not
  have been possible. We also have to thank the hardworking web
  developers at eToys. The store may be closed, but the talent that
  built it lives on.
  
  =head1 Maintainers
  
  The maintainer is the person(s) you should contact with updates,
  corrections and patches.
  
  Per Einar Ellefsen E<lt>per.einar (at) skynet.beE<gt>
  
  =head1 Authors
  
  =over
  
  =item * Bill Hilf E<lt>bill (at) hilfworks.comE<gt>
  
  =item * Perrin Harkins E<lt>perrin (at) elem.comE<gt>
  
  =back
  
  Only the major authors are listed above. For contributors see the
  Changes file.
  
  =cut
  
  
  
  1.1                  
modperl-docs/src/docs/tutorials/scale_etoys/machine_layout.png
  
        <<Binary file>>
  
  
  1.1                  
modperl-docs/src/docs/tutorials/scale_etoys/proxy_architecture.png
  
        <<Binary file>>
  
  
  1.1                  
modperl-docs/src/docs/tutorials/scale_etoys/proxy_servers.png
  
        <<Binary file>>
  
  
  1.1                  
modperl-docs/src/docs/tutorials/scale_etoys/search_servers.png
  
        <<Binary file>>
  
  
  1.1                  
modperl-docs/src/docs/tutorials/scale_etoys/session_tracking.png
  
        <<Binary file>>


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

cvs commit: modperl-docs/src/docs/tutorials/scale_etoys app_servers.png code_structure.png etoys.pod machine_layout.png proxy_architecture.png proxy_servers.png search_servers.png session_tracking.png

Reply via email to