stas 02/05/10 00:45:11 Modified: src/docs/general config.cfg cvs_howto.pod Added: src/docs/general .cvsignore advocacy.pod control.pod hardware.pod multiuser.pod perl_myth.pod perl_reference.pod Log: docs common to all mod_perl versions Submitted by: Thomas Klausner <[EMAIL PROTECTED]> Revision Changes Path 1.2 +9 -2 modperl-docs/src/docs/general/config.cfg Index: config.cfg =================================================================== RCS file: /home/cvs/modperl-docs/src/docs/general/config.cfg,v retrieving revision 1.1 retrieving revision 1.2 diff -u -r1.1 -r1.2 --- config.cfg 29 Apr 2002 16:48:06 -0000 1.1 +++ config.cfg 10 May 2002 07:45:11 -0000 1.2 @@ -6,11 +6,18 @@ title => "General Documentation", abstract => <<EOB, -Here you can find documentation not directly concerned with mod_perl, -but still usefull for most mod_perl projects. +Here you can find documentation concerning mod_perl in general, +but also not strictly mod_perl related information that is still +very usefull for working with mod_perl. EOB chapters => [qw( + perl_reference.pod + multiuser.pod + hardware.pod + control.pod + advocacy.pod + perl_myth.pod cvs_howto.pod Changes.pod )], 1.3 +1 -1 modperl-docs/src/docs/general/cvs_howto.pod Index: cvs_howto.pod =================================================================== RCS file: /home/cvs/modperl-docs/src/docs/general/cvs_howto.pod,v retrieving revision 1.2 retrieving revision 1.3 diff -u -r1.2 -r1.3 --- cvs_howto.pod 29 Apr 2002 17:08:17 -0000 1.2 +++ cvs_howto.pod 10 May 2002 07:45:11 -0000 1.3 @@ -36,7 +36,7 @@ % cvs -d ":pserver:[EMAIL PROTECTED]:/home/cvspublic" co modperl After cvs finished downloading the files you will find a new directory -calles I<modperl> in the current working directory. +called I<modperl> in the current working directory. =head2 keeping your copy up to date 1.1 modperl-docs/src/docs/general/.cvsignore Index: .cvsignore =================================================================== cache.*.dat 1.1 modperl-docs/src/docs/general/advocacy.pod Index: advocacy.pod =================================================================== =head1 NAME mod_perl Advocacy =head1 Description Having a hard time getting mod_perl into your organization? We have collected some arguments you can use to convince your boss why the organization wants mod_perl. You can contact the L<mod_perl advocacy list|maillist::list-advocacy> if you have any more questions, or good arguments you have used (any success-stories are also welcome to L<the docs-dev list|maillist::list-docs-dev>). Also see L<Popular Perl Complaints and Myths|docs::general::perl_myth>. =head1 Thoughts about scalability and flexibility Your need for scalability and flexibility depends on what you need from your web site. If you only want a simple guest book or database gateway with no feature headroom, you can get away with any EASY_AND_FAST_TO_DEVELOP_TOOL (Exchange, MS IIS, Lotus Notes, etc). Experience shows that you will soon want more functionality, at which point you'll discover the limitations of these "easy" tools. Gradually, your boss will ask for increasing functionality and at some point you'll realize that the tool lacks flexibility and/or scalability. Then your boss will either buy another EASY_AND_FAST_TO_DEVELOP_WITH_TOOLS and repeat the process (with different unforseen problems), or you'll start investing time in learning how to use a powerful, flexible tool to make the long-term development cycle easier. 
If you and your company are serious about delivering flexible Internet functionality, do your homework. Then urge your boss to invest a little extra time and resources in choosing the right tool for the job. The extra quality and manageability of your site along with your ability to deliver new and improved functionality of high quality and in good time will prove the superiority of using solid flexible tools. =head1 The boss, the developer and advocacy Each developer has a boss who participates in the decision-making process. Remember that the boss considers input from sales people, developers, the media and associates before handing down large decisions. Of course, results count! A sales brochure makes very little impact compared to a working demonstration, and demonstrations of company-specific and developer-specific results count for a lot! Personally, when I discovered mod_perl I did a lot of testing and coding at home and at work. Once I had a working heavy application, I came to my boss with two URLs - one for the plain CGI server and the other for the mod_perl-enabled server. It took about 30 secs for my boss to say: `Go with it'. Of course since then I have had to provide all the support for other developers, which is why I took time to learn it in first place (and why this guide was created!). Chances are that if you've done your homework, learnt the tools and can deliver results, you'll have a successful project. If you convince your boss to try a tool that you don't know very well, your results may suffer. If your boss follows your development process closely and sees that your progress is much worse than expected, you might be told to "forget it" and mod_perl might not get a second chance. Advocacy is a great thing for the open-source software movement, but it's best done quietly until you have confidence that you can show productivity. If you can demonstrate to your boss a heavy CGI which is running much faster under mod_perl, that may be a strong argument for further evaluation. Your company may even sponsor a portion of your learning process. Learn the technology by working on sample projects. Learn how to support yourself and learn how to get support from the community; then advocate your ideas to your boss. Then you'll have the knowledge; your company will have the benefit; and mod_perl will have the reputation it deserves. =head1 A summary of perl/CGI discussion at slashdot.org Well, there was a nice discussion of merits of Perl in CGI world. I took the time to summarize this thread, so here is what I've got: Perl Domination in CGI Programming? http://slashdot.org/askslashdot/99/10/20/1246241.shtml =over 4 =item * Perl is cool and fun to code with. =item * Perl is very fast to develop with. =item * Perl is even faster to develop with if you know what CPAN is. :) =item * Math intensive code and other stuff which is faster in C/C++, can be plugged into Perl with XS/SWIG and may be used transparently by Perl programmers. =item * Most CGI applications do text processing, at which Perl excels =item * Forking and loading (unless the code is shared) of C/C++ CGI programs produces an overhead. =item * Except for Intranets, bandwidth is usually a bigger bottleneck than Perl performance, although this might change in the future. =item * For database driven applications, the database itself is a bottleneck. Lots of posts talk about latency vs throughput. =item * mod_perl, FastCGI, Velocigen and PerlEx all give good performance gains over plain mod_cgi. 
=item * Other light alternatives to Perl and its derivatives which have been mentioned: PHP, Python. =item * There were almost no voices from users of M$ and similar technologies, I guess that's because they don't read http://slashdot.org :) =item * Many said that in many people's minds: 'CGI' eq 'Perl' =back =head1 Maintainers Maintainer is the person(s) you should contact with updates, corrections and patches. =over =item * Stas Bekman E<lt>stas (at) stason.orgE<gt> =back =head1 Authors =over =item * Stas Bekman E<lt>stas (at) stason.orgE<gt> =back Only the major authors are listed above. For contributors see the Changes file. =cut 1.1 modperl-docs/src/docs/general/control.pod Index: control.pod =================================================================== =head1 NAME Controlling and Monitoring the Server =head1 Description Covers techniques to restart mod_perl enabled Apache, SUID scripts, monitoring, and other maintenance chores, as well as some specific setups. =head1 Restarting Techniques All of these techniques require that you know the server process id (PID). The easiest way to find the PID is to look it up in the I<httpd.pid> file. It's easy to discover where to look, by looking in the I<httpd.conf> file. Open the file and locate the entry C<PidFile>. Here is the line from one of my own I<httpd.conf> files: PidFile /usr/local/var/httpd_perl/run/httpd.pid As you see, with my configuration the file is I</usr/local/var/httpd_perl/run/httpd.pid>. Another way is to use the C<ps> and C<grep> utilities. Assuming that the binary is called I<httpd_perl>, we would do: % ps auxc | grep httpd_perl or maybe: % ps -ef | grep httpd_perl This will produce a list of all the C<httpd_perl> (parent and children) processes. You are looking for the parent process. If you run your server as root, you will easily locate it since it belongs to root. If you run the server as some other user (when you L<don't have root access|guide::install/Installation_Without_Superuser_Privileges>, the processes will belong to that user unless defined differently in I<httpd.conf>. It's still easy to find which is the parent--usually it's the process with the smallest PID. You will see several C<httpd> processes running on your system, but you should never need to send signals to any of them except the parent, whose pid is in the I<PidFile>. There are three signals that you can send to the parent: C<SIGTERM>, C<SIGHUP>, and C<SIGUSR1>. Some folks prefer to specify signals using numerical values, rather than using symbols. If you are looking for these, check out your C<kill(1)> man page. My page points to I</usr/include/linux/signal.h>, the relevant entries are: #define SIGHUP 1 /* hangup, generated when terminal disconnects */ #define SIGKILL 9 /* last resort */ #define SIGTERM 15 /* software termination signal */ #define SIGUSR1 30 /* user defined signal 1 */ Note that to send these signals from the command line the C<SIG> prefix must be omitted and under some operating systems they will need to be preceded by a minus sign, e.g. C<kill -15> or C<kill -TERM> followed by the PID. =head1 Server Stopping and Restarting We will concentrate here on the implications of sending C<TERM>, C<HUP>, and C<USR1> signals (as arguments to kill(1)) to a mod_perl enabled server. See http://www.apache.org/docs/stopping.html for documentation on the implications of sending these signals to a plain Apache server. 
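If you prefer to send these signals from Perl rather than from the shell, the C<kill()> built-in accepts the same signal names without the C<SIG> prefix. Here is a minimal sketch, assuming the I<PidFile> location from the example above (adjust the path and the default signal to your own setup):

  #!/usr/bin/perl -w
  # a minimal sketch: read the parent PID from the PidFile and signal it
  use strict;

  my $pidfile = "/usr/local/var/httpd_perl/run/httpd.pid";
  my $signal  = shift || 'USR1';    # TERM, HUP or USR1

  open PID, $pidfile or die "Cannot open $pidfile: $!";
  my ($pid) = <PID> =~ /(\d+)/ or die "No numeric PID found in $pidfile";
  close PID;

  kill $signal, $pid or die "Cannot send SIG$signal to process $pid: $!";
  print "Sent SIG$signal to the parent process (pid $pid)\n";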
=over 4

=item TERM Signal: Stop Now

Sending the C<TERM> signal to the parent causes it to immediately attempt to kill off all its children. Any requests in progress are terminated, and no further requests are served. This process may take quite a few seconds to complete. To stop a child, the parent sends it a C<SIGHUP> signal. If that fails it sends another. If that fails it sends the C<SIGTERM> signal, and as a last resort it sends the C<SIGKILL> signal. For each failed attempt to kill a child it makes an entry in the I<error_log>.

When all the child processes have terminated, the parent itself exits and any open log files are closed. This is when all the accumulated C<END> blocks are executed, apart from the ones located in scripts running under C<Apache::Registry> or C<Apache::PerlRun> handlers. In the latter case, C<END> blocks are executed after each request is served.

=item HUP Signal: Restart Now

Sending the C<HUP> signal to the parent causes it to kill off its children as if the C<TERM> signal had been sent, i.e. any requests in progress are terminated; but the parent does not exit. Instead, the parent re-reads its configuration files, spawns a new set of child processes and continues to serve requests. It is almost equivalent to stopping and then restarting the server.

If the configuration files contain errors when restart is signaled, the parent will exit, so it is important to check the configuration files for errors before issuing a restart. How to perform the check will be covered shortly.

Sometimes using this approach to restart a mod_perl enabled Apache may cause the processes' memory usage to grow incrementally after each restart. This happens when Perl code loaded in memory is not completely torn down, leading to a memory leak.

=item USR1 Signal: Gracefully Restart Now

The C<USR1> signal causes the parent process to advise the children to exit after serving their current requests, or to exit immediately if they're not serving a request. The parent re-reads its configuration files and re-opens its log files. As each child dies off the parent replaces it with a child from the new generation (the new children use the new configuration) and it begins serving new requests immediately.

The only difference between C<USR1> and C<HUP> is that C<USR1> allows the children to complete any current requests before they are killed off, so there is no interruption in service; with the C<HUP> signal it might take a few seconds for the restart to complete, and no requests are served during that time.

=back

By default, if a server is restarted (using C<kill -USR1 `cat logs/httpd.pid`> or with the C<HUP> signal), Perl scripts and modules are not reloaded. To reload C<PerlRequire>s, C<PerlModule>s, other C<use()>'d modules and flush the C<Apache::Registry> cache, use this directive in I<httpd.conf>:

  PerlFreshRestart On

Make sure you read L<Evil things might happen when using PerlFreshRestart|guide::troubleshooting/Evil_things_might_happen_when_using_PerlFreshRestart>.

=head1 Speeding up the Apache Termination and Restart

We've already mentioned that restart or termination can sometimes take quite a long time (e.g. tens of seconds) for a mod_perl server. The reason for that is a call to the C<perl_destruct()> Perl API function during the child exit phase. This will cause proper execution of C<END> blocks found during server startup and will invoke the C<DESTROY> method on global objects which are still alive.
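To make this concrete, here is a minimal sketch (the module name C<My::CleanupDemo> is made up for illustration) of the kind of code C<perl_destruct()> takes care of. If a module like this is loaded at server startup, the C<warn()> calls in its C<END> block and C<DESTROY> method will show up in the I<error_log> when a child process exits:

  package My::CleanupDemo;
  use strict;

  # a global object that stays alive until the child server shuts down
  use vars qw($global);
  $global = bless { name => 'demo' }, __PACKAGE__;

  sub DESTROY {
      my $self = shift;
      # e.g. disconnect database handles, release locks, etc.
      warn "DESTROY called on '$self->{name}' object in process $$\n";
  }

  END {
      warn "END block executed in process $$\n";
  }

  1;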
It is also possible that this operation may take a long time to finish, causing a long delay during a restart. Sometimes this will be followed by a series of messages appearing in the server I<error_log> file, warning that certain child processes did not exit as expected. This happens when, after a few attempts to advise the child process to quit, the child is still in the middle of C<perl_destruct()>, and a lethal C<KILL> signal is sent, aborting whatever operation the child happened to be executing and I<brutally> killing it.

If your code does not contain any C<END> blocks or C<DESTROY> methods which need to be run during child server shutdown, or if it has them but executing them is not important, this destruction can be avoided by setting the C<PERL_DESTRUCT_LEVEL> environment variable to C<-1>. For example add this setting to the I<httpd.conf> file:

  PerlSetEnv PERL_DESTRUCT_LEVEL -1

What constitutes a significant cleanup? Any change of state outside of the current process that would not be handled by the operating system itself. So committing database transactions and removing the lock on some resource are significant operations, but closing an ordinary file isn't.

=head1 Using apachectl to Control the Server

The Apache distribution comes with a script to control the server. It's called C<apachectl> and it is installed into the same location as the httpd executable. We will assume for the sake of our examples that it's in C</usr/local/sbin/httpd_perl/apachectl>:

To start httpd_perl:

  % /usr/local/sbin/httpd_perl/apachectl start

To stop httpd_perl:

  % /usr/local/sbin/httpd_perl/apachectl stop

To restart httpd_perl (if it is running, send C<SIGHUP>; if it is not already running just start it):

  % /usr/local/sbin/httpd_perl/apachectl restart

Do a graceful restart by sending a C<SIGUSR1>, or start if not running:

  % /usr/local/sbin/httpd_perl/apachectl graceful

To do a configuration test:

  % /usr/local/sbin/httpd_perl/apachectl configtest

Replace C<httpd_perl> with C<httpd_docs> in the above calls to control the C<httpd_docs> server.

There are other options for C<apachectl>; use the C<help> option to see them all.

It's important to remember that C<apachectl> uses the PID file, which is specified by the C<PidFile> directive in I<httpd.conf>. If you delete the PID file by hand while the server is running, C<apachectl> will be unable to stop or restart the server.

=head1 Safe Code Updates on a Live Production Server

You have prepared a new version of code, uploaded it onto a production server, restarted it and it doesn't work. What could be worse than that? You also cannot go back, because you have overwritten the good working code.

It's quite easy to prevent this: just don't overwrite the previous working files!

Personally I do all updates on the live server with the following sequence. Assume that the server root directory is I</home/httpd/perl/rel>. When I'm about to update the files I create a new directory I</home/httpd/perl/beta>, copy the old files from I</home/httpd/perl/rel> into it and update it with the new files. Then I do some last sanity checks (check that file permissions are [read+executable], and run C<perl -c> on the new modules to make sure there are no errors in them). When I think I'm ready I do:

  % cd /home/httpd/perl
  % mv rel old && mv beta rel && stop && sleep 3 && restart && err

Let me explain what this does. Firstly, note that I put all the commands on one line, separated by C<&&>, and only then press the C<Enter> key.
As I am working remotely, this ensures that if I suddenly lose my connection (sadly this happens sometimes) I won't leave the server down if only the C<stop> command squeezed in. C<&&> also ensures that if any command fails, the rest won't be executed. I am using aliases (which I have already defined) to make the typing easier: % alias | grep apachectl graceful /usr/local/apache/bin/apachectl graceful rehup /usr/local/apache/sbin/apachectl restart restart /usr/local/apache/bin/apachectl restart start /usr/local/apache/bin/apachectl start stop /usr/local/apache/bin/apachectl stop % alias err tail -f /usr/local/apache/logs/error_log Taking the line apart piece by piece: mv rel old && back up the working directory to I<old> mv beta rel && put the new one in its place stop && stop the server sleep 3 && give it a few seconds to shut down (it might take even longer) restart && C<restart> the server err view of the tail of the I<error_log> file in order to see that everything is OK C<apachectl> generates the status messages a little too early (e.g. when you issue C<apachectl stop> it says the server has been stopped, while in fact it's still running) so don't rely on it, rely on the C<error_log> file instead. Also notice that I use C<restart> and not just C<start>. I do this because of Apache's potentially long stopping times (it depends on what you do with it of course!). If you use C<start> and Apache hasn't yet released the port it's listening to, the start would fail and C<error_log> would tell you that the port is in use, e.g.: Address already in use: make_sock: could not bind to port 8080 But if you use C<restart>, it will wait for the server to quit and then will cleanly restart it. Now what happens if the new modules are broken? First of all, I see immediately an indication of the problems reported in the C<error_log> file, which I C<tail -f> immediately after a restart command. If there's a problem, I just put everything back as it was before: % mv rel bad && mv old rel && stop && sleep 3 && restart && err Usually everything will be fine, and I have had only about 10 seconds of downtime, which is pretty good! =head1 An Intentional Disabling of Live Scripts What happens if you really must take down the server or disable the scripts? This situation might happen when you need to do some maintenance work on your database server. If you have to take your database down then any scripts that use it will fail. If you do nothing, the user will see either the grey C<An Error has happened> message or perhaps a customized error message if you have added code to trap and customize the errors. See L<Redirecting Errors to the Client instead of to the error_log|guide::snippets/Redirecting_Errors_to_the_Client_Instead_of_error_log> for the latter case. A much friendlier approach is to confess to your users that you are doing some maintenance work and plead for patience, promising (keep the promise!) that the service will become fully functional in X minutes. There are a few ways to do this: The first doesn't require messing with the server. It works when you have to disable a script running under C<Apache::Registry> and relies on the fact that it checks whether the file was modified before using the cached version. Obviously it won't work under other handlers because these serve the compiled version of the code and don't check to see if there was a change in the code on the disk. 
So if you want to disable an C<Apache::Registry> script, prepare a little script like this: /home/http/perl/maintenance.pl ---------------------------- #!/usr/bin/perl -Tw use strict; use CGI; my $q = new CGI; print $q->header, $q->p( "Sorry, the service is temporarily down for maintenance. It will be back in ten to fifteen minutes. Please, bear with us. Thank you!"); So if you now have to disable a script for example C</home/http/perl/chat.pl>, just do this: % mv /home/http/perl/chat.pl /home/http/perl/chat.pl.orig % ln -s /home/http/perl/maintenance.pl /home/http/perl/chat.pl Of course you server configuration should allow symbolic links for this trick to work. Make sure you have the directive Options FollowSymLinks in the C<E<lt>LocationE<gt>> or C<E<lt>DirectoryE<gt>> section of your I<httpd.conf>. When you're done, it's easy to restore the previous setup. Just do this: % mv /home/http/perl/chat.pl.orig /home/http/perl/chat.pl which overwrites the symbolic link. Now make sure that the script will have the current timestamp: % touch /home/http/perl/chat.pl Apache will automatically detect the change and will use the moved script instead. The second approach is to change the server configuration and configure a whole directory to be handled by a C<My::Maintenance> handler (which you must write). For example if you write something like this: My/Maintenance.pm ------------------ package My::Maintenance; use strict; use Apache::Constants qw(:common); sub handler { my $r = shift; print $r->send_http_header("text/plain"); print qq{ We apologize, but this service is temporarily stopped for maintenance. It will be back in ten to fifteen minutes. Please, bear with us. Thank you! }; return OK; } 1; and put it in a directory that is in the server's C<@INC>, to disable all the scripts in Location C</perl> you would replace: <Location /perl> SetHandler perl-script PerlHandler My::Handler [snip] </Location> with <Location /perl> SetHandler perl-script PerlHandler My::Maintenance [snip] </Location> Now restart the server. Your users will be happy to go and read http://slashdot.org for ten minutes, knowing that you are working on a much better version of the service. If you need to disable a location handled by some module, the second approach would work just as well. =head1 SUID Start-up Scripts If you want to allow a few people in your team to start and stop the server you will have to give them the root password, which is not a good thing to do. The less people know the password, the less problems are likely to be encountered. But there is an easy solution for this problem available on UNIX platforms. It's called a setuid executable. =head2 Introduction to SUID Executables The setuid executable has a setuid permissions bit set. This sets the process's effective user ID to that of the file upon execution. You perform this setting with the following command: % chmod u+s filename You probably have used setuid executables before without even knowing about it. For example when you change your password you execute the C<passwd> utility, which among other things modifies the I</etc/passwd> file. In order to change this file you need root permissions, the C<passwd> utility has the setuid bit set, therefore when you execute this utility, its effective ID is the same of the root user ID. You should avoid using setuid executables as a general practice. The less setuid executables you have the less likely that someone will find a way to break into your system, by exploiting some bug you didn't know about. 
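As a quick audit aid, here is a minimal Perl sketch (the directories to scan are only examples) that walks the filesystem and prints every file with the setuid bit set, so you can see how many setuid executables your system really has:

  #!/usr/bin/perl -w
  # a minimal sketch: report files with the setuid bit set
  use strict;
  use File::Find;

  my @dirs = qw(/bin /sbin /usr);

  find(sub {
      return unless -f $_;            # plain files only
      my $mode = (stat _)[2];
      return unless $mode & 04000;    # setuid bit set?
      printf "%04o %s\n", $mode & 07777, $File::Find::name;
  }, @dirs);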
When the executable is setuid to root, you have to make sure that it doesn't have the group and world read and write permissions. If we take a look at the C<passwd> utility we will see:

  % ls -l /usr/bin/passwd
  -r-s--x--x 1 root root 12244 Feb 8 00:20 /usr/bin/passwd

You achieve this with the following command:

  % chmod 4511 filename

The first digit (4) stands for the setuid bit, the second digit (5) is a compound of read (4) and execute (1) permissions for the user, and the third and fourth digits set the execute permissions for the group and the world.

=head2 Apache Startup SUID Script's Security

In our case, we want to allow only a specific group of users to execute the setuid script, all of whom belong to the same group. For the sake of our example we will use the group named I<apache>. It's important that users who aren't root or who don't belong to the I<apache> group are not able to execute this script. Therefore we perform the following commands:

  % chgrp apache apachectl
  % chmod 4510 apachectl

The execution order is important. If you swap the command execution order you will lose the setuid bit.

Now if we look at the file we see:

  % ls -l apachectl
  -r-s--x--- 1 root apache 32 May 13 21:52 apachectl

Now we are all set... Almost...

When you start Apache, Apache and Perl modules are loaded and code can be executed. Since all this happens with the effective ID of root, any code is executed as if it were run by the root user. You should be very careful because, although you haven't given anyone the root password, all the users in the I<apache> group have indirect root access. This means that if Apache loads some module or executes some code that is writable by any of these users, they can plant code that will allow them to gain shell access to the root account and become real root.

Of course if you don't trust your team you shouldn't use this solution in the first place. You can try to check that all the files Apache loads aren't writable by anyone but root, but there are too many of them, especially in the mod_perl case, where many Perl modules are loaded at server startup.

By the way, don't let all this setuid stuff confuse you -- when the parent process is loaded, the child processes are spawned as non-root processes. This section has presented a way to allow non-root users to start the server as the root user; the rest is exactly the same as if you were executing the script as root in the first place.

=head2 Sample Apache Startup SUID Script

Now if you are still with us, here is an example of the setuid Apache startup script.

Note the line marked C<WORKAROUND>, which fixes an obscure error when starting mod_perl enabled Apache by setting the real UID to the effective UID. Without this workaround, a mismatch between the real and the effective UID causes Perl to croak on the C<-e> switch.

Note that you must be using a version of Perl that recognizes and emulates the suid bits in order for this to work. This script will do different things depending on whether it is named C<start_httpd>, C<stop_httpd> or C<restart_httpd>. You can use symbolic links for this purpose.

  suid_apache_ctl
  ---------------
  #!/usr/bin/perl -T
  # These constants will need to be adjusted.
$PID_FILE = '/home/www/logs/httpd.pid'; $HTTPD = '/home/www/httpd -d /home/www'; # These prevent taint warnings while running suid $ENV{PATH}='/bin:/usr/bin'; $ENV{IFS}=''; # This sets the real to the effective ID, and prevents # an obscure error when starting apache/mod_perl $< = $>; # WORKAROUND $( = $) = 0; # set the group to root too # Do different things depending on our name ($name) = $0 =~ m|([^/]+)$|; if ($name eq 'start_httpd') { system $HTTPD and die "Unable to start HTTP"; print "HTTP started.\n"; exit 0; } # extract the process id and confirm that it is numeric $pid = `cat $PID_FILE`; $pid =~ /(\d+)/ or die "PID $pid not numeric"; $pid = $1; if ($name eq 'stop_httpd') { kill 'TERM',$pid or die "Unable to signal HTTP"; print "HTTP stopped.\n"; exit 0; } if ($name eq 'restart_httpd') { kill 'HUP',$pid or die "Unable to signal HTTP"; print "HTTP restarted.\n"; exit 0; } die "Script must be named start_httpd, stop_httpd, or restart_httpd.\n"; =head1 Preparing for Machine Reboot When you run your own development box, it's okay to start the webserver by hand when you need to. On a production system it is possible that the machine the server is running on will have to be rebooted. When the reboot is completed, who is going to remember to start the server? It's easy to forget this task, and what happens if you aren't around when the machine is rebooted? After the server installation is complete, it's important not to forget that you need to put a script to perform the server startup and shutdown into the standard system location, for example I</etc/rc.d> under RedHat Linux, or I</etc/init.d/apache> under Debian Slink Linux. This is the directory which contains scripts to start and stop all the other daemons. The directory and file names vary from one Operating System (OS) to another, and even between different distributions of the same OS. Generally the simplest solution is to copy the C<apachectl> script to your startup directory or create a symbolic link from the startup directory to the C<apachectl> script. You will find C<apachectl> in the same directory as the httpd executable after Apache installation. If you have more than one Apache server you will need a separate script for each one, and of course you will have to rename them so that they can co-exist in the same directories. For example on a RedHat Linux machine with two servers, I have the following setup: /etc/rc.d/init.d/httpd_docs /etc/rc.d/init.d/httpd_perl /etc/rc.d/rc3.d/S91httpd_docs -> ../init.d/httpd_docs /etc/rc.d/rc3.d/S91httpd_perl -> ../init.d/httpd_perl /etc/rc.d/rc6.d/K16httpd_docs -> ../init.d/httpd_docs /etc/rc.d/rc6.d/K16httpd_perl -> ../init.d/httpd_perl The scripts themselves reside in the I</etc/rc.d/init.d> directory. There are symbolic links to these scripts in other directories. The names are the same as the script names but they have numerical prefixes, which are used for executing the scripts in a particular order: the lower numbers are executed earlier. When the system starts (level 3) we want the Apache to be started when almost all of the services are running already, therefore I've used I<S91>. For example if the mod_perl enabled Apache issues a C<connect_on_init()> the SQL server should be started before Apache. When the system shuts down (level 6), Apache should be stopped as one of the first processes, therefore I've used C<K16>. Again if the server does some cleanup processing during the shutdown event and requires third party services to be running (e.g. 
SQL server) it should be stopped before these services. Notice that it's normal for more than one symbolic link to have the same sequence number.

Under RedHat Linux and similar systems, when a machine is booted and its runlevel is set to 3 (multiuser + network), Linux goes into I</etc/rc.d/rc3.d/> and executes the scripts the symbolic links point to with the C<start> argument. When it sees I<S91httpd_perl>, it executes:

  /etc/rc.d/init.d/httpd_perl start

When the machine is shut down, the scripts are executed through links from the I</etc/rc.d/rc6.d/> directory. This time the scripts are called with the C<stop> argument, like this:

  /etc/rc.d/init.d/httpd_perl stop

Most systems have GUI utilities to automate the creation of symbolic links. For example RedHat Linux includes the C<control-panel> utility, which amongst other things includes the C<RunLevel Manager> (which can be invoked directly as either ntsysv(8) or tksysv(8)). This will help you to create the proper symbolic links. Of course before you use it, you should put C<apachectl> or similar scripts into the I<init.d> or equivalent directory. Or you can have a symbolic link to some other location instead.

The simplest approach is to use the chkconfig(8) utility which adds and removes the services for you. The following example shows how to add an I<httpd_perl> startup script to the system.

First move or copy the file into the directory I</etc/rc.d/init.d>:

  % mv httpd_perl /etc/rc.d/init.d

Now open the script in your favorite editor and add the following lines after the main header of the script:

  # Comments to support chkconfig on RedHat Linux
  # chkconfig: 2345 91 16
  # description: mod_perl enabled Apache Server

So now the beginning of the script looks like:

  #!/bin/sh
  #
  # Apache control script designed to allow an easy command line
  # interface to controlling Apache. Written by Marc Slemko,
  # 1997/08/23

  # Comments to support chkconfig on RedHat Linux
  # chkconfig: 2345 91 16
  # description: mod_perl enabled Apache Server

  #
  # The exit codes returned are:
  # ...

Adjust the line:

  # chkconfig: 2345 91 16

to your needs. The above setting says that the script should be started in levels 2, 3, 4, and 5, that its start priority should be 91, and that its stop priority should be 16.

Now all you have to do is to ask C<chkconfig> to configure the startup scripts. Before we do that let's look at what we have:

  % find /etc/rc.d | grep httpd_perl
  /etc/rc.d/init.d/httpd_perl

Which means that we only have the startup script itself. Now we execute:

  % chkconfig --add httpd_perl

and see what has changed:

  % find /etc/rc.d | grep httpd_perl
  /etc/rc.d/init.d/httpd_perl
  /etc/rc.d/rc0.d/K16httpd_perl
  /etc/rc.d/rc1.d/K16httpd_perl
  /etc/rc.d/rc2.d/S91httpd_perl
  /etc/rc.d/rc3.d/S91httpd_perl
  /etc/rc.d/rc4.d/S91httpd_perl
  /etc/rc.d/rc5.d/S91httpd_perl
  /etc/rc.d/rc6.d/K16httpd_perl

As you can see C<chkconfig> created all the symbolic links for us, using the startup and shutdown priorities as specified in the line:

  # chkconfig: 2345 91 16

If for some reason you want to remove the service from the startup scripts, all you have to do is to tell C<chkconfig> to remove the links:

  % chkconfig --del httpd_perl

Now if we look at the files under the directory I</etc/rc.d/> we see again only the script itself:

  % find /etc/rc.d | grep httpd_perl
  /etc/rc.d/init.d/httpd_perl

Of course you may keep the startup script in any other directory as long as you can link to it.
For example if you want to keep this file with all the Apache binaries in I</usr/local/apache/bin>, all you have to do is to provide a symbolic link to this file: % ln -s /usr/local/apache/bin/apachectl /etc/rc.d/init.d/httpd_perl and then: % chkconfig --add httpd_perl Note that in case of using symlinks the link name in I</etc/rc.d/init.d> is what matters and not the name of the script the link points to. =head1 Monitoring the Server. A watchdog. With mod_perl many things can happen to your server. It is possible that the server might die when you are not around. As with any other critical service you need to run some kind of watchdog. One simple solution is to use a slightly modified C<apachectl> script, which I've named I<apache.watchdog>. Call it from the crontab every 30 minutes -- or even every minute -- to make sure the server is up all the time. The crontab entry for 30 minutes intervals: 0,30 * * * * /path/to/the/apache.watchdog >/dev/null 2>&1 The script: #!/bin/sh # this script is a watchdog to see whether the server is online # It tries to restart the server, and if it's # down it sends an email alert to admin # admin's email [EMAIL PROTECTED] # the path to your PID file PIDFILE=/usr/local/var/httpd_perl/run/httpd.pid # the path to your httpd binary, including options if necessary HTTPD=/usr/local/sbin/httpd_perl/httpd_perl # check for pidfile if [ -f $PIDFILE ] ; then PID=`cat $PIDFILE` if kill -0 $PID; then STATUS="httpd (pid $PID) running" RUNNING=1 else STATUS="httpd (pid $PID?) not running" RUNNING=0 fi else STATUS="httpd (no pid file) not running" RUNNING=0 fi if [ $RUNNING -eq 0 ]; then echo "$0 $ARG: httpd not running, trying to start" if $HTTPD ; then echo "$0 $ARG: httpd started" mail $EMAIL -s "$0 $ARG: httpd started" > /dev/null 2>&1 else echo "$0 $ARG: httpd could not be started" mail $EMAIL -s \ "$0 $ARG: httpd could not be started" > /dev/null 2>&1 fi fi Another approach, probably even more practical, is to use the cool C<LWP> Perl package to test the server by trying to fetch some document (script) served by the server. Why is it more practical? Because while the server can be up as a process, it can be stuck and not working. Failing to get the document will trigger restart, and "probably" the problem will go away. Like before we set a cronjob to call this script every few minutes to fetch some very light script. The best thing of course is to call it every minute. Why so often? If your server starts to spin and trash your disk space with multiple error messages filling the I<error_log>, in five minutes you might run out of free disk space which might bring your system to its knees. Chances are that no other child will be able to serve requests, since the system will be too busy writing to the I<error_log> file. Think big--if you are running a heavy service (which is very fast since you are running under mod_perl) adding one more request every minute will not be felt by the server at all. 
So we end up with a crontab entry like this: * * * * * /path/to/the/watchdog.pl >/dev/null 2>&1 And the watchdog itself: #!/usr/bin/perl -wT # untaint $ENV{'PATH'} = '/bin:/usr/bin'; delete @ENV{'IFS', 'CDPATH', 'ENV', 'BASH_ENV'}; use strict; use diagnostics; use URI::URL; use LWP::MediaTypes qw(media_suffix); my $VERSION = '0.01'; use vars qw($ua $proxy); $proxy = ''; require LWP::UserAgent; use HTTP::Status; ###### Config ######## my $test_script_url = 'http://www.example.com:81/perl/test.pl'; my $monitor_email = '[EMAIL PROTECTED]'; my $restart_command = '/usr/local/sbin/httpd_perl/apachectl restart'; my $mail_program = '/usr/lib/sendmail -t -n'; ###################### $ua = new LWP::UserAgent; $ua->agent("$0/watchdog " . $ua->agent); # Uncomment the proxy if you access a machine from behind a firewall # $proxy = "http://www-proxy.com"; $ua->proxy('http', $proxy) if $proxy; # If it returns '1' it means we are alive exit 1 if checkurl($test_script_url); # Houston, we have a problem. # The server seems to be down, try to restart it. my $status = system $restart_command; my $message = ($status == 0) ? "Server was down and successfully restarted!" : "Server is down. Can't restart."; my $subject = ($status == 0) ? "Attention! Webserver restarted" : "Attention! Webserver is down. can't restart"; # email the monitoring person my $to = $monitor_email; my $from = $monitor_email; send_mail($from,$to,$subject,$message); # input: URL to check # output: 1 for success, 0 for failure ####################### sub checkurl{ my ($url) = @_; # Fetch document my $res = $ua->request(HTTP::Request->new(GET => $url)); # Check the result status return 1 if is_success($res->code); # failed return 0; } # end of sub checkurl # send email about the problem ####################### sub send_mail{ my($from,$to,$subject,$messagebody) = @_; open MAIL, "|$mail_program" or die "Can't open a pipe to a $mail_program :$!\n"; print MAIL <<__END_OF_MAIL__; To: $to From: $from Subject: $subject $messagebody __END_OF_MAIL__ close MAIL; } =head1 Running a Server in Single Process Mode Often while developing new code, you will want to run the server in single process mode. See L<Sometimes it works Sometimes it does Not|guide::porting/Sometimes_it_Works__Sometimes_it_Doesn_t> and L<Names collisions with Modules and libs|guide::porting/Name_collisions_with_Modules_and_libs>. Running in single process mode inhibits the server from "daemonizing", and this allows you to run it under the control of a debugger more easily. % /usr/local/sbin/httpd_perl/httpd_perl -X When you use the C<-X> switch the server will run in the foreground of the shell, so you can kill it with I<Ctrl-C>. Note that in C<-X> (single-process) mode the server will run very slowly when fetching images. Note for Netscape users: If you use Netscape while your server is running in single-process mode, HTTP's C<KeepAlive> feature gets in the way. Netscape tries to open multiple connections and keep them open. Because there is only one server process listening, each connection has to time out before the next succeeds. Turn off C<KeepAlive> in I<httpd.conf> to avoid this effect while developing. If you use the image size parameters, Netscape will be able to render the page without the images so you can press the browser's I<STOP> button after a few seconds. In addition you should know that when running with C<-X> you will not see the control messages that the parent server normally writes to the I<error_log> (I<"server started">, I<"server stopped"> etc). 
Since C<httpd -X> causes the server to handle all requests itself, without forking any children, there is no controlling parent to write the status messages. =head1 Starting a Personal Server for Each Developer If you are the only developer working on the specific server:port you have no problems, since you have complete control over the server. However, often you will have a group of developers who need to develop mod_perl scripts and modules concurrently. This means that each developer will want to have control over the server - to kill it, to run it in single server mode, to restart it, etc., as well as having control over the location of the log files, configuration settings like C<MaxClients>, and so on. You I<can> work around this problem by preparing a few I<httpd.conf> files and forcing each developer to use httpd_perl -f /path/to/httpd.conf but I approach it in a different way. I use the C<-Dparameter> startup option of the server. I call my version of the server % http_perl -Dstas In I<httpd.conf> I write: # Personal development Server for stas # stas uses the server running on port 8000 <IfDefine stas> Port 8000 PidFile /usr/local/var/httpd_perl/run/httpd.pid.stas ErrorLog /usr/local/var/httpd_perl/logs/error_log.stas Timeout 300 KeepAlive On MinSpareServers 2 MaxSpareServers 2 StartServers 1 MaxClients 3 MaxRequestsPerChild 15 </IfDefine> # Personal development Server for userfoo # userfoo uses the server running on port 8001 <IfDefine userfoo> Port 8001 PidFile /usr/local/var/httpd_perl/run/httpd.pid.userfoo ErrorLog /usr/local/var/httpd_perl/logs/error_log.userfoo Timeout 300 KeepAlive Off MinSpareServers 1 MaxSpareServers 2 StartServers 1 MaxClients 5 MaxRequestsPerChild 0 </IfDefine> With this technique we have achieved full control over start/stop, number of children, a separate error log file, and port selection for each server. This saves Stas from getting called every few minutes by Eric: "Stas, I'm going to restart the server". In the above technique, you need to discover the PID of your parent C<httpd_perl> process, which is written in C</usr/local/var/httpd_perl/run/httpd.pid.stas> (and the same for the user I<eric>). To make things even easier we change the I<apachectl> script to do the work for us. We make a copy for each developer called B<apachectl.username> and we change two lines in each script: PIDFILE=/usr/local/var/httpd_perl/run/httpd.pid.username HTTPD='/usr/local/sbin/httpd_perl/httpd_perl -Dusername' So for the user I<stas> we prepare a startup script called I<apachectl.stas> and we change these two lines in the standard apachectl script as it comes unmodified from Apache distribution. PIDFILE=/usr/local/var/httpd_perl/run/httpd.pid.stas HTTPD='/usr/local/sbin/httpd_perl/httpd_perl -Dstas' So now when user I<stas> wants to stop the server he will execute: apachectl.stas stop And to start: apachectl.stas start Certainly the rest of the C<apachectl> arguments apply as before. You might think about having only one C<apachectl> and know who is calling by checking the UID, but since you have to be root to start the server it is not possible, unless you make the setuid bit on this script, as we've explained in the beginning of this chapter. If you do so, you can have a single C<apachectl> script for all developers, after you modify it to automatically find out the UID of the user, who executes the script and set the right paths. 
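Here is a minimal sketch of what such a unified script could look like, assuming the per-user file naming and C<-Dusername> convention used in this section (the paths are only examples). It uses the real UID to discover who invoked it and derives the right PID file and server defines from that:

  #!/usr/bin/perl -Tw
  # a minimal sketch of a single setuid control script for all developers
  use strict;

  # prevent taint warnings when running suid
  $ENV{PATH} = '/bin:/usr/bin';
  delete @ENV{qw(IFS CDPATH ENV BASH_ENV)};

  # the real UID tells us who invoked the script, even when it runs setuid root
  my $username = getpwuid($<) or die "Cannot resolve UID $<";

  # the same WORKAROUND as in the setuid startup script shown earlier
  $< = $>;

  my $pidfile = "/usr/local/var/httpd_perl/run/httpd.pid.$username";
  my $httpd   = "/usr/local/sbin/httpd_perl/httpd_perl -D$username";

  my $action = shift || '';
  if ($action eq 'start') {
      system $httpd and die "Unable to start the server for $username";
      print "Server for $username started.\n";
  }
  elsif ($action eq 'stop') {
      open PID, $pidfile or die "Cannot open $pidfile: $!";
      my ($pid) = <PID> =~ /(\d+)/ or die "No numeric PID in $pidfile";
      close PID;
      kill 'TERM', $pid or die "Unable to stop the server for $username";
      print "Server for $username stopped.\n";
  }
  else {
      die "Usage: $0 [start|stop]\n";
  }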
The last thing is to provide developers with an option to run in single process mode by: /usr/local/sbin/httpd_perl/httpd_perl -Dstas -X In addition to making life easier, we decided to use relative links everywhere in the static documents, including the calls to CGIs. You may ask how using relative links will get to the right server port. It's very simple, we use C<mod_rewrite>. To use mod_rewrite you have to configure your I<httpd_docs> server with C<--enable-module=rewrite> and recompile, or use DSO and load the module in I<httpd.conf>. In the I<httpd.conf> of our C<httpd_docs> server we have the following code: RewriteEngine on # stas's server # port = 8000 RewriteCond %{REQUEST_URI} ^/(perl|cgi-perl) RewriteCond %{REMOTE_ADDR} 123.34.45.56 RewriteRule ^(.*) http://example.com:8000/$1 [P,L] # eric's server # port = 8001 RewriteCond %{REQUEST_URI} ^/(perl|cgi-perl) RewriteCond %{REMOTE_ADDR} 123.34.45.57 RewriteRule ^(.*) http://example.com:8001/$1 [P,L] # all the rest RewriteCond %{REQUEST_URI} ^/(perl|cgi-perl) RewriteRule ^(.*) http://example.com:81/$1 [P] The IP addresses are the addresses of the developer desktop machines (where they are running their web browsers). So if an html file includes a relative URI I</perl/test.pl> or even I<http://www.example.com/perl/test.pl>, clicking on the link will be internally proxied to http://www.example.com:8000/perl/test.pl if the click has been made at the user I<stas>'s desktop machine, or to I<http://www.example.com:8001/perl/test.pl> for a request generated from the user I<eric>'s machine, per our above URI rewrite example. Another possibility is to use C<REMOTE_USER> variable if all the developers are forced to authenticate themselves before they can access the server. If you do, you will have to change the C<RewriteRule>s to match C<REMOTE_USER> in the above example. We wish to stress again, that the above setup will work only with relative URIs in the HTML code. If you choose to generate full URIs including non-80 port the requests originated from this HTML code will bypass the light server listening to the default port 80, and go directly to the I<server:port> of the full URI. =head1 Wrapper to Emulate the Server Perl Environment Often you will start off debugging your script by running it from your favorite shell program. Sometimes you encounter a very weird situation when the script runs from the shell but dies when processed as a CGI script by a web-server. The real problem often lies in the difference between the environment variables that is used by your web-server and the ones used by your shell program. For example you may have a set of non-standard Perl directories, used for local Perl modules. You have to tell the Perl interpreter where these directories are. If you don't want to modify C<@INC> in all scripts and modules, you can use a C<PERL5LIB> environment variable, to tell Perl where the directories are. But then you might forget to alter the mod_perl startup script to correct C<@INC> there as well. And if you forget this, you can be quite puzzled why the scripts are running from the shell program, but not from the web. Of course the I<error_log> will help as well to find out what the problem is, but there can be other obscure cases, where you do something different at the shell program and your scripts refuse to run under the web-server. Another example is when you have more than one version of Perl installed. 
You might call one version of the Perl executable in the script's first line (the shebang line), while the web server is compiled with another Perl version. Since mod_perl ignores the path to the Perl executable on the first line of the script, you can get quite confused when the code doesn't behave the same way when processed as a request as it does when executed from the command line. It can take a while before you realize that you are testing the scripts from the shell program using the I<wrong> Perl version.

The best debugging approach is to write a wrapper that emulates the exact environment of the server, first deleting environment variables like C<PERL5LIB> and then calling the same perl binary that is used by the server. Next, set the environment identical to the server's by copying the Perl run directives from the server startup and configuration files, or even by I<require()>'ing the startup file, as long as it doesn't use C<Apache::> modules, which are unavailable under the shell. This will also allow you to completely remove the first line of the script, since mod_perl doesn't need it anyway and the wrapper knows how to call the script.

Here is an example of such a script. Note that we force the use of C<-Tw> when we call the real script. When debugging we want to make sure that the code works with taint mode on, and we want to see all the warnings, to help Perl help us write better code. We have also added the ability to pass parameters, which does not happen when you issue a request to the script, but can be helpful at times.

  #!/usr/bin/perl -w

  # This is a wrapper example
  # It simulates the web server environment by setting @INC and other
  # stuff, so what will run under this wrapper will run under Web and
  # vice versa.
  #
  # Usage: wrap.pl some_cgi.pl
  #
  BEGIN {
    # we want to make a complete emulation, so we must reset all the
    # paths and add the standard Perl libs
    @INC =
      qw(/usr/lib/perl5/5.00503/i386-linux
         /usr/lib/perl5/5.00503
         /usr/lib/perl5/site_perl/5.005/i386-linux
         /usr/lib/perl5/site_perl/5.005
         .
        );
  }

  use strict;
  use File::Basename;

  # process the passed params
  my $cgi = shift || '';
  my $params = (@ARGV) ? join(" ", @ARGV) : '';

  die "Usage:\n\t$0 some_cgi.pl\n" unless $cgi;

  # Set the environment
  my $PERL5LIB = join ":", @INC;

  # if the path includes the directory
  # we extract it and chdir there
  if (index($cgi,'/') >= 0) {
    my $dirname = dirname($cgi);
    chdir $dirname or die "Can't chdir to $dirname: $!\n";
    $cgi =~ m|$dirname/(.*)|;
    $cgi = $1;
  }

  # run the cgi from the script's directory
  # Note that we set Warning and Taint modes ON!!!
  system qq{/usr/bin/perl -I$PERL5LIB -Tw $cgi $params};

=head1 Server Maintenance Chores

It's not enough to have your server and service up and running. You have to maintain the server even when everything seems to be fine. This includes security auditing, keeping an eye on the amount of remaining unused disk space, available RAM, the load of the system, etc.

If you forget about these chores, one day (sooner or later) your system will crash either because it has run out of free disk space, all the available CPU has been used and the system has started to swap heavily, or someone has broken in. Unfortunately the scope of this guide does not cover the latter, since it would take more than one book to cover this issue thoroughly, but the rest of these problems are quite easy to prevent if you follow our advice.
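The disk space chore, for example, lends itself to the same cron-driven watchdog style used elsewhere in this chapter. Below is a minimal sketch which parses C<df -k> output and mails a warning when any filesystem passes a usage threshold; the threshold, mail program and recipient address are assumptions you will want to adjust:

  #!/usr/bin/perl -w
  # a minimal sketch: warn by email when a filesystem is nearly full
  use strict;

  my $threshold    = 90;                        # percent used
  my $mail_program = '/usr/lib/sendmail -t -n';
  my $admin        = 'root@localhost';

  my @full;
  for (`df -k`) {
      # e.g. "/dev/hda1  2016044  1822084  91540  95% /"
      my ($fs, $capacity, $mount) = (split)[0, 4, 5];
      next unless defined $capacity and $capacity =~ /^(\d+)%$/;
      push @full, "$fs ($mount) is ${1}% full" if $1 >= $threshold;
  }

  if (@full) {
      open MAIL, "|$mail_program"
          or die "Can't open a pipe to $mail_program: $!";
      print MAIL "To: $admin\nSubject: low disk space warning\n\n",
                 join("\n", @full), "\n";
      close MAIL;
  }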
Certainly, your particular system might have maintenance chores that aren't covered here, but at least you will be alerted that these chores are real and should be taken care of. =head2 Handling Log Files There are two issues to solve with log files. First they should be rotated and compressed on the constant basis, since they tend to use big parts of the disk space over time. Second these should be monitored for possible sudden explosive growth rates, when something goes astray in your code running at the mod_perl server and the process starts to log thousands of error messages in second without stopping, until all the disk space is used, and the server cannot work anymore. =head3 Log Rotation The first issue is solved by having a process run by crontab at certain times (usually off hours, if this term is still valid in the Internet era) and rotate the logs. The log rotation includes the current log file renaming, server restart (which creates a fresh new log file), and renamed file compression and/or moving it on a different disk. For example if we want to rotate the I<access_log> file we could do: % mv access_log access_log.renamed % apachectl restart % sleep 5; # allow all children to complete requests and logging # now it's safe to use access_log.renamed % mv access_log.renamed /some/directory/on/another/disk This is the script that we run from the crontab to rotate the log files: #!/usr/local/bin/perl -Tw # This script does log rotation. Called from crontab. use strict; $ENV{PATH}='/bin:/usr/bin'; ### configuration my @logfiles = qw(access_log error_log); umask 0; my $server = "httpd_perl"; my $logs_dir = "/usr/local/var/$server/logs"; my $restart_command = "/usr/local/sbin/$server/apachectl restart"; my $gzip_exec = "/usr/bin/gzip"; my ($sec,$min,$hour,$mday,$mon,$year) = localtime(time); my $time = sprintf "%0.4d.%0.2d.%0.2d-%0.2d.%0.2d.%0.2d", $year+1900,++$mon,$mday,$hour,$min,$sec; $^I = ".$time"; # rename log files chdir $logs_dir; @ARGV = @logfiles; while (<>) { close ARGV; } # now restart the server so the logs will be restarted system $restart_command; # allow all children to complete requests and logging sleep 5; # compress log files foreach (@logfiles) { system "$gzip_exec $_.$time"; } Note: Setting C<$^I> sets the in-place edit flag to a dot followed by the time. We copy the names of the logfiles into C<@ARGV>, and open each in turn and immediately close them without doing any changes; but because the in-place edit flag is set they are effectively renamed. As you see the rotated files will include the date and the time in their filenames. Here is a more generic set of scripts for log rotation. 
A cron job fires off a setuid script called I<log-roller> that looks like this:

  #!/usr/bin/perl -Tw
  use strict;
  use File::Basename;

  $ENV{PATH} = "/usr/ucb:/bin:/usr/bin";

  my $ROOT = "/WWW/apache";           # names are relative to this
  my $CONF = "$ROOT/conf/httpd.conf"; # master conf
  my $MIDNIGHT = "MIDNIGHT";          # name of program in each logdir

  my ($user_id, $group_id, $pidfile); # will be set during parse of conf
  die "not running as root" if $>;

  chdir $ROOT or die "Cannot chdir $ROOT: $!";

  my %midnights;
  open CONF, "<$CONF" or die "Cannot open $CONF: $!";
  while (<CONF>) {
    if (/^User (\w+)/i) { $user_id = getpwnam($1); next; }
    if (/^Group (\w+)/i) { $group_id = getgrnam($1); next; }
    if (/^PidFile (.*)/i) { $pidfile = $1; next; }
    next unless /^ErrorLog (.*)/i;
    my $midnight = (dirname $1)."/$MIDNIGHT";
    next unless -x $midnight;
    $midnights{$midnight}++;
  }
  close CONF;

  die "missing User definition" unless defined $user_id;
  die "missing Group definition" unless defined $group_id;
  die "missing PidFile definition" unless defined $pidfile;

  open PID, $pidfile or die "Cannot open $pidfile: $!";
  <PID> =~ /(\d+)/;
  my $httpd_pid = $1;
  close PID;
  die "missing pid definition" unless defined $httpd_pid and $httpd_pid;
  kill 0, $httpd_pid or die "cannot find pid $httpd_pid: $!";

  for (sort keys %midnights) {
    defined(my $pid = fork) or die "cannot fork: $!";
    if ($pid) {
      ## parent:
      waitpid $pid, 0;
    } else {
      my $dir = dirname $_;
      ($(,$)) = ($group_id,$group_id);
      ($<,$>) = ($user_id,$user_id);
      chdir $dir or die "cannot chdir $dir: $!";
      exec "./$MIDNIGHT";
      die "cannot exec $MIDNIGHT: $!";
    }
  }

  kill 1, $httpd_pid or die "Cannot SIGHUP $httpd_pid: $!";

And then individual C<MIDNIGHT> scripts can look like this:

  #!/usr/bin/perl -Tw
  use strict;
  die "bad guy" unless getpwuid($<) =~ /^(root|nobody)$/;
  my @LOGFILES = qw(access_log error_log);
  umask 0;
  $^I = ".".time;
  @ARGV = @LOGFILES;
  while (<>) {
    close ARGV;
  }

Can you spot the security holes? Take your time... This code shouldn't be used in hostile situations.

=head3 Non-Scheduled Emergency Log Rotation

As we have mentioned before, there are times when the web server goes wild and starts to log lots of messages to the I<error_log> file non-stop. If no one monitors this, it is possible that in a few minutes all the free disk space will be filled and no process will be able to work normally. When this happens, the I/O the faulty server causes is so heavy that its sibling processes cannot serve requests.

Generally this is not the case, but a few people have reported encountering this problem. If you are one of these people, you should run a monitoring program that checks the log file size and, if it notices that the file has grown too large, attempts to restart the server and probably trims the log file.

When we were using a quite old mod_perl version, we sometimes had bursts of the error I<Callback called exit> showing up in our I<error_log>. The file could grow to 300 Mbytes in a few minutes.

Below is an example of a script that can be executed from the crontab to handle situations like this. The cron job should run every few minutes or even every minute, since if you experience this problem you know that log files fill up very fast. The example script will rotate the I<error_log> when it grows over 100K. Note that this script is useful even when you have the normal scheduled log rotation facility working; remember that this one is an emergency solver and is not to be used for routine log rotation.
  emergency_rotate.sh
  -------------------
  #!/bin/sh
  S=`ls -s /usr/local/apache/logs/error_log | awk '{print $1}'`
  if [ "$S" -gt 100000 ] ; then
      mv /usr/local/apache/logs/error_log /usr/local/apache/logs/error_log.old
      /etc/rc.d/init.d/httpd restart
      date | /bin/mail -s "error_log $S kB on inx" [EMAIL PROTECTED]
  fi

Of course you could write a more advanced script, using timestamps and other bells and whistles. This example merely illustrates how to solve the problem in question.

Another solution is to use off-the-shelf tools that are written for this purpose. The C<daemontools> package (ftp://koobera.math.uic.edu/www/daemontools.html) includes a utility called C<multilog>. This utility saves the stdin stream to one or more log files. It optionally timestamps each line and, for each log, includes or excludes lines matching specified patterns. It automatically rotates logs to limit the amount of disk space used. If the disk fills up, it pauses and tries again, without losing any data.

The obvious caveat is that it doesn't restart the server, so while it solves the log file handling problem it doesn't deal with the originator of the problem. But since the I/O of the faulty Apache process writing the log will be quite heavy, the rest of the servers will work very slowly, if at all, and a normal watchdog should detect this abnormal situation and restart the Apache server.

=head1 Swapping Prevention

Before I delve into the details of the swapping process, let's refresh our knowledge of memory components and memory management.

The computer memory is called RAM, which stands for Random Access Memory. Reading and writing to RAM is a few orders of magnitude faster than doing the same operations on a hard disk, since the former uses non-movable memory cells while the latter uses rotating magnetic media.

On most operating systems swap memory is used as an extension of RAM and not as a duplication of it. So if your OS is one of those, and you have 128MB of RAM and a 256MB swap partition, you have a total of 384MB of memory available. You should never count on the extra memory when you decide on the maximum number of processes to be run, and I will show why in a moment.

The swap memory can be built from a number of hard disk partitions and swap files formatted for use as swap memory. When you need more swap memory you can always extend it on demand as long as you have some free disk space (for more information see the I<mkswap> and I<swapon> manpages).

System memory is quantified in units called memory pages. Usually the size of a memory page is between 1KB and 8KB. So if you have 256MB of RAM installed on your machine and the page size is 4KB, your system has about 65,000 main memory pages to work with, and these pages are fast. If you also have a 256MB swap partition, the system can use yet another 65,000 memory pages, but they are much slower.

When the system is started, all memory pages are available for use by the programs (processes). Unless the program is really small, the process running this program uses only a few segments of the program, each segment mapped onto its own memory page. Therefore only a few memory pages need to be loaded into memory initially.

When the process needs an additional segment of the program to be loaded into memory, it asks the system whether the page containing this segment is already loaded.
If the page is not found, an event known as a I<page fault> occurs. It requires the system to allocate a free memory page, go to the disk, and read the requested segment of the program into the allocated memory page.

If a process needs to bring a new page into physical memory and there are no free physical pages available, the operating system must make room for this page by discarding another page from physical memory.

If the page to be discarded from physical memory came from an image or data file and has not been written to, then the page does not need to be saved. Instead it can be discarded, and if the process needs that page again it can be brought back into memory from the image or data file.

However, if the page has been modified, the operating system must preserve the contents of that page so that it can be accessed at a later time. This type of page is known as a I<dirty page>, and when it is removed from memory it is saved in a special sort of file called the swap file. This process is referred to as I<swapping out>.

Accesses to the swap file are very slow relative to the speed of the processor and physical memory, and the operating system must juggle the need to write pages to disk with the need to retain them in memory to be used again.

To improve the swapping-out process, i.e. to decrease the chance that a page which has just been swapped out will be needed again the next moment, the LRU (least recently used) or a similar algorithm is used.

To summarize the two swapping scenarios: discarding read-only pages incurs no overhead, in contrast to discarding data pages that have been written to, since in the latter case the pages have to be written to a swap partition located on the slow disk. Therefore your machine's overall performance will be much better if fewer memory pages can become dirty.

But here is the problem: Perl is a language with no strong data types, which means that both the program code and the program data are seen by the OS as data pages, since both are mapped to the same kind of memory pages. Therefore a big chunk of your Perl code becomes dirty when its variables are modified, and when those pages need to be discarded they have to be written to the swap partition.

This leads us to two important conclusions about swapping and Perl.

=over

=item * Running your system with no free main memory available hinders performance, because processes' memory pages have to be discarded and then reread from disk again and again.

=item * Since the majority of the running code is Perl code, in addition to the overhead of reading in the previously discarded pages, there is the overhead of writing dirty pages out to the swap partition.

=back

When the system has to swap memory pages in and out, it slows down, not serving the processes as fast as before. This leads to an accumulation of processes waiting for their turn to run, which further increases processing demands, which in turn slows down the system even more as more memory is required. This ever-worsening spiral will bring the machine to a halt, unless the resource demand suddenly drops and allows the processes to catch up with their tasks and return to normal memory usage.

In addition it's important to know that, for better performance, most programs, particularly programs written in Perl, on most modern OSs don't return memory pages to the OS while they are running.
If some of the memory gets freed, it is reused when needed by the process, without the additional overhead of asking the system to allocate new memory pages. That's why you will observe that Perl programs grow in size as they run and almost never shrink.

When the process quits, it returns its memory pages to the pool of freely available pages for other processes to use.

This scenario is certainly educational, and it should now be obvious that the system running your web server should never swap. It's absolutely normal for your desktop machine to start swapping: you will notice it immediately, since things slow down and sometimes the system freezes for short periods; and, as I've just mentioned, you can stop starting new programs and quit some running ones, thus allowing the system to catch up with the load and go back to using just RAM. In the case of the web server you have much less control, since it's your users who load the machine by issuing requests to your server.

Therefore you should configure the server so that the maximum number of possible processes is small enough, using the C<MaxClients> directive (for the technique for choosing the right C<MaxClients> refer to the section 'L<Choosing MaxClients|guide::performance/Choosing_MaxClients>'). This will ensure that at peak hours the system won't swap. Remember that swap space is an emergency pool, not a resource to be used routinely. If you are low on memory and you badly need it, buy more, or reduce the number of processes to prevent swapping.

However, sometimes, due to faulty code, a process might start spinning in an unconstrained loop, consuming all the available RAM and then heavily using swap memory. In such a situation it helps to have a big emergency pool (i.e. lots of swap memory), but you have to resolve the problem as soon as possible since this pool won't last for long. In the meantime the C<Apache::Resource> module can be handy.

For swapping monitoring techniques see the section 'L<Apache::VMonitor -- Visual System and Apache Server Monitor|guide::debug/Apache__VMonitor____Visual_System_and_Apache_Server_Monitor>'.

=head1 Preventing mod_perl Processes From Going Wild

Sometimes people report that code running under mod_perl has caused all the RAM or all the disk space to be used up. The following tips should help you prevent these problems before they hit you, if they ever do.

=head2 All RAM Consumed

Sometimes calling an undefined subroutine in a module can cause a tight loop that consumes all the available memory. Here is a way to catch such errors. Define an C<UNIVERSAL::AUTOLOAD> subroutine in your I<startup.pl>, or in a C<E<lt>PerlE<gt>> ... C<E<lt>/PerlE<gt>> section in your I<httpd.conf> file:

  sub UNIVERSAL::AUTOLOAD {
    my $class = shift;
    warn "$class can't \$UNIVERSAL::AUTOLOAD=$UNIVERSAL::AUTOLOAD!\n";
  }

I prefer the I<httpd.conf> approach. Putting it in all your mod_perl modules would be redundant (and might give you compile-time errors).

This will produce a nice error in I<error_log>, giving the line number of the call and the name of the undefined subroutine.

=head1 Maintainers

Maintainer is the person(s) you should contact with updates, corrections and patches.

=over

=item * Stas Bekman E<lt>stas (at) stason.orgE<gt>

=back

=head1 Authors

=over

=item * Stas Bekman E<lt>stas (at) stason.orgE<gt>

=back

Only the major authors are listed above. For contributors see the Changes file.
=cut

1.1                  modperl-docs/src/docs/general/hardware.pod

Index: hardware.pod
===================================================================

=head1 NAME

Choosing an Operating System and Hardware

=head1 Description

Before you use the techniques documented on this site to tune servers and write code, you need to consider the demands which will be placed on the hardware and the operating system. There is no point in investing a lot of time and money in configuration and coding only to find that your server's performance is poor because you did not choose a suitable platform in the first place.

While the tips below could apply to many web servers, they are aimed primarily at administrators of mod_perl-enabled Apache servers.

Because hardware platforms and operating systems are developing rapidly (even while you are reading this document), this discussion must be in general terms.

=head1 Choosing an Operating System

First, let's talk about Operating Systems (OSs).

Most of the time I prefer to use Linux or something from the *BSD family. Although I am personally a Linux devotee, I do not want to start yet another OS war. I will try to talk about the characteristics and features you should be looking for to support an Apache/mod_perl server; once you know what you want from your OS, you can go out and find it. Visit the Web sites of the operating systems you are interested in. You can gauge users' opinions by searching the relevant discussions in newsgroup and mailing list archives. Deja - http://deja.com and eGroups - http://egroups.com are good examples. I will leave this fan research to the reader.

=head2 Stability and Robustness

Probably the most important features of an OS are stability and robustness. You are in an Internet business. You do not keep normal 9am to 5pm working hours like many conventional businesses you know. You are open 24 hours a day. You cannot afford to be off-line, or your customers will go and shop at another service like yours (unless you have a monopoly :). If the OS of your choice crashes every day, first do a little investigation. There might be a simple reason which you can find and fix. However, there are OSs which won't work unless you reboot them twice a day. You don't want to use an OS of this kind, no matter how good the OS vendor's sales department is. Do not follow flashy advertisements; follow developers' advice instead.

Generally, people who have used an OS for some time can tell you a lot about its stability. Ask them. Try to find people who are doing similar things to what you are planning to do; they may even be using the same software. There are often compatibility issues to resolve. You may need to become familiar with patching and compiling your OS. It's easy.

=head2 Memory Management

You want an OS with good memory management; some OSs are well known as memory hogs. The same code can use twice as much memory on one OS compared to another. If the size of the mod_perl process is 10Mb and you have tens of these running, it definitely adds up!

=head2 Memory Leaks

Some OSs and/or their libraries (e.g. C runtime libraries) suffer from memory leaks. A leak is when some process requests a chunk of memory for temporary storage but then does not subsequently release it. The chunk of memory is then not available for any purpose until the process which requested it dies. We cannot afford such leaks. A single mod_perl process sometimes serves thousands of requests before it terminates, so if a leak occurs on every request, the memory demands could become huge.
Of course our code can be the cause of the memory leaks as well (check out the C<Apache::Leak> module on CPAN). Certainly, we can reduce the number of requests to be served over the process' life, but that can degrade performance. =head2 Sharing Memory We want an OS with good memory sharing capabilities. As we have seen, if we preload the modules and scripts at server startup, they are shared between the spawned children (at least for a part of a process' life - memory pages can become "dirty" and cease to be shared). This feature can reduce memory consumption a lot! =head2 Cost and Support If we are in a big business we probably do not mind paying another $1000 for some fancy OS with bundled support. But if our resources are low, we will look for cheaper and free OSs. Free does not mean bad, it can be quite the opposite. Free OSs can have the best support we can find. Some do. It is very easy to understand - most of the people are not rich and will try to use a cheaper or free OS first if it does the work for them. Since it really fits their needs, many people keep using it and eventually know it well enough to be able to provide support for others in trouble. Why would they do this for free? One reason is for the spirit of the first days of the Internet, when there was no commercial Internet and people helped each other, because someone helped them in first place. I was there, I was touched by that spirit and I am keen to keep that spirit alive. But, let's get back to our world. We are living in material world, and our bosses pay us to keep the systems running. So if you feel that you cannot provide the support yourself and you do not trust the available free resources, you must pay for an OS backed by a company, and blame them for any problem. Your boss wants to be able to sue someone if the project has a problem caused by the external product that is being used in the project. If you buy a product and the company selling it claims support, you have someone to sue or at least to put the blame on. If we go with Open Source and it fails we do not have someone to sue... wrong--in the last years many companies have realized how good the Open Source products are and started to provide an official support for these products. So your boss cannot just dismiss your suggestion of using an Open Source Operating System. You can get a paid support just like with any other commercial OS vendor. Also remember that the less money you spend on OS and Software, the more you will be able to spend on faster and stronger hardware. =head2 Discontinued Products The OSs in this hazard group tend to be developed by a single company or organization. You might find yourself in a position where you have invested a lot of time and money into developing some proprietary software that is bundled with the OS you chose (say writing a mod_perl handler which takes advantage of some proprietary features of the OS and which will not run on any other OS). Things are under control, the performance is great and you sing with happiness on your way to work. Then, one day, the company which supplies your beloved OS goes bankrupt (not unlikely nowadays), or they produce a newer incompatible version and they will not support the old one (happens all the time). You are stuck with their early masterpiece, no support and no source code! What are you going to do? Invest more money into porting the software to another OS... 
Everyone can be hit by this mini-disaster so it is better to check the background of the company when making your choice. Even so you never know what will happen tomorrow - in 1980, a company called Tektronix did something similar to one of the Guide reviewers with its microprocessor development system. The guy just had to buy another system. He didn't buy it from Tektronix, of course. The second system never really worked very well and the firm he bought it from went bust before they ever got around to fixing it. So in 1982 he wrote his own microprocessor development system software. It didn't take long, it works fine, and he's still using it 18 years later. Free and Open Source OSs are probably less susceptible to this kind of problem. Development is usually distributed between many companies and developers, so if a person who developed a really important part of the kernel lost interest in continuing, someone else will pick the falling flag and carry on. Of course if tomorrow some better project shows up, developers might migrate there and finally drop the development: but in practice people are often given support on older versions and helped to migrate to current versions. Development tends to be more incremental than revolutionary, so upgrades are less traumatic, and there is usually plenty of notice of the forthcoming changes so that you have time to plan for them. Of course with the Open Source OSs you can have the source! So you can always have a go yourself, but do not under-estimate the amounts of work involved. There are many, many man-years of work in an OS. =head2 OS Releases Actively developed OSs generally try to keep pace with the latest technology developments, and continually optimize the kernel and other parts of the OS to become better and faster. Nowadays, Internet and networking in general are the hottest topics for system developers. Sometimes a simple OS upgrade to the latest stable version can save you an expensive hardware upgrade. Also, remember that when you buy new hardware, chances are that the latest software will make the most of it. If a new product supports an old one by virtue of backwards compatibility with previous products of the same family, you might not reap all the benefits of the new product's features. Perhaps you get almost the same functionality for much less money if you were to buy an older model of the same product. =head1 Choosing Hardware Sometimes the most expensive machine is not the one which provides the best performance. Your demands on the platform hardware are based on many aspects and affect many components. Let's discuss some of them. In the discussion we use terms that may be unfamiliar to some readers: =over 4 =item * Cluster - a group of machines connected together to perform one big or many small computational tasks in a reasonable time. Clustering can also be used to provide 'fail-over' where if one machine fails its processes are transferred to another without interruption of service. And you may be able to take one of the machines down for maintenance (or an upgrade) and keep your service running - the main server will simply not dispatch the requests to the machine that was taken down. =item * Load balancing - users are given the name of one of your machines but perhaps it cannot stand the heavy load. You can use a clustering approach to distribute the load over a number of machines. The central server, which users access initially when they type the name of your service, works as a dispatcher. 
It just redirects requests to other machines. Sometimes the central server also collects the results and returns them to the users. You get the advantages of clustering too.

There are many load balancing techniques. (See L<High-Availability Linux Project|guide::download/High_Availability_Linux_Project> for more info.)

=item * NIC - Network Interface Card. A hardware component that allows you to connect your machine to the network. It sends and receives packets; newer cards can also encrypt and decrypt packets and perform digital signing and verification of them. NICs come in different speed categories, varying from 10Mbps to 10Gbps and faster. The most common type of NIC is the one that implements the Ethernet networking protocol.

=item * RAM - Random Access Memory. It's the memory that you have in your computer. (Comes in units of 8Mb, 16Mb, 64Mb, 256Mb, etc.)

=item * RAID - Redundant Array of Inexpensive Disks. An array of physical disks, usually treated by the operating system as one single disk, and often forced to appear that way by the hardware. The reason for using RAID is often simply to achieve a high data transfer rate, but it may also be to get adequate disk capacity or high reliability. Redundancy means that the system is capable of continued operation even if a disk fails. There are various types of RAID array and several different approaches to implementing them. Some systems provide protection against failure of more than one drive, and some (`hot-swappable') systems allow a drive to be replaced without even stopping the OS. See for example the Linux `HOWTO' documents Disk-HOWTO, Module-HOWTO and Parallel-Processing-HOWTO.

=back

=head2 Machine Strength Demands According to Expected Site Traffic

If you are building a fan site and you want to amaze your friends with a mod_perl guest book, any old 486 machine could do it. But if you are in a serious business, it is very important to build a scalable server. If your service is successful and becomes popular, the traffic could double every few days, and you should be ready to add more resources to keep up with the demand. While we could define web server scalability more precisely, the important thing is to make sure that you can add more power to your web server(s) without investing much additional money in software development (you will need a little software effort to connect your servers, if you add more of them). This means that you should choose hardware and OSs that can talk to other machines and become part of a cluster.

On the other hand, if you prepare for a lot of traffic and buy a monster to do the work for you, what happens if your service doesn't prove to be as successful as you thought it would be? Then you've spent too much money, and meanwhile faster processors and other hardware components have been released, so you lose.

Wisdom and prophecy, that's all it takes :)

=head3 Single Strong Machine vs Many Weaker Machines

Let's start with the claim that a four-year-old processor is still very powerful and can be put to good use. Now let's say that for a given amount of money you can probably buy either one new very strong machine or about ten older but very cheap machines. I claim that with ten old machines connected into a cluster, and by deploying load balancing, you will be able to serve about five times more requests than with one single new machine.

Why is that? Because generally the performance improvement on a new machine is marginal while the price is much higher.
Ten machines will do faster disk I/O than one single machine, even if the new disk is quite a bit faster. Yes, you have more administration overhead, but there is a chance you will have it anyway, for in a short time the new machine you have just bought might not stand the load. Then you will have to purchase more equipment and think about how to implement load balancing and web server file system distribution anyway.

Why am I so convinced? Look at the busiest services on the Internet: search engines, web-based email servers and the like -- most of them use a clustering approach. You may not always notice it, because they hide the real implementation behind proxy servers.

=head2 Internet Connection

You have the best hardware you can get, but the service is still crawling. Make sure you have a fast Internet connection. Not as fast as your ISP claims it to be, but as fast as it should be. The ISP might have a very good connection to the Internet, but put many clients on the same line. If these are heavy clients, your traffic will have to share the same line and your throughput will suffer. Think about a dedicated connection and make sure it is truly dedicated. Don't trust the ISP, check it!

The idea of having a connection to B<The Internet> is a little misleading. Many Web hosting and co-location companies have large amounts of bandwidth, but still have poor connectivity. The public exchanges, such as MAE-East and MAE-West, frequently become overloaded, yet many ISPs depend on these exchanges.

Private peering means that providers can exchange traffic much more quickly.

Also, if your Web site is of global interest, check that the ISP has good global connectivity. If the Web site is going to be visited mostly by people in a certain country or region, your server should probably be located there.

Bad connectivity can directly influence your machine's performance. Here is a story one of the developers told on the mod_perl mailing list:

  What relationship has 10% packet loss on one upstream provider got
  to do with machine memory ?

  Yes.. a lot. For a nightmare week, the box was located downstream of
  a provider who was struggling with some serious bandwidth problems
  of his own... people were connecting to the site via this link, and
  packet loss was such that retransmits and tcp stalls were keeping
  httpd heavies around for much longer than normal.. instead of
  blasting out the data at high or even modem speeds, they would be
  stuck at 1k/sec or stalled out... people would press stop and
  refresh, httpds would take 300 seconds to timeout on writes to
  no-one.. it was a nightmare. Those problems didn't go away till I
  moved the box to a place closer to some decent backbones.

  Note that with a proxy, this only keeps a lightweight httpd tied
  up, assuming the page is small enough to fit in the buffers. If
  you are a busy internet site you always have some slow clients.
  This is a difficult thing to simulate in benchmark testing, though.

=head2 I/O Performance

If your service is I/O bound (does a lot of read/write operations to disk) you need a very fast disk, especially if you run a relational database, which is one of the main creators of I/O streams. So you should not spend the money on a fancy video card and monitor! A cheap card and a 14" monochrome monitor are perfectly adequate for a Web server; you will probably access it by C<telnet> or C<ssh> most of the time anyway. Look for disks with the best price/performance ratio. Of course, ask around and avoid disks that have a reputation for headcrashes and other disasters.
You must think about RAID or similar systems if you have an enormous data set to serve (what is an enormous data set nowadays? Gigabytes, Terabytes?) or you expect a really big web traffic. Ok, you have a fast disk, what's next? You need a fast disk controller. There may be one embedded on your computer's motherboard. If the controller is not fast enough you should buy a faster one. Don't forget that it may be necessary to disable the original controller. =head2 Memory Memory should be well tested. Many memory test programs are practically useless. Running a busy system for a few weeks without ever shutting it down is a pretty good memory test. If you increase the amount of RAM on a well-tested box, use well-tested RAM. How much RAM do you need? Nowadays, the chances are that you will hear: "Memory is cheap, the more you buy the better". But how much is enough? The answer is pretty straightforward: I<you do not want your machine to swap>. When the CPU needs to write something into memory, but memory is already full, it takes the least frequently used memory pages and swaps them out to disk. This means you have to bear the time penalty of writing the data to disk. If another process then references some of the data which happens to be on one of the pages that has just been swapped out, the CPU swaps it back in again, probably swapping out some other data that will be needed very shortly by some other process. Carried to the extreme, the CPU and disk start to I<thrash> hopelessly in circles, without getting any real work done. The less RAM there is, the more often this scenario arises. Worse, you can exhaust swap space as well, and then your troubles really start... How do you make a decision? You know the highest rate at which your server expects to serve pages and how long it takes on average to serve one. Now you can calculate how many server processes you need. If you know the maximum size your servers can grow to, you know how much memory you need. If your OS supports L<memory sharing|guide::hardware/Sharing_Memory>, you can make best use of this feature by preloading the modules and scripts at server startup, and so you will need less memory than you have calculated. Do not forget that other essential system processes need memory as well, so you should plan not only for the Web server, but also take into account the other players. Remember that requests can be queued, so you can afford to let your client wait for a few moments until a server is available to serve it. Most of the time your server will not have the maximum load, but you should be ready to bear the peaks. You need to reserve at least 20% of free memory for peak situations. Many sites have crashed a few moments after a big scoop about them was posted and an unexpected number of requests suddenly came in. (This is called the Slashdot effect, which was born at http://slashdot.org ). If you are about to announce something cool, be aware of the possible consequences. =head2 CPU Make sure that the CPU is operating within its specifications. Many boxes are shipped with incorrect settings for CPU clock speed, power supply voltage etc. Sometimes a cooling fan is not fitted. It may be ineffective because a cable assembly fouls the fan blades. Like faulty RAM, an overheating processor can cause all kinds of strange and unpredictable things to happen. Some CPUs are known to have bugs which can be serious in certain circumstances. Try not to get one of them. 
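Before moving on to bottlenecks, here is a back-of-the-envelope sketch of the memory calculation described in the Memory section above. All the numbers in it are invented assumptions used only for illustration; substitute figures measured on your own server before trusting any of them.

  #!/usr/bin/perl -w
  use strict;

  # assumed, made-up figures -- measure your own server instead
  my $peak_requests_per_sec = 30;    # expected peak request rate
  my $avg_response_time     = 0.5;   # seconds to serve one request
  my $child_size            = 10;    # MB used by one mod_perl child
  my $shared_size           = 6;     # MB of that shared with the parent
  my $system_reserve        = 128;   # MB for the OS and other daemons

  # how many children must run concurrently to sustain the peak rate
  my $children = $peak_requests_per_sec * $avg_response_time;

  # unshared memory per child, plus the shared copy counted once
  my $needed = $children * ($child_size - $shared_size)
             + $shared_size
             + $system_reserve;

  # leave about 20% of free memory for peak situations, as suggested above
  printf "children: %d, plan for about %d MB of RAM\n",
      $children, $needed * 1.2;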
=head2 Bottlenecks You might use the most expensive components, but still get bad performance. Why? Let me introduce an annoying word: bottleneck. A machine is an aggregate of many components. Almost any one of them may become a bottleneck. If you have a fast processor but a small amount of RAM, the RAM will probably be the bottleneck. The processor will be under-utilized, usually it will be waiting for the kernel to swap the memory pages in and out, because memory is too small to hold the busiest pages. If you have a lot of memory, a fast processor, a fast disk, but a slow disk controller, the disk controller will be the bottleneck. The performance will still be bad, and you will have wasted money. Use a fast NIC that does not create a bottleneck. They are cheap. If the NIC is slow, the whole service is slow. This is a most important component, since webservers are much more often network-bound than they are disk-bound! =head3 Solving Hardware Requirement Conflicts It may happen that the combination of software components which you find yourself using gives rise to conflicting requirements for the optimization of tuning parameters. If you can separate the components onto different machines you may find that this approach (a kind of clustering) solves the problem, at much less cost than buying faster hardware, because you can tune the machines individually to suit the tasks they should perform. For example if you need to run a relational database engine and mod_perl server, it can be wise to put the two on different machines, since while RDBMS need a very fast disk, mod_perl processes need lots of memory. So by placing the two on different machines it's easy to optimize each machine at separate and satisfy the each software components requirements in the best way. =head2 Conclusion To use your money optimally you have to understand the hardware very well, so you will know what to pick. Otherwise, you should hire a knowledgeable hardware consultant and employ them on a regular basis, since your needs will probably change as time goes by and your hardware will likewise be forced to adapt as well. =head1 Maintainers Maintainer is the person(s) you should contact with updates, corrections and patches. =over =item * Stas Bekman E<lt>stas (at) stason.orgE<gt> =back =head1 Authors =over =item * Stas Bekman E<lt>stas (at) stason.orgE<gt> =back Only the major authors are listed above. For contributors see the Changes file. =cut 1.1 modperl-docs/src/docs/general/multiuser.pod Index: multiuser.pod =================================================================== =head1 NAME mod_perl for ISPs. mod_perl and Virtual Hosts =head1 Description mod_perl hosting by ISPs: fantasy or reality? This section covers some topics that might be of interest to users looking for ISPs to host their mod_perl-based website, and ISPs looking for a way to provide such services. Today, it is a reality: there are a number of ISPs hosting mod_perl, although the number of these is not as big as we would have liked it to be. To see a list of ISPs that can provide mod_perl hosting, see L<ISPs supporting mod_perl|help::isps>. =head1 ISPs providing mod_perl services - a fantasy or a reality =over 4 =item * You installed mod_perl on your box at home, and you fell in love with it. So now you want to convert your CGI scripts (which currently are running on your favorite ISPs machine) to run under mod_perl. Then you discover that your ISP has never heard of mod_perl, or he refuses to install it for you. 
=item * You are an old sailor in the ISP business, you have seen it all, you know how many ISPs are out there and you know that the sales margins are too low to keep you happy. You are looking for some new service almost no one else provides, to attract more clients to become your users and hopefully to have a bigger slice of the action than your competitors. =back If you are a user asking for a mod_perl service or an ISP considering to provide this service, this section should make things clear for both of you. An ISP has three choices: =over 4 =item 1 ISPs probably cannot let users run scripts under mod_perl on the main server. There are many reasons for this: Scripts might leak memory, due to sloppy programming. There will not be enough memory to run as many servers as required, and clients will be not satisfied with the service because it will be slower. The question of file permissions is a very important issue: any user who is allowed to write and run a CGI script can at least read (if not write) any other files that belong to the same user and/or group the web server is running as. Note that L<it's impossible to run C<suEXEC> and C<cgiwrap> extensions under mod_perl 1.x|guide::install/Is_it_possible_to_run_mod_perl_enabled_Apache_as_suExec_>. Another issue is the security of the database connections. If you use C<Apache::DBI>, by hacking the C<Apache::DBI> code you can pick a connection from the pool of cached connections even if it was opened by someone else and your scripts are running on the same web server. Yet another security issue is a potential compromise of the systems via user's code running on the webservers. One of the possible solutions here is to use chroot(1) or jail(8) mechanisms which allow to run subsystems isolated from the main system. So if a subsystem gets compromised the whole system is still safe. There are many more things to be aware of so at this time you have to say I<No>. Of course as an ISP you can run mod_perl internally, without allowing your users to map their scripts so that they will run under mod_perl. If as a part of your service you provide scripts such as guest books, counters etc. which are not available for user modification, you can still can have these scripts running very fast. =item 2 But, hey why can't I let my users run their own servers, so I can wash my hands of them and don't have to worry about how dirty and sloppy their code is (assuming that the users are running their servers under their own usernames, to prevent them from stealing code and data from each other). This option is fine as long as you are not concerned about your new systems resource requirements. If you have even very limited experience with mod_perl, you know that mod_perl enabled Apache servers while freeing up your CPU and allowing you to run scripts very much faster, have huge memory demands (5-20 times that of plain Apache). The size depends on the code length, the sloppiness of the programming, possible memory leaks the code might have and all that multiplied by the number of children each server spawns. A very simple example: a server, serving an average number of scripts, demanding 10Mb of memory which spawns 10 children, already raises your memory requirements by 100Mb (the real requirement is actually much smaller if your OS allows code sharing between processes and programmers exploit these features in their code). Now multiply the average required size by the number of server users you intend to have and you will get the total memory requirement. 
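The same arithmetic as a tiny Perl sketch (all figures here are assumptions used only for illustration):

  #!/usr/bin/perl -w
  use strict;

  my $child_size = 10;   # MB demanded by one mod_perl child (worst case)
  my $children   = 10;   # children spawned by one user's server
  my $users      = 20;   # mod_perl users you plan to host

  my $per_user = $child_size * $children;   # 100 MB per user
  my $total    = $per_user * $users;        # 2000 MB for all users

  print "per user: ${per_user}MB, total: ${total}MB\n";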
Since ISPs never say I<No>, you'd better take the inverse approach - think of the largest memory size you can afford then divide it by one user's requirements as I have shown in this example, and you will know how many mod_perl users you can afford :) But you cannot tell how much memory your users may use? Their requirements from a single server can be very modest, but do you know how many servers they will run? After all, they have full control of I<httpd.conf> - and it has to be this way, since this is essential for the user running mod_perl. All this rumbling about memory leads to a single question: is it possible to prevent users from using more than X memory? Or another variation of the question: assuming you have as much memory as you want, can you charge users for their average memory usage? If the answer to either of the above questions is I<Yes>, you are all set and your clients will prize your name for letting them run mod_perl! There are tools to restrict resource usage (see for example the man pages for C<ulimit(3)>, C<getrlimit(2)>, C<setrlimit(2)> and C<sysconf(3)>, the last three have the corresponding Perl modules: C<BSD::Resource> and C<Apache::Resource>). [ReaderMETA]: If you have experience with other resource limiting techniques please share it with us. Thank you! If you have chosen this option, you have to provide your client with: =over 4 =item * Shutdown and startup scripts installed together with the rest of your daemon startup scripts (e.g I</etc/rc.d> directory), so that when you reboot your machine the user's server will be correctly shutdown and will be back online the moment your system starts up. Also make sure to start each server under the username the server belongs to, or you are going to be in big trouble! =item * Proxy services (in forward or httpd accelerator mode) for the user's virtual host. Since the user will have to run their server on an unprivileged port (E<gt>1024), you will have to forward all requests from C<user.given.virtual.hostname:80> (which is C<user.given.virtual.hostname> without the default port 80) to C<your.machine.ip:port_assigned_to_user> . You will also have to tell the users to code their scripts so that any self referencing URLs are of the form C<user.given.virtual.hostname>. Letting the user run a mod_perl server immediately adds a requirement for the user to be able to restart and configure their own server. Only root can bind to port 80, this is why your users have to use port numbers greater than 1024. Another solution would be to use a setuid startup script, but think twice before you go with it, since if users can modify the scripts they will get a root access. For more information refer to the section "L<SUID Start-up Scripts|guide::control/SUID_Start_up_Scripts>". =item * Another problem you will have to solve is how to assign ports between users. Since users can pick any port above 1024 to run their server, you will have to lay down some rules here so that multiple servers do not conflict. A simple example will demonstrate the importance of this problem: I am a malicious user or I am just a rival of some fellow who runs his server on your ISP. All I need to do is to find out what port my rival's server is listening to (e.g. using C<netstat(8)>) and configure my own server to listen on the same port. Although I am unable to bind to this port, imagine what will happen when you reboot your system and my startup script happens to be run before my rivals! 
I get the port first, now all requests will be redirected to my server. I'll leave to your imagination what nasty things might happen then. Of course the ugly things will quickly be revealed, but not before the damage has been done. =back Basically you can preassign each user a port, without them having to worry about finding a free one, as well as enforce C<MaxClients> and similar values by implementing the following scenario: For each user have two configuration files, the main file, I<httpd.conf> (non-writable by user) and the user's file, I<username.httpd.conf> where they can specify their own configuration parameters and override the ones defined in I<httpd.conf>. Here is what the main configuration file looks like: httpd.conf ---------- # Global/default settings, the user may override some of these ... ... # Included so that user can set his own configuration Include username.httpd.conf # User-specific settings which will override any potentially # dangerous configuration directives in username.httpd.conf ... ... username.httpd.conf ------------------- # Settings that your user would like to add/override, # like <Location> and PerlModule directives, etc. Apache reads the global/default settings first. Then it reads the I<Include>'d I<username.httpd.conf> file with whatever settings the user has chosen, and finally it reads the user-specific settings that we don't want the user to override, such as the port number. Even if the user changes the port number in his I<username.httpd.conf> file, Apache reads our settings last, so they take precedence. Note that you can use L<Perl sections|guide::config/Apache_Configuration_in_Perl> to make the configuration much easier. =item 3 A much better, but costly solution is I<co-location>. Let the user hook his (or your) stand-alone machine into your network, and forget about this user. Of course either the user or you will have to undertake all the system administration chores and it will cost your client more money. Who are the people who seek mod_perl support? They are people who run serious projects/businesses. Money is not usually an obstacle. They can afford a stand alone box, thus achieving their goal of autonomy whilst keeping their ISP happy. =back =head2 Virtual Servers Technologies As we have just seen one of the obstacles of using mod_perl in ISP environments, is the problem of isolating customers using the same machine from each other. A number of virtual servers (don't confuse with virtual hosts) technologies (both commercial and Open Source) exist today. Here are some of them: =over =item * The User-mode Linux Kernel http://user-mode-linux.sourceforge.net/ User-Mode Linux is a safe, secure way of running Linux versions and Linux processes. Run buggy software, experiment with new Linux kernels or distributions, and poke around in the internals of Linux, all without risking your main Linux setup. User-Mode Linux gives you a virtual machine that may have more hardware and software virtual resources than your actual, physical computer. Disk storage for the virtual machine is entirely contained inside a single file on your physical machine. You can assign your virtual machine only the hardware access you want it to have. With properly limited access, nothing you do on the virtual machine can change or damage your real computer, or its software. So if you want to completely protect one user from another and yourself from your users this might be yet another alternative to the solutions suggested at the beginning of this chapter. 
=item * VMWare Technology Allows running a few instances of the same or different OSs on the same machine. This technology comes in two flavors: Open source: http://www.plex86.org/ Commercial: http://www.vmware.com/ So you may want to run a separate OS for each of your clients =item * freeVSD Technology freeVSD (http://www.freevsd.org), an open source project sponsored by Idaya Ltd. The software enables ISPs to securely partition their physical servers into many I<virtual servers>, each capable of running popular hosting applications such as Apache, Sendmail and MySQL. =item * S/390 IBM server Quoting from: http://www.s390.ibm.com/linux/vif/ "The S/390 Virtual Image Facility enables you to run tens to hundreds of Linux server images on a single S/390 server. It is ideally suited for those who want to move Linux and/or UNIX workloads deployed on multiple servers onto a single S/390 server, while maintaining the same number of distinct server images. This provides centralized management and operation of the multiple image environment, reducing complexity, easing administration and lowering costs." In two words, this a great solution to huge ISPs, as it allows you to run hundreds of mod_perl servers while having only one box to maintain. The drawback is the price :) Check out this scalable mailing list thread for more details from those who know: http://archive.develooper.com/[EMAIL PROTECTED]/msg00235.html =back =head1 Virtual Hosts in the guide If you are about to use I<Virtual Hosts> you might want to read these sections: L<Apache Configuration in Perl|guide::config/Apache_Configuration_in_Perl> L<Easing the Chores of Configuring Virtual Hosts with mod_macro|guide::config/Configuring_Apache___mod_perl_with_mod_macro> L<Is There a Way to Provide a Different startup.pl File for Each Individual Virtual Host|guide::config/Is_There_a_Way_to_Provide_a_Different_startup_pl_File_for_Each_Individual_Virtual_Host> L<Is There a Way to Modify @INC on a Per-Virtual-Host or Per-Location Basis.|guide::config/Is_There_a_Way_to_Modify__INC_on_a_Per_Virtual_Host_or_Per_Location_Basis_> L<A Script From One Virtual Host Calls a Script with the Same Path From the Other Virtual Host|guide::config/A_Script_From_One_Virtual_Host_Calls_a_Script_with_the_Same_Path_From_the_Other_Virtual_Host> =head1 Maintainers Maintainer is the person(s) you should contact with updates, corrections and patches. =over =item * Stas Bekman E<lt>stas (at) stason.orgE<gt> =back =head1 Authors =over =item * Stas Bekman E<lt>stas (at) stason.orgE<gt> =back Only the major authors are listed above. For contributors see the Changes file. =cut 1.1 modperl-docs/src/docs/general/perl_myth.pod Index: perl_myth.pod =================================================================== =head1 NAME Popular Perl Complaints and Myths =head1 Description This document tries to explain the myths about Perl and overturn the FUD certain bodies try to spread. =head1 Abbreviations =over 4 =item * B<M> = Misconception or Myth =item * B<R> = Response =back =head2 Interpreted vs. Compiled =over 4 =item M: Each dynamic perl page hit needs to load the Perl interpreter and compile the script, then run it each time a dynamic web page is hit. This dramatically decreases performance as well as makes Perl an unscalable model since so much overhead is required to search each page. =item R: This myth was true years ago before the advent of mod_perl. mod_perl loads the interpreter once into memory and never needs to load it again. Each perl program is only compiled once. 
The compiled version is then kept in memory and used each time the program is run. In this way there is no extra overhead when hitting a mod_perl page.

=back

=head3 Interpreted vs. Compiled (More Gory Details)

=over 4

=item R: Compiled code always has the potential to be faster than interpreted code. Ultimately, all interpreted code has to be converted to native instructions at some point, and this invariably has to be done by a compiled application. That said, an interpreted language CAN be faster than a comparable native application in certain situations, given certain common programming practices. For example, the allocation and de-allocation of memory can be a relatively expensive process in a tightly scoped compiled language, whereas interpreted languages typically use garbage collectors which don't need to do expensive deallocation in a tight loop, instead waiting until additional memory is absolutely necessary, or for a less computationally intensive period. Of course, using a garbage collector in C would eliminate this edge, but whereas garbage collectors are uncommon in C, Perl and most other interpreted languages have them built in.

It is also important to point out that few people use the full potential of their modern CPU with a single application. Modern CPUs are not only more than fast enough to run interpreted code; many processors include instruction sets designed to increase the performance of interpreted code.

=back

=head2 Perl is overly memory intensive, making it unscalable

=over 4

=item M: Each child process needs the Perl interpreter and all code in memory. Even with mod_perl, httpd processes tend to be overly large, slowing performance and requiring much more hardware.

=item R: In mod_perl the interpreter is loaded into the parent process and shared between the children. Also, when scripts are loaded into the parent and the parent forks a child httpd process, that child shares those scripts with the parent. So while the child may take 6MB of memory, 5MB of that might be shared, meaning it only really uses 1MB per child. Even 5MB of memory per child is not uncommon for most web applications in other languages.

Also, most modern operating systems support the concept of shared libraries. Perl can be compiled as a shared library, enabling the bulk of the perl interpreter to be shared between processes. Some executable formats on some platforms (I believe ELF is one such format) are able to share entire executable TEXT segments between unrelated processes.

=back

=head3 More Tuning Advice:

=over 4

=item * L<Vivek Khera's mod_perl performance tuning guide|faqs::mod_perl_tuning>

=item * L<Stas Bekman's Performance Guide|guide::performance>

=back

=head2 Not enough support, or tools to develop with Perl. (Myth)

=over 4

=item R: Of all web programming languages, Perl arguably has the most support and tools. B<CPAN> is a central repository of Perl modules which are freely downloadable and usually well supported. There are literally thousands of modules which make building web apps in Perl much easier. There are also countless mailing lists of extremely responsive Perl experts who usually respond to questions within an hour. There are also a number of Perl development environments to make building Perl Web applications easier. Just to name a few, there are C<Apache::ASP>, C<Mason>, C<embPerl>, C<ePerl>, etc.

=back

=head2 If Perl scales so well, how come no large sites use it? (myth)
=over 4

=item R: Actually, many large sites DO use Perl for the bulk of their web applications. Here are some, just as an example: B<e-Toys>, B<CitySearch>, B<Internet Movie Database> ( http://imdb.com ), B<Value Click> ( http://valueclick.com ), B<Paramount Digital Entertainment>, B<CMP> ( http://cmpnet.com ), B<HotBot Mail>/B<HotBot Homepages>, and B<DejaNews>, to name a few. Even B<Microsoft> has taken an interest in Perl via http://www.activestate.com/.

=back

=head2 Perl, even with mod_perl, is always slower than C.

=over 4

=item R: The Perl engine is written in C. There is no point arguing that Perl is faster than C, because anything written in Perl could obviously be re-written in C. The same holds true for arguing that C is faster than assembly. There are two issues to consider here. First of all, many times a web application written in Perl B<CAN be faster> than one in C, thanks to the low-level optimizations in the Perl compiler. In other words, it's easier to write poorly written C than well-written Perl. Secondly, it's important to weigh all factors when choosing a language to build a web application in. Time to market is often one of the highest priorities in creating a web application. Development in Perl can often be twice as fast as in C. This is mostly due to the differences in the languages themselves, as well as the wealth of free examples and modules which speed development significantly. Perl's speedy development time can be a huge competitive advantage.

=back

=head2 Java does away with the need for Perl.

=over 4

=item M: Perl had its place in the past, but now there's Java, and Java will kill Perl.

=item R: Java and Perl are actually more complementary languages than competitive ones. It's widely accepted that server-side Java solutions such as C<JServ>, C<JSP> and C<JRUN> are far slower than mod_perl solutions (see the myth about speed above). Even so, Java is often used as the front end for server-side Perl applications. Unlike Perl, with Java you can create advanced client-side applications. Combined with the strength of server-side Perl, these client-side Java applications can be made very powerful.

=back

=head2 Perl can't create advanced client side applications

=over 4

=item R: True. There are some client-side Perl solutions, like PerlScript in MSIE 5.0, but all client-side Perl requires the user to have the Perl interpreter on their local machine, and most users do not. Most Perl programmers who need to create an advanced client-side application use Java as the client-side programming language and Perl as the server-side solution.

=back

=head2 ASP makes Perl obsolete as a web programming language.

=over 4

=item M: With Perl you have to write individual programs for each set of pages. With ASP you can write simple code directly within HTML pages. ASP is the Perl killer.

=item R: There are many solutions which allow you to embed Perl in web pages just like ASP. In fact, you can actually use Perl IN ASP pages with PerlScript. Other solutions include: C<Mason>, C<Apache::ASP>, C<ePerl>, C<embPerl> and C<XPP>. Also, Microsoft and ActiveState have worked very hard to make Perl run as well on NT as on Unix. You can even create COM modules in Perl that can be used from within ASP pages.
Some other advantages Perl has over ASP: mod_perl is usually much faster then ASP, Perl has much more example code and full programs which are freely downloadable, and Perl is cross platform, able to run on Solaris, Linux, SCO, Digital Unix, Unix V, AIX, OS2, VMS MacOS, Win95-98 and NT to name a few. Also, Benchmarks show that embedded Perl solutions outperform ASP/VB on IIS by several orders of magnitude. Perl is a much easier language for some to learn, especially those with a background in C or C++. =back =head1 Credits Thanks to the mod_perl list for all of the good information and criticism. I'd especially like to thank, =over 4 =item * Stas Bekman E<lt>[EMAIL PROTECTED]<gt> =item * Thornton Prime E<lt>[EMAIL PROTECTED]<gt> =item * Chip Turner E<lt>[EMAIL PROTECTED]<gt> =item * Clinton E<lt>[EMAIL PROTECTED]<gt> =item * Joshua Chamas E<lt>[EMAIL PROTECTED]<gt> =item * John Edstrom E<lt>[EMAIL PROTECTED]<gt> =item * Rasmus Lerdorf E<lt>[EMAIL PROTECTED]<gt> =item * Nedim Cholich E<lt>[EMAIL PROTECTED]<gt> =item * Mike Perry E<lt> http://www.icorp.net/icorp/feedback.htm E<gt> =item * Finally, I'd like to thank Robert Santos E<lt>[EMAIL PROTECTED]<gt>, CyberNation's lead Business Development guy for inspiring this document. =back =head1 Maintainers Maintainer is the person(s) you should contact with updates, corrections and patches. =over =item * Contact the L<mod_perl docs list|maillist::list-docs-dev> =back =head1 Authors =over =item * Adam Pisoni E<lt>[EMAIL PROTECTED]<gt> =back Only the major authors are listed above. For contributors see the Changes file. =cut 1.1 modperl-docs/src/docs/general/perl_reference.pod Index: perl_reference.pod =================================================================== =head1 NAME Perl Reference =head1 Description This document was born because some users are reluctant to learn Perl, prior to jumping into mod_perl. I will try to cover some of the most frequent pure Perl questions being asked at the list. Before you decide to skip this chapter make sure you know all the information provided here. The rest of the Guide assumes that you have read this chapter and understood it. =head1 perldoc's Rarely Known But Very Useful Options First of all, I want to stress that you cannot become a Perl hacker without knowing how to read Perl documentation and search through it. Books are good, but an easily accessible and searchable Perl reference at your fingertips is a great time saver. It always has the up-to-date information for the version of perl you're using. Of course you can use online Perl documentation at the Web. The two major sites are http://www.perldoc.com and http://theoryx5.uwinnipeg.ca/CPAN/perl/. The C<perldoc> utility provides you with access to the documentation installed on your system. To find out what Perl manpages are available execute: % perldoc perl To find what functions perl has, execute: % perldoc perlfunc To learn the syntax and to find examples of a specific function, you would execute (e.g. for C<open()>): % perldoc -f open Note: In perl5.005_03 and earlier, there is a bug in this and the C<-q> options of C<perldoc>. It won't call C<pod2man>, but will display the section in POD format instead. Despite this bug it's still readable and very useful. The Perl FAQ (I<perlfaq> manpage) is in several sections. To search through the sections for C<open> you would execute: % perldoc -q open This will show you all the matching Question and Answer sections, still in POD format. 
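Two other options are worth remembering, assuming the version of C<perldoc> on your system supports them (check C<perldoc perldoc> to confirm): C<-l> prints the path of the file that contains the documentation, and C<-m> displays the module's source code itself, which is often the quickest way to see how something is implemented. For example (the module name C<CGI> is just an illustration):

  % perldoc -l CGI    # print the path to the installed CGI.pm
  % perldoc -m CGI    # page through the module's source code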
To read the I<perldoc> manpage you would execute: % perldoc perldoc =head1 Tracing Warnings Reports Sometimes it's very hard to understand what a warning is complaining about. You see the source code, but you cannot understand why some specific snippet produces that warning. The mystery often results from the fact that the code can be called from different places if it's located inside a subroutine. Here is an example: warnings.pl ----------- #!/usr/bin/perl -w use strict; correct(); incorrect(); sub correct{ print_value("Perl"); } sub incorrect{ print_value(); } sub print_value{ my $var = shift; print "My value is $var\n"; } In the code above, print_value() prints the passed value. Subroutine correct() passes the value to print, but in subroutine incorrect() we forgot to pass it. When we run the script: % ./warnings.pl we get the warning: Use of uninitialized value at ./warnings.pl line 16. Perl complains about an undefined variable C<$var> at the line that attempts to print its value: print "My value is $var\n"; But how do we know why it is undefined? The reason here obviously is that the calling function didn't pass the argument. But how do we know who was the caller? In our example there are two possible callers, in the general case there can be many of them, perhaps located in other files. We can use the caller() function, which tells who has called us, but even that might not be enough: it's possible to have a longer sequence of called subroutines, and not just two. For example, here it is sub third() which is at fault, and putting sub caller() in sub second() would not help us very much: sub third{ second(); } sub second{ my $var = shift; first($var); } sub first{ my $var = shift; print "Var = $var\n" } The solution is quite simple. What we need is a full calls stack trace to the call that triggered the warning. The C<Carp> module comes to our aid with its cluck() function. Let's modify the script by adding a couple of lines. The rest of the script is unchanged. warnings2.pl ----------- #!/usr/bin/perl -w use strict; use Carp (); local $SIG{__WARN__} = \&Carp::cluck; correct(); incorrect(); sub correct{ print_value("Perl"); } sub incorrect{ print_value(); } sub print_value{ my $var = shift; print "My value is $var\n"; } Now when we execute it, we see: Use of uninitialized value at ./warnings2.pl line 19. main::print_value() called at ./warnings2.pl line 14 main::incorrect() called at ./warnings2.pl line 7 Take a moment to understand the calls stack trace. The deepest calls are printed first. So the second line tells us that the warning was triggered in print_value(); the third, that print_value() was called by subroutine, incorrect(). script => incorrect() => print_value() We go into C<incorrect()> and indeed see that we forgot to pass the variable. Of course when you write a subroutine like C<print_value> it would be a good idea to check the passed arguments before starting execution. We omitted that step to contrive an easily debugged example. Sure, you say, I could find that problem by simple inspection of the code! Well, you're right. But I promise you that your task would be quite complicated and time consuming if your code has some thousands of lines. In addition, under mod_perl, certain uses of the C<eval> operator and "here documents" are known to throw off Perl's line numbering, so the messages reporting warnings and errors can have incorrect line numbers. 
(See L<Finding the Line Which Triggered the Error or Warning|guide::debug/Finding_the_Line_Which_Triggered> for more information). Getting the trace helps a lot. =head1 Variables Globally, Lexically Scoped And Fully Qualified META: this material is new and requires polishing so read with care. You will hear a lot about namespaces, symbol tables and lexical scoping in Perl discussions, but little of it will make any sense without a few key facts: =head2 Symbols, Symbol Tables and Packages; Typeglobs There are two important types of symbol: package global and lexical. We will talk about lexical symbols later, for now we will talk only about package global symbols, which we will refer to simply as I<global symbols>. The names of pieces of your code (subroutine names) and the names of your global variables are symbols. Global symbols reside in one symbol table or another. The code itself and the data do not; the symbols are the names of pointers which point (indirectly) to the memory areas which contain the code and data. (Note for C/C++ programmers: we use the term `pointer' in a general sense of one piece of data referring to another piece of data not in a specific sense as used in C or C++.) There is one symbol table for each package, (which is why I<global symbols> are really I<package global symbols>). You are always working in one package or another. Like in C, where the first function you write must be called main(), the first statement of your first Perl script is in package C<main::> which is the default package. Unless you say otherwise by using the C<package> statement, your symbols are all in package C<main::>. You should be aware straight away that files and packages are I<not related>. You can have any number of packages in a single file; and a single package can be in one file or spread over many files. However it is very common to have a single package in a single file. To declare a package you write: package mypackagename; From the following line you are in package C<mypackagename> and any symbols you declare reside in that package. When you create a symbol (variable, subroutine etc.) Perl uses the name of the package in which you are currently working as a prefix to create the fully qualified name of the symbol. When you create a symbol, Perl creates a symbol table entry for that symbol in the current package's symbol table (by default C<main::>). Each symbol table entry is called a I<typeglob>. Each typeglob can hold information on a scalar, an array, a hash, a subroutine (code), a filehandle, a directory handle and a format, each of which all have the same name. So you see now that there are two indirections for a global variable: the symbol, (the thing's name), points to its typeglob and the typeglob for the thing's type (scalar, array, etc.) points to the data. If we had a scalar and an array with the same name their name would point to the same typeglob, but for each type of data the typeglob points to somewhere different and so the scalar's data and the array's data are completely separate and independent, they just happen to have the same name. Most of the time, only one part of a typeglob is used (yes, it's a bit wasteful). You will by now know that you distinguish between them by using what the authors of the Camel book call a I<funny character>. So if we have a scalar called `C<line>' we would refer to it in code as C<$line>, and if we had an array of the same name, that would be written, C<@line>. 
Both would point to the same typeglob (which would be called
C<*line>), but because of the I<funny character> (also known as
I<decoration>) perl won't confuse the two. Of course we might confuse
ourselves, so some programmers don't ever use the same name for more
than one type of variable.

Every global symbol is in some package's symbol table. To refer to a
global symbol we could write the I<fully qualified> name,
e.g. C<$main::line>. If we are in the same package as the symbol we
can omit the package name, e.g. C<$line> (unless you use the
C<strict> pragma and then you will have to predeclare the variable
using the C<vars> pragma). We can also omit the package name if we
have imported the symbol into our current package's namespace. If we
want to refer to a symbol that is in another package and which we
haven't imported we must use the fully qualified name,
e.g. C<$otherpkg::box>.

Most of the time you do not need to use the fully qualified symbol
name, because most of the time you will refer to package variables
from within the package. This is very much like C++ class variables.
You can work entirely within package C<main::> and never even know
you are using a package, nor that the symbols have package names. In
a way, this is a pity because you may fail to learn about packages
and they are extremely useful.

The exception is when you I<import> the variable from another
package. This creates an alias for the variable in the I<current>
package, so that you can access it without using the fully qualified
name.

Whilst global variables are useful for sharing data and are necessary
in some contexts, it is usually wisest to minimize their use and use
I<lexical variables>, discussed next, instead.

Note that when you create a variable, the low-level business of
allocating memory to store the information is handled automatically
by Perl. The interpreter keeps track of the chunks of memory to which
the pointers are pointing and takes care of undefining variables.
When all references to a variable have ceased to exist, the perl
garbage collector is free to take back the memory used, ready for
recycling. However, perl almost never returns memory it has already
used to the operating system during the lifetime of the process.

=head3 Lexical Variables and Symbols

The symbols for lexical variables (i.e. those declared using the
keyword C<my>) are the only symbols which do I<not> live in a symbol
table. Because of this, they are not available from outside the block
in which they are declared. There is no typeglob associated with a
lexical variable and a lexical variable can refer only to a scalar,
an array, a hash or a code reference. (Since perl-5.6 it can also
refer to a file glob).

If you need access to the data from outside the package then you can
return it from a subroutine, or you can create a global variable
(i.e. one which has a package prefix) which points or refers to it
and return that. The pointer or reference must be global so that you
can refer to it by a fully qualified name. But just as in C, try to
avoid having global variables. Using OO methods generally solves this
problem, by providing methods to get and set the desired value within
the object that can be lexically scoped inside the package and passed
by reference.

The phrase "lexical variable" is a bit of a misnomer; we are really
talking about "lexical symbols". The data can be referenced by a
global symbol too, and in such cases when the lexical symbol goes out
of scope the data will still be accessible through the global symbol.
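Here is a minimal sketch of that last point (the variable and
subroutine names are invented for this example): the lexical C<@data>
goes out of scope when make_list() returns, but because a reference
to it is stored in the package global C<$kept>, the data stays alive
and reachable:

  #!/usr/bin/perl -w
  use strict;
  use vars qw($kept);    # package global that will keep the data alive

  sub make_list {
      my @data = (1, 2, 3);   # lexical, goes out of scope when we return
      return \@data;          # but a reference to the data escapes
  }

  $kept = make_list();
  print "@$kept\n";           # prints: 1 2 3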
This is perfectly legitimate and cannot be compared to the terrible
mistake of taking a pointer to an automatic C variable and returning
it from a function--when the pointer is dereferenced there will be a
segmentation fault. (Note for C/C++ programmers: having a function
return a pointer to an auto variable is a disaster in C or C++; the
Perl equivalent, returning a reference to a lexical variable created
in a function, is normal and useful.)

=over

=item * C<my()> vs. C<use vars>:

With use vars(), you are making an entry in the symbol table, and you
are telling the compiler that you are going to be referencing that
entry without an explicit package name. With my(), NO ENTRY IS PUT IN
THE SYMBOL TABLE. The compiler figures out C<at compile time> which
my() variables (i.e. lexical variables) are the same as each other,
and once you hit execute time you cannot go looking those variables
up in the symbol table.

=item * C<my()> vs. C<local()>:

local() creates a temporally-limited package-based scalar, array,
hash, or glob -- when the scope of definition is exited at runtime,
the previous value (if any) is restored. References to such a
variable are *also* global... only the value changes. (Aside: that is
what causes variable suicide. :)

my() creates a lexically-limited non-package-based scalar, array, or
hash -- when the scope of definition is exited at compile-time, the
variable ceases to be accessible. Any references to such a variable
at runtime turn into unique anonymous variables on each scope exit.

=back

=head2 Additional reading references

For more information see: L<Using global variables and sharing them
between modules/packages|guide::perl/Using_Global_Variables_and_Shari>
and an article by Mark-Jason Dominus about how Perl handles variables
and namespaces, and the difference between C<use vars()> and C<my()>
- http://www.plover.com/~mjd/perl/FAQs/Namespaces.html .

=head1 my() Scoped Variable in Nested Subroutines

Before we proceed, let's make the assumption that we want to develop
the code under the C<strict> pragma. We will use lexically scoped
variables (with the help of the my() operator) whenever possible.

=head2 The Poison

Let's look at this code:

  nested.pl
  -----------
  #!/usr/bin/perl

  use strict;

  sub print_power_of_2 {
    my $x = shift;

    sub power_of_2 {
      return $x ** 2;
    }

    my $result = power_of_2();
    print "$x^2 = $result\n";
  }

  print_power_of_2(5);
  print_power_of_2(6);

Don't let the weird subroutine names fool you: the print_power_of_2()
subroutine should print the square of the number passed to it. Let's
run the code and see whether it works:

  % ./nested.pl

  5^2 = 25
  6^2 = 25

Ouch, something is wrong. Maybe there is a bug in Perl and it doesn't
work correctly with the number 6? Let's try again using 5 and 7:

  print_power_of_2(5);
  print_power_of_2(7);

And run it:

  % ./nested.pl

  5^2 = 25
  7^2 = 25

Wow, does it work only for 5? How about using 3 and 5:

  print_power_of_2(3);
  print_power_of_2(5);

and the result is:

  % ./nested.pl

  3^2 = 9
  5^2 = 9

Now we start to understand--only the first call to the
print_power_of_2() function works correctly. This makes us think that
our code has some kind of memory for the results of the first
execution, or that it ignores the arguments in subsequent executions.

=head2 The Diagnosis

Let's follow the guidelines and use the C<-w> flag. Now execute the
code:

  % ./nested.pl

  Variable "$x" will not stay shared at ./nested.pl line 9.
  5^2 = 25
  6^2 = 25

We have never seen such a warning message before and we don't quite
understand what it means.
The C<diagnostics> pragma will certainly help us. Let's prepend this pragma before the C<strict> pragma in our code: #!/usr/bin/perl -w use diagnostics; use strict; And execute it: % ./nested.pl Variable "$x" will not stay shared at ./nested.pl line 10 (#1) (W) An inner (nested) named subroutine is referencing a lexical variable defined in an outer subroutine. When the inner subroutine is called, it will probably see the value of the outer subroutine's variable as it was before and during the *first* call to the outer subroutine; in this case, after the first call to the outer subroutine is complete, the inner and outer subroutines will no longer share a common value for the variable. In other words, the variable will no longer be shared. Furthermore, if the outer subroutine is anonymous and references a lexical variable outside itself, then the outer and inner subroutines will never share the given variable. This problem can usually be solved by making the inner subroutine anonymous, using the sub {} syntax. When inner anonymous subs that reference variables in outer subroutines are called or referenced, they are automatically rebound to the current values of such variables. 5^2 = 25 6^2 = 25 Well, now everything is clear. We have the B<inner> subroutine power_of_2() and the B<outer> subroutine print_power_of_2() in our code. When the inner power_of_2() subroutine is called for the first time, it sees the value of the outer print_power_of_2() subroutine's C<$x> variable. On subsequent calls the inner subroutine's C<$x> variable won't be updated, no matter what new values are given to C<$x> in the outer subroutine. There are two copies of the C<$x> variable, no longer a single one shared by the two routines. =head2 The Remedy The C<diagnostics> pragma suggests that the problem can be solved by making the inner subroutine anonymous. An anonymous subroutine can act as a I<closure> with respect to lexically scoped variables. Basically this means that if you define a subroutine in a particular B<lexical> context at a particular moment, then it will run in that same context later, even if called from outside that context. The upshot of this is that when the subroutine B<runs>, you get the same copies of the lexically scoped variables which were visible when the subroutine was B<defined>. So you can pass arguments to a function when you define it, as well as when you invoke it. Let's rewrite the code to use this technique: anonymous.pl -------------- #!/usr/bin/perl use strict; sub print_power_of_2 { my $x = shift; my $func_ref = sub { return $x ** 2; }; my $result = &$func_ref(); print "$x^2 = $result\n"; } print_power_of_2(5); print_power_of_2(6); Now C<$func_ref> contains a reference to an anonymous subroutine, which we later use when we need to get the power of two. Since it is anonymous, the subroutine will automatically be rebound to the new value of the outer scoped variable C<$x>, and the results will now be as expected. Let's verify: % ./anonymous.pl 5^2 = 25 6^2 = 36 So we can see that the problem is solved. =head1 Understanding Closures -- the Easy Way In Perl, a closure is just a subroutine that refers to one or more lexical variables declared outside the subroutine itself and must therefore create a distinct clone of the environment on the way out. And both named subroutines and anonymous subroutines can be closures. 
Here's how to tell if a subroutine is a closure or not:

  for (1..5) {
      push @a, sub { "hi there" };
  }

  for (1..5) {
      {
          my $b;
          push @b, sub { $b."hi there" };
      }
  }

  print "anon normal:\n", join "\t\n",@a,"\n";
  print "anon closure:\n",join "\t\n",@b,"\n";

which generates:

  anon normal:
  CODE(0x80568e4)
  CODE(0x80568e4)
  CODE(0x80568e4)
  CODE(0x80568e4)
  CODE(0x80568e4)

  anon closure:
  CODE(0x804b4c0)
  CODE(0x8056b54)
  CODE(0x8056bb4)
  CODE(0x80594d8)
  CODE(0x8059538)

Note how each code reference from the non-closure is identical, but
the closure form must generate distinct coderefs to point at the
distinct instances of the closure.

And now the same with named subroutines:

  for (1..5) {
      sub a { "hi there" };
      push @a, \&a;
  }

  for (1..5) {
      {
          my $b;
          sub b { $b."hi there" };
          push @b, \&b;
      }
  }

  print "normal:\n", join "\t\n",@a,"\n";
  print "closure:\n",join "\t\n",@b,"\n";

which generates:

  normal:
  CODE(0x80568c0)
  CODE(0x80568c0)
  CODE(0x80568c0)
  CODE(0x80568c0)
  CODE(0x80568c0)

  closure:
  CODE(0x8056998)
  CODE(0x8056998)
  CODE(0x8056998)
  CODE(0x8056998)
  CODE(0x8056998)

We can see that both versions have generated the same code reference.
For the subroutine I<a> it's easy, since it doesn't include any
lexical variables defined outside it in the same lexical scope. As
for the subroutine I<b>, it's indeed a closure, but Perl won't
recompile it since it's a named subroutine (see the I<perlsub>
manpage). It's something that we don't want to happen in our code
unless we want it for this special effect, similar to I<static>
variables in C.

This is the underpinning of that famous I<"won't stay shared">
message. A I<my> variable in a named subroutine context generates
identical code references and therefore ignores any future changes to
the lexical variables outside of it.

=head1 When You Cannot Get Rid of The Inner Subroutine

First you might wonder, why in the world would someone need to define
an inner subroutine? Well, for example, to reduce some of Perl's
script startup overhead you might decide to write a daemon that will
compile the scripts and modules only once, and cache the pre-compiled
code in memory. When some script is to be executed, you just tell the
daemon the name of the script to run and it will do the rest and do
it much faster since compilation has already taken place.

Seems like an easy task, and it is. The only problem is: once the
script is compiled, how do you execute it? Or let's put it the other
way: after it was executed for the first time and it stays compiled
in the daemon's memory, how do you call it again? If you could get
all developers to code their scripts so each has a subroutine called
run() that will actually execute the code in the script, then we've
solved half the problem.

But how does the daemon know to refer to some specific script if they
all run in the C<main::> namespace? One solution might be to ask the
developers to declare a package in each and every script, and for the
package name to be derived from the script name. However, since there
is a chance that there will be more than one script with the same
name but residing in different directories, in order to prevent
namespace collisions the directory has to be a part of the package
name too. And don't forget that the script may be moved from one
directory to another, so you will have to make sure that the package
name is corrected every time the script gets moved.

But why enforce these strange rules on developers, when we can
arrange for our daemon to do this work?
For every script that the daemon is about to execute for the first time, the script should be wrapped inside the package whose name is constructed from the mangled path to the script and a subroutine called run(). For example if the daemon is about to execute the script I</tmp/hello.pl>: hello.pl -------- #!/usr/bin/perl print "Hello\n"; Prior to running it, the daemon will change the code to be: wrapped_hello.pl ---------------- package cache::tmp::hello_2epl; sub run{ #!/usr/bin/perl print "Hello\n"; } The package name is constructed from the prefix C<cache::>, each directory separation slash is replaced with C<::>, and non alphanumeric characters are encoded so that for example C<.> (a dot) becomes C<_2e> (an underscore followed by the ASCII code for a dot in hex representation). % perl -e 'printf "%x",ord(".")' prints: C<2e>. The underscore is the same you see in URL encoding except the C<%> character is used instead (C<%2E>), but since C<%> has a special meaning in Perl (prefix of hash variable) it couldn't be used. Now when the daemon is requested to execute the script I</tmp/hello.pl>, all it has to do is to build the package name as before based on the location of the script and call its run() subroutine: use cache::tmp::hello_2epl; cache::tmp::hello_2epl::run(); We have just written a partial prototype of the daemon we wanted. The only outstanding problem is how to pass the path to the script to the daemon. This detail is left as an exercise for the reader. If you are familiar with the C<Apache::Registry> module, you know that it works in almost the same way. It uses a different package prefix and the generic function is called handler() and not run(). The scripts to run are passed through the HTTP protocol's headers. Now you understand that there are cases where your normal subroutines can become inner, since if your script was a simple: simple.pl --------- #!/usr/bin/perl sub hello { print "Hello" } hello(); Wrapped into a run() subroutine it becomes: simple.pl --------- package cache::simple_2epl; sub run{ #!/usr/bin/perl sub hello { print "Hello" } hello(); } Therefore, hello() is an inner subroutine and if you have used my() scoped variables defined and altered outside and used inside hello(), it won't work as you expect starting from the second call, as was explained in the previous section. =head2 Remedies for Inner Subroutines First of all there is nothing to worry about, as long as you don't forget to turn the warnings On. If you do happen to have the "L<my() Scoped Variable in Nested Subroutines|guide::perl/my_Scoped_Variable_in_Nested_S>" problem, Perl will always alert you. Given that you have a script that has this problem, what are the ways to solve it? There are many of them and we will discuss some of them here. We will use the following code to show the different solutions. multirun.pl ----------- #!/usr/bin/perl -w use strict; for (1..3){ print "run: [time $_]\n"; run(); } sub run{ my $counter = 0; increment_counter(); increment_counter(); sub increment_counter{ $counter++; print "Counter is equal to $counter !\n"; } } # end of sub run This code executes the run() subroutine three times, which in turn initializes the C<$counter> variable to 0, every time it is executed and then calls the inner subroutine increment_counter() twice. Sub increment_counter() prints C<$counter>'s value after incrementing it. One might expect to see the following output: run: [time 1] Counter is equal to 1 ! Counter is equal to 2 ! run: [time 2] Counter is equal to 1 ! 
  Counter is equal to 2 !
  run: [time 3]
  Counter is equal to 1 !
  Counter is equal to 2 !

But as we have already learned from the previous sections, this is
not what we are going to see. Indeed, when we run the script we see:

  % ./multirun.pl

  Variable "$counter" will not stay shared at ./multirun.pl line 18.
  run: [time 1]
  Counter is equal to 1 !
  Counter is equal to 2 !
  run: [time 2]
  Counter is equal to 3 !
  Counter is equal to 4 !
  run: [time 3]
  Counter is equal to 5 !
  Counter is equal to 6 !

Obviously, the C<$counter> variable is not reinitialized on each
execution of run(). It retains its value from the previous execution,
and sub increment_counter() increments that.

One of the workarounds is to use globally declared variables, with
the C<vars> pragma.

  multirun1.pl
  -----------
  #!/usr/bin/perl -w

  use strict;
  use vars qw($counter);

  for (1..3){
    print "run: [time $_]\n";
    run();
  }

  sub run {

    $counter = 0;

    increment_counter();
    increment_counter();

    sub increment_counter{
      $counter++;
      print "Counter is equal to $counter !\n";
    }

  } # end of sub run

If you run this and the other solutions offered below, the expected
output will be generated:

  % ./multirun1.pl

  run: [time 1]
  Counter is equal to 1 !
  Counter is equal to 2 !
  run: [time 2]
  Counter is equal to 1 !
  Counter is equal to 2 !
  run: [time 3]
  Counter is equal to 1 !
  Counter is equal to 2 !

By the way, the warning we saw before has gone, and so has the
problem, since there is no C<my()> (lexically defined) variable used
in the nested subroutine.

Another approach is to use fully qualified variables. This is better,
since less memory will be used, but it adds a typing overhead:

  multirun2.pl
  -----------
  #!/usr/bin/perl -w

  use strict;

  for (1..3){
    print "run: [time $_]\n";
    run();
  }

  sub run {

    $main::counter = 0;

    increment_counter();
    increment_counter();

    sub increment_counter{
      $main::counter++;
      print "Counter is equal to $main::counter !\n";
    }

  } # end of sub run

You can also pass the variable to the subroutine by value and make
the subroutine return it after it was updated. This adds time and
memory overheads, so it may not be a good idea if the variable can be
very large, or if speed of execution is an issue.

Don't rely on the fact that the variable is small during the
development of the application; it can grow quite big in situations
you don't expect. For example, a very simple HTML form text entry
field can return a few megabytes of data if one of your users is
bored and wants to test how good your code is. It's not uncommon to
see users copy-and-paste 10Mb core dump files into a form's text
fields and then submit them for your script to process.

  multirun3.pl
  -----------
  #!/usr/bin/perl -w

  use strict;

  for (1..3){
    print "run: [time $_]\n";
    run();
  }

  sub run {

    my $counter = 0;

    $counter = increment_counter($counter);
    $counter = increment_counter($counter);

    sub increment_counter{
      my $counter = shift;

      $counter++;
      print "Counter is equal to $counter !\n";

      return $counter;
    }

  } # end of sub run

Finally, you can use references to do the job. The version of
increment_counter() below accepts a reference to the C<$counter>
variable and increments its value after first dereferencing it. When
you use a reference, the variable you use inside the function is
physically the same bit of memory as the one outside the function.
This technique is often used to enable a called function to modify
variables in a calling function.
multirun4.pl ----------- #!/usr/bin/perl -w use strict; for (1..3){ print "run: [time $_]\n"; run(); } sub run { my $counter = 0; increment_counter(\$counter); increment_counter(\$counter); sub increment_counter{ my $r_counter = shift; $$r_counter++; print "Counter is equal to $$r_counter !\n"; } } # end of sub run Here is yet another and more obscure reference usage. We modify the value of C<$counter> inside the subroutine by using the fact that variables in C<@_> are aliases for the actual scalar parameters. Thus if you called a function with two arguments, those would be stored in C<$_[0]> and C<$_[1]>. In particular, if an element C<$_[0]> is updated, the corresponding argument is updated (or an error occurs if it is not updatable as would be the case of calling the function with a literal, e.g. I<increment_counter(5)>). multirun5.pl ----------- #!/usr/bin/perl -w use strict; for (1..3){ print "run: [time $_]\n"; run(); } sub run { my $counter = 0; increment_counter($counter); increment_counter($counter); sub increment_counter{ $_[0]++; print "Counter is equal to $_[0] !\n"; } } # end of sub run The approach given above should be properly documented of course. Here is a solution that avoids the problem entirely by splitting the code into two files; the first is really just a wrapper and loader, the second file contains the heart of the code. multirun6.pl ----------- #!/usr/bin/perl -w use strict; require 'multirun6-lib.pl' ; for (1..3){ print "run: [time $_]\n"; run(); } Separate file: multirun6-lib.pl ---------------- use strict ; my $counter; sub run { $counter = 0; increment_counter(); increment_counter(); } sub increment_counter{ $counter++; print "Counter is equal to $counter !\n"; } 1 ; Now you have at least six workarounds to choose from. For more information please refer to perlref and perlsub manpages. =head1 use(), require(), do(), %INC and @INC Explained =head2 The @INC array C<@INC> is a special Perl variable which is the equivalent of the shell's C<PATH> variable. Whereas C<PATH> contains a list of directories to search for executables, C<@INC> contains a list of directories from which Perl modules and libraries can be loaded. When you use(), require() or do() a filename or a module, Perl gets a list of directories from the C<@INC> variable and searches them for the file it was requested to load. If the file that you want to load is not located in one of the listed directories, you have to tell Perl where to find the file. You can either provide a path relative to one of the directories in C<@INC>, or you can provide the full path to the file. =head2 The %INC hash C<%INC> is another special Perl variable that is used to cache the names of the files and the modules that were successfully loaded and compiled by use(), require() or do() statements. Before attempting to load a file or a module with use() or require(), Perl checks whether it's already in the C<%INC> hash. If it's there, the loading and therefore the compilation are not performed at all. Otherwise the file is loaded into memory and an attempt is made to compile it. do() does unconditional loading--no lookup in the C<%INC> hash is made. If the file is successfully loaded and compiled, a new key-value pair is added to C<%INC>. The key is the name of the file or module as it was passed to the one of the three functions we have just mentioned, and if it was found in any of the C<@INC> directories except C<"."> the value is the full path to it in the file system. 
The following examples will make it easier to understand the logic.

First, let's see what the contents of C<@INC> are on my system:

  % perl -e 'print join "\n", @INC'
  /usr/lib/perl5/5.00503/i386-linux
  /usr/lib/perl5/5.00503
  /usr/lib/perl5/site_perl/5.005/i386-linux
  /usr/lib/perl5/site_perl/5.005
  .

Notice that C<.> (the current directory) is the last directory in the
list.

Now let's load the module C<strict.pm> and see the contents of
C<%INC>:

  % perl -e 'use strict; print map {"$_ => $INC{$_}\n"} keys %INC'

  strict.pm => /usr/lib/perl5/5.00503/strict.pm

Since C<strict.pm> was found in the I</usr/lib/perl5/5.00503/>
directory and I</usr/lib/perl5/5.00503/> is a part of C<@INC>,
C<%INC> includes the full path as the value for the key C<strict.pm>.

Now let's create the simplest module in C</tmp/test.pm>:

  test.pm
  -------
  1;

It does nothing, but returns a true value when loaded. Now let's load
it in different ways:

  % cd /tmp
  % perl -e 'use test; print map {"$_ => $INC{$_}\n"} keys %INC'

  test.pm => test.pm

Since the file was found relative to C<.> (the current directory),
the relative path is inserted as the value. If we alter C<@INC> by
adding I</tmp> to the end:

  % cd /tmp
  % perl -e 'BEGIN{push @INC, "/tmp"} use test; \
  print map {"$_ => $INC{$_}\n"} keys %INC'

  test.pm => test.pm

Here we still get the relative path, since the module was found first
relative to C<".">. The directory I</tmp> was placed after C<.> in the
list. If we execute the same code from a different directory, the
C<"."> directory won't match,

  % cd /
  % perl -e 'BEGIN{push @INC, "/tmp"} use test; \
  print map {"$_ => $INC{$_}\n"} keys %INC'

  test.pm => /tmp/test.pm

so we get the full path. We can also prepend the path with unshift(),
so it will be used for matching before C<"."> and therefore we will
get the full path as well:

  % cd /tmp
  % perl -e 'BEGIN{unshift @INC, "/tmp"} use test; \
  print map {"$_ => $INC{$_}\n"} keys %INC'

  test.pm => /tmp/test.pm

The code:

  BEGIN{unshift @INC, "/tmp"}

can be replaced with the more elegant:

  use lib "/tmp";

This is almost equivalent to our C<BEGIN> block and is the
recommended approach.

These approaches to modifying C<@INC> can be labor intensive, since
if you want to move the script around in the file system you have to
modify the path. This can be painful, for example, when you move your
scripts from development to a production server.

There is a module called C<FindBin> which solves this problem in the
plain Perl world, but unfortunately it won't work under mod_perl,
since it's a module and like any module it's loaded only once. So the
first script using it will have all the settings correct, but the
rest of the scripts will not if they are located in a different
directory from the first. For the sake of completeness, I'll present
this module anyway.

If you use this module, you don't need to write a hard-coded path.
The following snippet does all the work for you (the file is
I</tmp/load.pl>):

  load.pl
  -------
  #!/usr/bin/perl

  use FindBin ();
  use lib "$FindBin::Bin";
  use test;
  print "test.pm => $INC{'test.pm'}\n";

In the above example C<$FindBin::Bin> is equal to I</tmp>. If we move
the script somewhere else, e.g. I</tmp/new_dir>, then C<$FindBin::Bin>
will be equal to I</tmp/new_dir>.

  % /tmp/load.pl

  test.pm => /tmp/test.pm

This is just like C<use lib> except that no hard-coded path is
required.

You can use this workaround to make it work under mod_perl.

  do 'FindBin.pm';
  unshift @INC, "$FindBin::Bin";
  require test;
  #maybe test::import( ...
) here if need to import stuff This has a slight overhead because it will load from disk and recompile the C<FindBin> module on each request. So it may not be worth it. =head2 Modules, Libraries and Program Files Before we proceed, let's define what we mean by I<module>, I<library> and I<program file>. =over =item * Libraries These are files which contain Perl subroutines and other code. When these are used to break up a large program into manageable chunks they don't generally include a package declaration; when they are used as subroutine libraries they often do have a package declaration. Their last statement returns true, a simple C<1;> statement ensures that. They can be named in any way desired, but generally their extension is I<.pl>. Examples: config.pl ---------- # No package so defaults to main:: $dir = "/home/httpd/cgi-bin"; $cgi = "/cgi-bin"; 1; mysubs.pl ---------- # No package so defaults to main:: sub print_header{ print "Content-type: text/plain\r\n\r\n"; } 1; web.pl ------------ package web ; # Call like this: web::print_with_class('loud',"Don't shout!"); sub print_with_class{ my( $class, $text ) = @_ ; print qq{<span class="$class">$text</span>}; } 1; =item * Modules A file which contains perl subroutines and other code. It generally declares a package name at the beginning of it. Modules are generally used either as function libraries (which I<.pl> files are still but less commonly used for), or as object libraries where a module is used to define a class and its methods. Its last statement returns true. The naming convention requires it to have a I<.pm> extension. Example: MyModule.pm ----------- package My::Module; $My::Module::VERSION = 0.01; sub new{ return bless {}, shift;} END { print "Quitting\n"} 1; =item * Program Files Many Perl programs exist as a single file. Under Linux and other Unix-like operating systems the file often has no suffix since the operating system can determine that it is a perl script from the first line (shebang line) or if it's Apache that executes the code, there is a variety of ways to tell how and when the file should be executed. Under Windows a suffix is normally used, for example C<.pl> or C<.plx>. The program file will normally C<require()> any libraries and C<use()> any modules it requires for execution. It will contain Perl code but won't usually have any package names. Its last statement may return anything or nothing. =back =head2 require() require() reads a file containing Perl code and compiles it. Before attempting to load the file it looks up the argument in C<%INC> to see whether it has already been loaded. If it has, require() just returns without doing a thing. Otherwise an attempt will be made to load and compile the file. require() has to find the file it has to load. If the argument is a full path to the file, it just tries to read it. For example: require "/home/httpd/perl/mylibs.pl"; If the path is relative, require() will attempt to search for the file in all the directories listed in C<@INC>. For example: require "mylibs.pl"; If there is more than one occurrence of the file with the same name in the directories listed in C<@INC> the first occurrence will be used. The file must return I<TRUE> as the last statement to indicate successful execution of any initialization code. Since you never know what changes the file will go through in the future, you cannot be sure that the last statement will always return I<TRUE>. That's why the suggestion is to put "C<1;>" at the end of file. 
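To see what require() does when a file forgets that final true value,
here is a minimal sketch (I<mylib.pl> and its contents are made up
for this example, and it assumes the file sits in the current
directory, which is in C<@INC>):

  mylib.pl
  --------
  sub hello { print "Hello\n" }
  0;    # should have been 1; -- an untrue last value

  % perl -e 'require "mylib.pl"'
  mylib.pl did not return a true value at -e line 1.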
Although you should use the real filename for most files, if the file
is a L<module|guide::perl/Modules__Libraries_and_Program_Files>, you
may use the following convention instead:

  require My::Module;

This is equivalent to:

  require "My/Module.pm";

If require() fails to load the file, either because it couldn't find
the file in question, the code failed to compile, or it didn't return
I<TRUE>, then the program will die(). To prevent this, the require()
statement can be enclosed in an eval() exception-handling block, as
in this example:

  require.pl
  ----------
  #!/usr/bin/perl -w

  eval { require "/file/that/does/not/exists"};

  if ($@) {
    print "Failed to load, because : $@"
  }

  print "\nHello\n";

When we execute the program:

  % ./require.pl

  Failed to load, because : Can't locate /file/that/does/not/exists
  in @INC (@INC contains: /usr/lib/perl5/5.00503/i386-linux
  /usr/lib/perl5/5.00503 /usr/lib/perl5/site_perl/5.005/i386-linux
  /usr/lib/perl5/site_perl/5.005 .) at require.pl line 3.

  Hello

We see that the program didn't die(), because I<Hello> was printed.
This I<trick> is useful when you want to check whether a user has
some module installed, but if she hasn't it's not critical: perhaps
the program can run without this module, with reduced functionality.

If we remove the eval() part and try again:

  require1.pl
  -----------
  #!/usr/bin/perl -w

  require "/file/that/does/not/exists";
  print "\nHello\n";

  % ./require1.pl

  Can't locate /file/that/does/not/exists in @INC (@INC contains:
  /usr/lib/perl5/5.00503/i386-linux /usr/lib/perl5/5.00503
  /usr/lib/perl5/site_perl/5.005/i386-linux
  /usr/lib/perl5/site_perl/5.005 .) at require1.pl line 3.

The program just die()s in the last example, which is what you want
in most cases.

For more information refer to the perlfunc manpage.

=head2 use()

use(), just like require(), loads and compiles files containing Perl
code, but it works with
L<modules|guide::perl/Modules__Libraries_and_Program_Files> only and
is executed at compile time. The only way to pass a module to load is
by its module name and not its filename. If the module is located in
I<MyCode.pm>, the correct way to use() it is:

  use MyCode

and not:

  use "MyCode.pm"

use() translates the passed argument into a file name by replacing
C<::> with the operating system's path separator (normally C</>) and
appending I<.pm> at the end. So C<My::Module> becomes I<My/Module.pm>.

use() is exactly equivalent to:

  BEGIN { require Module; Module->import(LIST); }

Internally it calls require() to do the loading and compilation
chores. When require() finishes its job, import() is called unless
C<()> is the second argument. The following pairs are equivalent:

  use MyModule;
  BEGIN {require MyModule; MyModule->import; }

  use MyModule qw(foo bar);
  BEGIN {require MyModule; MyModule->import("foo","bar"); }

  use MyModule ();
  BEGIN {require MyModule; }

The first pair exports the default symbols. This happens if the
module sets C<@EXPORT> to a list of symbols to be exported by
default. The module's manpage normally describes what symbols are
exported by default.

The second pair exports only the symbols passed as arguments.

The third pair describes the case where the caller does not want any
symbols to be imported.

C<import()> is not a builtin function; it's just an ordinary static
method call into the "C<MyModule>" package to tell the module to
import the list of features back into the current package. See the
Exporter manpage for more information.
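Since import() is just a class method, a module can define its own
instead of inheriting Exporter's. Here is a minimal sketch (the
module C<My::Hello> and its message are invented for this example,
and it assumes I<My/Hello.pm> lives under a directory listed in
C<@INC>) showing that C<use> really is a require() followed by a
method call at compile time:

  My/Hello.pm
  -----------
  package My::Hello;

  sub import {
      my ($class, @symbols) = @_;
      # this runs at the caller's compile time, via use()
      print "import() called on $class with: @symbols\n";
  }

  1;

  % perl -e 'use My::Hello qw(foo bar)'
  import() called on My::Hello with: foo bar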
When you write your own modules, always remember that it's better to
use C<@EXPORT_OK> instead of C<@EXPORT>, since the former doesn't
export symbols unless it was asked to. Exports pollute the namespace
of the module user. Also avoid short or common symbol names to reduce
the risk of name clashes.

When functions and variables aren't exported you can still access
them using their full names, like C<$My::Module::bar> or
C<My::Module::foo()>. By convention you can use a leading underscore
on names to informally indicate that they are I<internal> and not for
public use.

There's a corresponding "C<no>" command that un-imports symbols
imported by C<use>, i.e., it calls C<Module-E<gt>unimport(LIST)>
instead of C<import()>.

=head2 do()

While do() behaves almost identically to require(), it reloads the
file unconditionally. It doesn't check C<%INC> to see whether the
file was already loaded.

If do() cannot read the file, it returns C<undef> and sets C<$!> to
report the error. If do() can read the file but cannot compile it, it
returns C<undef> and puts an error message in C<$@>. If the file is
successfully compiled, do() returns the value of the last expression
evaluated.

=head1 Using Global Variables and Sharing Them Between Modules/Packages

It helps when you code your application in a structured way, using
Perl packages, but as you probably know, once you start using
packages it's much harder to share variables between the various
packages. A configuration package comes to mind as a good example of
a package that wants its variables to be accessible from other
modules.

Of course, using Object Oriented (OO) programming is the best way to
provide access to variables through accessor methods. But if you are
not yet ready for OO techniques, you can still benefit from using the
techniques we are going to talk about.

=head2 Making Variables Global

When you first write C<$x> in your code, you create a (package)
global variable. It is visible everywhere in your program, although
if used in a package other than the package in which it was declared
(C<main::> by default), it must be referred to with its fully
qualified name, unless you have imported this variable with import().
This will work only if you do not use the C<strict> pragma; but you
I<have> to use this pragma if you want to run your scripts under
mod_perl. Read L<The strict pragma|guide::porting/The_strict_pragma>
to find out why.

=head2 Making Variables Global With strict Pragma On

First you use:

  use strict;

Then you use:

  use vars qw($scalar %hash @array);

This declares the named variables as package globals in the current
package. They may be referred to within the same file and package
with their unqualified names; and in different files/packages with
their fully qualified names.

With perl5.6 you can use the C<our> operator instead:

  our($scalar, %hash, @array);

If you want to share package global variables between packages, here
is what you can do.

=head2 Using Exporter.pm to Share Global Variables

Assume that you want to share the C<CGI.pm> object (I will use C<$q>)
between your modules. For example, you create it in C<script.pl>, but
you want it to be visible in C<My::HTML>. First, you make C<$q>
global.

  script.pl:
  ----------------
  use vars qw($q);
  use CGI;
  use lib qw(.);
  use My::HTML qw($q); # My/HTML.pm is in the same dir as script.pl
  $q = CGI->new;

  My::HTML::printmyheader();

Note that we have imported C<$q> from C<My::HTML>.
And C<My::HTML> does the export of C<$q>: My/HTML.pm ---------------- package My::HTML; use strict; BEGIN { use Exporter (); @My::HTML::ISA = qw(Exporter); @My::HTML::EXPORT = qw(); @My::HTML::EXPORT_OK = qw($q); } use vars qw($q); sub printmyheader{ # Whatever you want to do with $q... e.g. print $q->header(); } 1; So the C<$q> is shared between the C<My::HTML> package and C<script.pl>. It will work vice versa as well, if you create the object in C<My::HTML> but use it in C<script.pl>. You have true sharing, since if you change C<$q> in C<script.pl>, it will be changed in C<My::HTML> as well. What if you need to share C<$q> between more than two packages? For example you want My::Doc to share C<$q> as well. You leave C<My::HTML> untouched, and modify I<script.pl> to include: use My::Doc qw($q); Then you add the same C<Exporter> code that we used in C<My::HTML>, into C<My::Doc>, so that it also exports C<$q>. One possible pitfall is when you want to use C<My::Doc> in both C<My::HTML> and I<script.pl>. Only if you add use My::Doc qw($q); into C<My::HTML> will C<$q> be shared. Otherwise C<My::Doc> will not share C<$q> any more. To make things clear here is the code: script.pl: ---------------- use vars qw($q); use CGI; use lib qw(.); use My::HTML qw($q); # My/HTML.pm is in the same dir as script.pl use My::Doc qw($q); # Ditto $q = new CGI; My::HTML::printmyheader(); My/HTML.pm ---------------- package My::HTML; use strict; BEGIN { use Exporter (); @My::HTML::ISA = qw(Exporter); @My::HTML::EXPORT = qw(); @My::HTML::EXPORT_OK = qw($q); } use vars qw($q); use My::Doc qw($q); sub printmyheader{ # Whatever you want to do with $q... e.g. print $q->header(); My::Doc::printtitle('Guide'); } 1; My/Doc.pm ---------------- package My::Doc; use strict; BEGIN { use Exporter (); @My::Doc::ISA = qw(Exporter); @My::Doc::EXPORT = qw(); @My::Doc::EXPORT_OK = qw($q); } use vars qw($q); sub printtitle{ my $title = shift || 'None'; print $q->h1($title); } 1; =head2 Using the Perl Aliasing Feature to Share Global Variables As the title says you can import a variable into a script or module without using C<Exporter.pm>. I have found it useful to keep all the configuration variables in one module C<My::Config>. But then I have to export all the variables in order to use them in other modules, which is bad for two reasons: polluting other packages' name spaces with extra tags which increases the memory requirements; and adding the overhead of keeping track of what variables should be exported from the configuration module and what imported, for some particular package. I solve this problem by keeping all the variables in one hash C<%c> and exporting that. Here is an example of C<My::Config>: package My::Config; use strict; use vars qw(%c); %c = ( # All the configs go here scalar_var => 5, array_var => [qw(foo bar)], hash_var => { foo => 'Foo', bar => 'BARRR', }, ); 1; Now in packages that want to use the configuration variables I have either to use the fully qualified names like C<$My::Config::test>, which I dislike or import them as described in the previous section. But hey, since we have only one variable to handle, we can make things even simpler and save the loading of the C<Exporter.pm> package. 
We will use the Perl aliasing feature for exporting and saving the
keystrokes:

  package My::HTML;
  use strict;
  use lib qw(.);
    # Global Configuration now aliased to global %c
  use My::Config (); # My/Config.pm in the same dir as script.pl
  use vars qw(%c);
  *c = \%My::Config::c;

  # Now you can access the variables from the My::Config
  print $c{scalar_var};
  print $c{array_var}[0];
  print $c{hash_var}{foo};

Of course C<%c> is global everywhere you use it as described above,
and if you change it somewhere it will affect any other packages you
have aliased C<%My::Config::c> to.

Note that aliases work either with global or C<local()> vars - you
cannot write:

  my *c = \%My::Config::c; # ERROR!

which is an error. But you can write:

  local *c = \%My::Config::c;

For more information about aliasing, refer to the Camel book, second
edition, pages 51-52.

=head2 Using Non-Hardcoded Configuration Module Names

You have just seen how to use a configuration module for
configuration centralization and easy access to the information
stored in it. However, there is somewhat of a chicken-and-egg
problem--how to let your other modules know the name of this file?
Hardcoding the name is brittle--if you have only a single project it
should be fine, but if you have more projects which use different
configurations and you want to reuse their code, you will have to
find all instances of the hardcoded name and replace them.

Another solution could be to use the same name for the configuration
module, like C<My::Config>, but to put a different copy of it into
each location. That won't work under mod_perl, though, because of
namespace collision: you cannot load different modules which use the
same name; only the first one will be loaded.

Luckily, there is another solution which allows us to stay flexible.
C<PerlSetVar> comes to the rescue. Just like with environment
variables, you can set the server's global Perl variables which can
be retrieved from any module and script. Those statements are placed
into the I<httpd.conf> file. For example:

  PerlSetVar FooBaseDir       /home/httpd/foo
  PerlSetVar FooConfigModule  Foo::Config

Now we require() the file where the above configuration will be used.
  PerlRequire /home/httpd/perl/startup.pl

In the I<startup.pl> we might have the following code:

    # retrieve the configuration module path
  use Apache;
  my $s             = Apache->server;
  my $base_dir      = $s->dir_config('FooBaseDir')      || '';
  my $config_module = $s->dir_config('FooConfigModule') || '';
  die "FooBaseDir and FooConfigModule aren't set in httpd.conf"
    unless $base_dir and $config_module;

    # build the real path to the config module
  my $path = "$base_dir/$config_module";
  $path =~ s|::|/|g;
  $path .= ".pm";
    # we have something like "/home/httpd/foo/Foo/Config.pm"

    # now we can pull in the configuration module
  require $path;

Now we know the module name and it's loaded, so for example if we
need to use some variables stored in this module to open a database
connection, we will do:

  Apache::DBI->connect_on_init
  ("DBI:mysql:${$config_module.'::DB_NAME'}::${$config_module.'::SERVER'}",
   ${$config_module.'::USER'},
   ${$config_module.'::USER_PASSWD'},
   {
    PrintError => 1, # warn() on errors
    RaiseError => 0, # don't die on error
    AutoCommit => 1, # commit executes immediately
   }
  );

Where variables like:

  ${$config_module.'::USER'}

in our example are really:

  $Foo::Config::USER

If you want to access these variables from within your code at run
time, instead of accessing the server object C<$s>, use the request
object C<$r>:

  my $r = shift;
  my $base_dir      = $r->dir_config('FooBaseDir')      || '';
  my $config_module = $r->dir_config('FooConfigModule') || '';

=head1 The Scope of the Special Perl Variables

Special Perl variables like C<$|> (buffering), C<$^T> (script's start
time), C<$^W> (warnings mode), C<$/> (input record separator), C<$\>
(output record separator) and many more are all true global
variables; they do not belong to any particular package (not even
C<main::>) and are universally available. This means that if you
change them, you change them anywhere across the entire program;
furthermore you cannot scope them with my(). However you can
local()ize them, which means that any changes you apply will only
last until the end of the enclosing scope.

In the mod_perl situation, where the child server doesn't usually
exit, if you modify a global variable in one of your scripts it will
be changed for the rest of the process' life and will affect all the
scripts executed by the same process. Therefore localizing these
variables is highly recommended, I'd say mandatory.

We will demonstrate the case using the input record separator
variable. If you undefine this variable, the diamond operator
(readline) will suck in the whole file at once if you have enough
memory. Remembering this, you should never write code like the
example below.

  $/ = undef; # BAD!
  open IN, "file" ....
  # slurp it all into a variable
  $all_the_file = <IN>;

The proper way is to have a local() keyword before the special
variable is changed, like this:

  local $/ = undef;
  open IN, "file" ....
  # slurp it all inside a variable
  $all_the_file = <IN>;

But there is a catch. local() will propagate the changed value to the
code below it. The modified value will be in effect until the script
terminates, unless it is changed again somewhere else in the script.

A cleaner approach is to enclose the whole of the code that is
affected by the modified variable in a block, like this:

  {
    local $/ = undef;
    open IN, "file" ....
    # slurp it all inside a variable
    $all_the_file = <IN>;
  }

That way when Perl leaves the block it restores the original value of
the C<$/> variable, and you don't need to worry elsewhere in your
program about its value being changed here.
Note that if you call a subroutine after you've set a global variable but within the enclosing block, the global variable will be visible with its new value inside the subroutine. =head1 Compiled Regular Expressions When using a regular expression that contains an interpolated Perl variable, if it is known that the variable (or variables) will not change during the execution of the program, a standard optimization technique is to add the C</o> modifier to the regex pattern. This directs the compiler to build the internal table once, for the entire lifetime of the script, rather than every time the pattern is executed. Consider: my $pat = '^foo$'; # likely to be input from an HTML form field foreach( @list ) { print if /$pat/o; } This is usually a big win in loops over lists, or when using the C<grep()> or C<map()> operators. In long-lived mod_perl scripts, however, the variable may change with each invocation and this can pose a problem. The first invocation of a fresh httpd child will compile the regex and perform the search correctly. However, all subsequent uses by that child will continue to match the original pattern, regardless of the current contents of the Perl variables the pattern is supposed to depend on. Your script will appear to be broken. There are two solutions to this problem: The first is to use C<eval q//>, to force the code to be evaluated each time. Just make sure that the eval block covers the entire loop of processing, and not just the pattern match itself. The above code fragment would be rewritten as: my $pat = '^foo$'; eval q{ foreach( @list ) { print if /$pat/o; } } Just saying: foreach( @list ) { eval q{ print if /$pat/o; }; } means that we recompile the regex for every element in the list even though the regex doesn't change. You can use this approach if you require more than one pattern match operator in a given section of code. If the section contains only one operator (be it an C<m//> or C<s///>), you can rely on the property of the null pattern, that reuses the last pattern seen. This leads to the second solution, which also eliminates the use of eval. The above code fragment becomes: my $pat = '^foo$'; "something" =~ /$pat/; # dummy match (MUST NOT FAIL!) foreach( @list ) { print if //; } The only gotcha is that the dummy match that boots the regular expression engine must absolutely, positively succeed, otherwise the pattern will not be cached, and the C<//> will match everything. If you can't count on fixed text to ensure the match succeeds, you have two possibilities. If you can guarantee that the pattern variable contains no meta-characters (things like *, +, ^, $...), you can use the dummy match: $pat =~ /\Q$pat\E/; # guaranteed if no meta-characters present If there is a possibility that the pattern can contain meta-characters, you should search for the pattern or the non-searchable \377 character as follows: "\377" =~ /$pat|^\377$/; # guaranteed if meta-characters present Another approach: It depends on the complexity of the regex to which you apply this technique. One common usage where a compiled regex is usually more efficient is to "I<match any one of a group of patterns>" over and over again. Maybe with a helper routine, it's easier to remember. Here is one slightly modified from Jeffery Friedl's example in his book "I<Mastering Regular Expressions>". 
  #####################################################
  # Build_MatchMany_Function
  # -- Input:  list of patterns
  # -- Output: A code ref which matches its $_[0]
  #            against ANY of the patterns given in the
  #            "Input", efficiently.
  #
  sub Build_MatchMany_Function {
    my @R = @_;
    my $expr = join '||', map { "\$_[0] =~ m/\$R[$_]/o" } ( 0..$#R );
    my $matchsub = eval "sub { $expr }";
    die "Failed in building regex @R: $@" if $@;
    $matchsub;
  }

Example usage:

  @some_browsers = qw(Mozilla Lynx MSIE AmigaVoyager lwp libwww);
  $Known_Browser = Build_MatchMany_Function(@some_browsers);

  while (<ACCESS_LOG>) {
    # ...
    $browser = get_browser_field($_);
    if ( ! &$Known_Browser($browser) ) {
      print STDERR "Unknown Browser: $browser\n";
    }
    # ...
  }

And of course you can use the qr() operator, which makes the code even
more efficient:

  my $pat = '^foo$';
  my $re  = qr($pat);
  foreach( @list ) {
    print if /$re/;
  }

The qr() operator compiles the pattern once per request, and the
compiled version is then reused in the actual match.

=head1 Exception Handling for mod_perl

Here are some guidelines for S<clean(er)> exception handling in
mod_perl, although the technique presented can be applied to all of
your Perl programming.

The reasoning behind this document is the current broken status of
C<$SIG{__DIE__}> in the perl core - see both the perl5-porters and the
mod_perl mailing list archives for details on this discussion.  (It's
broken in at least Perl v5.6.0 and probably in later versions as
well.)  In short summary, C<$SIG{__DIE__}> is a little bit too global,
and catches exceptions even when you want to catch them yourself,
using an C<eval{}> block.

=head2 Trapping Exceptions in Perl

To trap an exception in Perl we use the C<eval{}> construct.  Many
people initially make the mistake of assuming that this is the same as
the C<eval EXPR> construct, which compiles and executes code at run
time, but that's not the case.  C<eval{}> compiles at compile time,
just like the rest of your code, and has next to zero run-time
penalty.  For the hardcore C programmers among you, it uses the
C<setjmp/longjmp> POSIX routines internally, just like C++ exceptions.

When in an eval block, if the code being executed die()'s for any
reason, an exception is thrown.  This exception can be caught by
examining the C<$@> variable immediately after the eval block; if
C<$@> is true then an exception occurred and C<$@> contains the
exception in the form of a string.  The full construct looks like
this:

  eval {
    # Some code here
  }; # Note important semi-colon there
  if ($@) # $@ contains the exception that was thrown
  {
    # Do something with the exception
  }
  else # optional
  {
    # No exception was thrown
  }

Most of the time when you see these exception handlers there is no
else block, because it tends to be OK if the code didn't throw an
exception.
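For example, here is a minimal sketch of trapping a failure and
recovering from it instead of letting the request abort; the file name
in C<$filename> and the empty-string fallback are made up purely for
illustration:

  my $content = eval {
      open IN, "<$filename"
          or die "Can't open $filename: $!\n";  # throws the exception
      local $/;                                 # slurp mode, only inside this block
      my $data = <IN>;
      close IN;
      $data;                                    # value of the eval block on success
  };
  if ($@) {
      # any die() inside the block lands here as a string in $@
      warn "falling back to an empty page: $@";
      $content = '';
  }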
Perl's exception handling is similar to that of other languages,
though it may not seem so at first sight:

  Perl                               Other language
  ---------------------------------  ------------------------------------
  eval {                             try {
    # execute here                     // execute here
    # raise our own exception:         // raise our own exception:
    die "Oops" if /error/;             if(error==1){throw Exception.Oops;}
    # execute more                     // execute more
  } ;                                }
  if($@) {                           catch {
    # handle exceptions                switch( Exception.id ) {
    if( $@ =~ /Fail/ ) {                 Fail : fprintf( stderr, "Failed\n" ) ;
      print "Failed\n" ;                        break ;
    }
    elsif( $@ =~ /Oops/ ) {              Oops : throw Exception ;
      # Pass it up the chain
      die if $@ =~ /Oops/;
    } else {                             default :
      # handle all other                   # handle all other
    }                                      # exceptions here
  }                                    }
                                     // If we got here all is OK or handled
  else {  # optional
    # all is well
  }
  # all is well or has been handled

=head2 Alternative Exception Handling Techniques

An often suggested method for handling global exceptions in mod_perl,
and in other perl programs in general, is a B<__DIE__> handler, which
can be set up by either assigning a function name as a string to
C<$SIG{__DIE__}> (not particularly recommended, because of possible
namespace clashes) or assigning a code reference to C<$SIG{__DIE__}>.
The usual way of doing so is to use an anonymous subroutine:

  $SIG{__DIE__} = sub { print "Eek - we died with:\n", $_[0]; };

The problem with this is that C<$SIG{__DIE__}> is a global setting in
your script, so while you can potentially hide away your exceptions in
some external module, the execution of C<$SIG{__DIE__}> is fairly
magical, and interferes not just with your code, but with all code in
every module you import.

Beyond the magic involved, C<$SIG{__DIE__}> actually interferes with
perl's normal exception handling mechanism, the C<eval{}> construct.
Witness:

  $SIG{__DIE__} = sub { print "handler\n"; };

  eval {
    print "In eval\n";
    die "Failed for some reason\n";
  };
  if ($@) {
    print "Caught exception: $@";
  }

The code unfortunately prints out:

  In eval
  handler
  Caught exception: Failed for some reason

The C<handler> line is the surprise: our C<$SIG{__DIE__}> handler was
called even though we were catching the exception ourselves.  That
isn't quite what you would expect, especially if that
C<$SIG{__DIE__}> handler is hidden away deep in some other module that
you didn't know about.

There are workarounds, however.  One is to localize C<$SIG{__DIE__}>
in every exception trap you write:

  eval {
    local $SIG{__DIE__};
    ...
  };

Obviously this just doesn't scale - you don't want to be doing that
for every exception trap in your code, and it's a slowdown.

A second workaround is to check in your handler whether you are inside
an eval, and thus trying to catch the exception yourself:

  $SIG{__DIE__} = sub {
    die $_[0] if $^S;
    print "handler\n";
  };

However this won't work under C<Apache::Registry> - you're always in
an eval block there!

The other problem with C<$SIG{__DIE__}> also relates to its global
nature.  Because you might have more than one application running
under mod_perl, you can't be sure which has set a C<$SIG{__DIE__}>
handler when and for what.  This can become extremely confusing when
you start scaling up from a set of simple registry scripts that might
rely on CGI::Carp for global exception handling (which uses
C<$SIG{__DIE__}> to trap exceptions) to having many applications
installed with a variety of exception handling mechanisms in place.

You should warn people about this danger of C<$SIG{__DIE__}> and
inform them of better ways to code.  The following material is an
attempt to do just that.

=head2 Better Exception Handling

The C<eval{}> construct in itself is a fairly weak way to handle
exceptions as strings.
There's no way to pass more information in your exception, so you have
to handle your exception in more than one place - at the location the
error occurred, in order to construct a sensible error message, and
again in your exception handler to de-construct that string into
something meaningful (unless of course all you want your exception
handler to do is dump the error to the browser).  The other problem is
that you have no way of automatically detecting where the exception
occurred using the C<eval{}> construct.  In a C<$SIG{__DIE__}> block
you always have the use of the caller() function to detect where the
error occurred.  But we can fix that...

A little known fact about exceptions in perl 5.005 is that you can
call die with an object.  The exception handler receives that object
in C<$@>.  This is how you are advised to handle exceptions now, as it
provides an extremely flexible and scalable exceptions solution,
potentially providing almost all of the power of Java exceptions.

[As a footnote here, the only thing that is really missing from Java
exceptions is a guaranteed "finally" clause, although it's possible to
get about 98.62% of the way towards providing that using C<eval{}>.]

=head3 A Little Housekeeping

First though, before we delve into the details, a little housekeeping
is in order.  Most, if not all, mod_perl programs consist of a main
routine that is entered and which then dispatches itself to a routine
depending on the parameters passed and/or the form values.  In a
normal C program this is your main() function, in a mod_perl handler
this is your handler() function/method.  The exception to this rule
seems to be Apache::Registry scripts, although the techniques
described here can be easily adapted.

In order for you to be able to use exception handling to its best
advantage you need to change your script to have some sort of global
exception handling.  This is far simpler than it sounds.  If you're
using C<Apache::Registry> to emulate CGI you might consider wrapping
your entire script in one big eval block, but I would discourage that.
A better method would be to modularize your script into discrete
function calls, one of which should be a dispatch routine:

  #!/usr/bin/perl -w
  # Apache::Registry script

  eval {
    dispatch();
  };
  if ($@) {
    # handle exception
  }

  sub dispatch {
    ...
  }

This is easier with an ordinary mod_perl handler, as it is natural to
have separate functions, rather than one long run-on script:

  MyHandler.pm
  ------------
  sub handler {
    my $r = shift;

    eval {
      dispatch($r);
    };
    if ($@) {
      # handle exception
    }
  }

  sub dispatch {
    my $r = shift;
    ...
  }

Now that the skeleton code is set up, let's create an exception class,
making use of Perl 5.005's ability to throw exception objects.

=head3 An Exception Class

This is a really simple exception class, which does nothing but
contain information.  A better implementation would probably also
handle its own exception conditions, but that would be more complex,
requiring separate packages for each exception type.

  My/Exception.pm
  ---------------
  package My::Exception;

  sub AUTOLOAD {
    no strict 'refs', 'subs';
    if ($AUTOLOAD =~ /.*::([A-Z]\w+)$/) {
      my $exception = $1;
      *{$AUTOLOAD} =
        sub {
          shift;
          my ($package, $filename, $line) = caller;
          push @_, caller => {
            package  => $package,
            filename => $filename,
            line     => $line,
          };
          bless { @_ }, "My::Exception::$exception";
        };
      goto &{$AUTOLOAD};
    }
    else {
      die "No such exception class: $AUTOLOAD\n";
    }
  }

  1;

OK, so this is all highly magical, but what does it do?
It creates a simple package that we can import and use as follows:

  use My::Exception;
  die My::Exception->SomeException( foo => "bar" );

The exception class tracks exactly where we died from, using the
caller() mechanism; it also caches exception classes so that
C<AUTOLOAD> is only called the first time (in a given process) an
exception of a particular type is thrown (particularly relevant under
mod_perl).

=head2 Catching Uncaught Exceptions

What about exceptions that are thrown outside of your control?  We can
fix this using one of two possible methods.  The first is to override
die globally using the old magical C<$SIG{__DIE__}>, and the second is
the cleaner, non-magical method of overriding the core die() function
with your own die() that throws an exception that makes sense to your
application.

=head3 Using $SIG{__DIE__}

Overloading using C<$SIG{__DIE__}> in this case is rather simple;
here's some code:

  $SIG{__DIE__} = sub {
      # wrap plain string exceptions in our class and re-throw;
      # objects are re-thrown unchanged
      if (!ref($_[0])) {
          die My::Exception->UnCaught(text => join('', @_));
      }
      die $_[0];
  };

All this does is catch your exception and re-throw it.  It's not as
dangerous as we stated earlier that C<$SIG{__DIE__}> can be, because
we're actually re-throwing the exception, rather than catching it and
stopping there.  Even though C<$SIG{__DIE__}> is a global handler,
because we are simply re-throwing the exception we can let other
applications outside of our control simply catch the exception and not
worry about it.

There's only one slight buggette left, and that's if some external
code catches the exception thrown by its own die() and tries to do
string comparisons on it, as in:

  eval {
    ... # some code
    die "FATAL ERROR!\n";
  };
  if ($@) {
    if ($@ =~ /^FATAL ERROR/) {
      die $@;
    }
  }

In order to deal with this, we can overload stringification for our
C<My::Exception::UnCaught> class:

  {
    package My::Exception::UnCaught;
    use overload '""' => \&str;

    sub str {
      shift->{text};
    }
  }

We can now let other code happily continue.  Note that there is a bug
in Perl 5.6 which may affect people here: stringification does not
occur when an object is operated on by a regular expression (via the
=~ operator).  A workaround is to explicitly stringify using qq double
quotes; however, that doesn't help the poor soul who is using other
applications.  This bug has been fixed in later versions of Perl.

=head3 Overriding the Core die() Function

So what if we don't want to touch C<$SIG{__DIE__}> at all?  We can
overcome this by overriding the core die function.  This is slightly
more complex than implementing a C<$SIG{__DIE__}> handler, but is far
less magical, and is the right thing to do, according to the
L<perl5-porters mailing list|guide::help/Get_help_with_Perl>.

Overriding core functions has to be done from an external
package/module.  So we're going to add that to our C<My::Exception>
module.  Here are the relevant parts:

  use vars qw/@ISA @EXPORT/;
  use Exporter;

  @EXPORT = qw/die/;
  @ISA = 'Exporter';

  sub die (@); # prototype to match CORE::die

  sub import {
      my $pkg = shift;
      $pkg->export('CORE::GLOBAL', 'die');
      Exporter::import($pkg,@_);
  }

  sub die (@) {
      if (!ref($_[0])) {
          CORE::die My::Exception->UnCaught(text => join('', @_));
      }
      CORE::die $_[0]; # only use the first element because it's an object
  }

That wasn't so bad, was it?  We're relying on Exporter's export()
function to do the hard work for us, exporting the die() function into
the C<CORE::GLOBAL> namespace.  If we don't want to overload die()
everywhere, this can still be an extremely useful technique.
By just using Exporter's default import() method we can export our new
die() method into any package of our choosing.  This allows us to
short-cut the long calling convention and simply die() with a string,
and let the system handle the actual construction into an object for
us.

Along with the above overloaded stringification, we now have a
complete exception system (well, mostly complete.  Exception die-hards
would argue that there's no "finally" clause, and no exception stack,
but that's another topic for another time).

=head2 A Single UnCaught Exception Class

Until the Perl core gets its own base exception class (which will
likely happen for Perl 6, but not sooner), it is vitally important
that you decide upon a single base exception class for all of the
applications that you install on your server, and a single exception
handling technique.

The problem comes when you have multiple applications all doing
exception handling and all expecting a certain type of "UnCaught"
exception class.  Witness the following application:

  package Foo;

  eval {
    # do something
  };
  if ($@) {
    if ($@->isa('Foo::Exception::Bar')) {
      # handle "Bar" exception
    }
    elsif ($@->isa('Foo::Exception::UnCaught')) {
      # handle uncaught exceptions
    }
  }

All will work well until someone installs application "TrapMe" on the
same machine, which installs its own UnCaught exception handler,
overloading CORE::GLOBAL::die or installing a $SIG{__DIE__} handler.
This is actually a case where using $SIG{__DIE__} might be preferable,
because you can change your handler() routine to look like this:

  sub handler {
    my $r = shift;

    local $SIG{__DIE__};
    Foo::Exception->Init(); # sets $SIG{__DIE__}

    eval {
      dispatch($r);
    };
    if ($@) {
      # handle exception
    }
  }

  sub dispatch {
    my $r = shift;
    ...
  }

In this case the fact that $SIG{__DIE__} can be local()ized to the
current handler has helped us, something we couldn't achieve by
overloading CORE::GLOBAL::die.  However there is still a gotcha.  If
someone has overloaded die() in one of the applications installed on
your mod_perl machine, you still get the same problems.  So in short:
watch out, and check the source code of anything you install to make
sure it follows your exception handling technique, or just uses die()
with strings.

=head2 Some Uses

I'm going to come right out and say now: I abuse this system horribly!
I throw exceptions all over my code, not because I've hit an
"exceptional" bit of code, but because I want to get straight back out
of the current call stack without having every single level of
function call check error codes.

One way I use this is to return Apache return codes:

  # paranoid security check
  die My::Exception->RetCode(code => 204);

This returns a 204 status code (C<HTTP_NO_CONTENT>), which is caught
at my top level exception handler:

  if ($@->isa('My::Exception::RetCode')) {
    return $@->{code};
  }

That last return statement is in my handler() method, so that's the
return code that Apache actually sends.  I have other exception
handlers in place for sending Basic Authentication headers and
Redirect headers out.  I also have a generic C<My::Exception::OK>
class, which gives me a way to back out completely from where I am,
but register that as an OK thing to do.

Why do I go to these extents?  After all, code like slashcode (the
code behind http://slashdot.org) doesn't need this sort of thing, so
why should my web site?  Well, it's just a matter of scalability and
programmer style really.
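To make the top-level catch concrete, here is a hedged sketch of what
such a handler() might look like.  The C<My::Exception::RetCode> and
C<My::Exception::Redirect> class names follow the AUTOLOAD convention
shown earlier, but the exact fields (C<code>, C<url>) and the
dispatch() routine are illustrative assumptions only:

  use Apache::Constants qw(OK REDIRECT SERVER_ERROR);

  sub handler {
      my $r = shift;

      eval { dispatch($r) };
      return OK unless $@;                   # no exception - normal response

      if (ref $@ && $@->isa('My::Exception::RetCode')) {
          return $@->{code};                 # e.g. 204, sent as-is by Apache
      }
      if (ref $@ && $@->isa('My::Exception::Redirect')) {
          $r->header_out(Location => $@->{url});
          return REDIRECT;
      }

      $r->log_error("unhandled exception: $@");
      return SERVER_ERROR;                   # anything we don't recognize
  }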
There's a lot of literature out there about exception handling, so I
suggest doing some research.

=head2 Conclusions

Here I've demonstrated a simple, scalable (and useful) exception
handling mechanism that fits perfectly with your current code and
provides the programmer with an excellent means to determine what has
happened in his code.

Some users might be worried about the overhead of such code.  However,
in use I've found accessing the database to be a much more significant
overhead, and this is used in some code delivering to thousands of
users.

For similar exception handling techniques, see the section
L<Other Implementations|guide::perl/Other_Implementations>.

=head2 The My::Exception class in its entirety

  package My::Exception;

  use vars qw/@ISA @EXPORT $AUTOLOAD/;
  use Exporter;
  @ISA = 'Exporter';
  @EXPORT = qw/die/;

  sub die (@);

  sub import {
      my $pkg = shift;
      # allow "use My::Exception 'die';" to mean import locally only
      $pkg->export('CORE::GLOBAL', 'die') unless @_;
      Exporter::import($pkg,@_);
  }

  sub die (@) {
      if (!ref($_[0])) {
          CORE::die My::Exception->UnCaught(text => join('', @_));
      }
      CORE::die $_[0];
  }

  {
      package My::Exception::UnCaught;
      use overload '""' => sub { shift->{text} };
  }

  sub AUTOLOAD {
      no strict 'refs', 'subs';
      if ($AUTOLOAD =~ /.*::([A-Z]\w+)$/) {
          my $exception = $1;
          *{$AUTOLOAD} =
            sub {
                shift;
                my ($package, $filename, $line) = caller;
                push @_, caller => {
                    package  => $package,
                    filename => $filename,
                    line     => $line,
                };
                bless { @_ }, "My::Exception::$exception";
            };
          goto &{$AUTOLOAD};
      }
      else {
          die "No such exception class: $AUTOLOAD\n";
      }
  }

  1;

=head2 Other Implementations

Some users might find it very useful to have the more C++/Java-like
interface of try/catch functions.  These are available in several
forms that all work in slightly different ways.  See the documentation
for each module for details:

=over

=item * Error.pm

Graham Barr's excellent OO styled "try, throw, catch" module (from
L<CPAN|guide::download/Perl>).  This should be considered your best
option for structured exception handling because it is well known,
well supported and used by a lot of other applications.

=item * Exception::Class and Devel::StackTrace

By Dave Rolsky, both available from CPAN of course.

C<Exception::Class> is a bit cleaner than the C<AUTOLOAD> method from
above as it can catch typos in exception class names, whereas the
method above will automatically create a new class for you.  In
addition, it lets you create actual class hierarchies for your
exceptions, which can be useful if you want to create exception
classes that provide extra methods or data.  For example, an exception
class for database errors could provide a method for returning the SQL
and bound parameters in use at the time of the error.

=item * Try.pm

Tony Olekshy's try/catch module.  Adds an unwind stack and some other
interesting features.  Not on the CPAN.  Available at
http://www.avrasoft.com/perl/rfc/try-1136.zip

=back

=head1 Maintainers

Maintainer is the person(s) you should contact with updates,
corrections and patches.

=over

=item * Stas Bekman E<lt>stas (at) stason.orgE<gt>

=back

=head1 Authors

=over

=item * Stas Bekman E<lt>stas (at) stason.orgE<gt>

=item * Matt Sergeant E<lt>matt (at) sergeant.orgE<gt>

=back

Only the major authors are listed above.  For contributors see the
Changes file.

=cut