brian 96/12/08 20:49:14
Modified:    htdocs/manual/misc howto.html
Log:
Obtained from: Rob Hartill, with some stuff by Brian Behlendorf

Rob modified the section on robot detection and I modified the section
on redirecting an entire server to include a section on mod_rewrite

Revision  Changes    Path
1.2       +102 -86   apache/htdocs/manual/misc/howto.html

Index: howto.html
===================================================================
RCS file: /export/home/cvs/apache/htdocs/manual/misc/howto.html,v
retrieving revision 1.1
retrieving revision 1.2
diff -C3 -r1.1 -r1.2
*** howto.html	1996/12/01 17:07:17	1.1
--- howto.html	1996/12/09 04:49:12	1.2
***************
*** 1,123 ****
  <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
  <HTML>
  <HEAD>
  <TITLE>Apache HOWTO documentation</TITLE>
  </HEAD>
  <BODY>
  <!--#include virtual="header.html" -->
! <H1>Apache HOWTO documentation</h1>
  How to:
  <ul>
! <li><A HREF="#redirect">redirect an entire server or directory</A>
  <li><A HREF="#logreset">reset your log files</A>
! <li><A HREF="#stoprob">stop robots</A>
  </ul>
! <hr>
! <H2><A name="redirect">How to redirect an entire server or directory</A></H2>
! One way to redirect all requests for an entire server is to setup a
! <CODE>Redirect</Code> to a <B>cgi script</B> which outputs a 301 or 302 status
! and the location of the other server.<P>
!
! By using a <B>cgi-script</B> you can intercept various requests and treat them
! specially, e.g. you might want to intercept <B>POST</B> requests, so that the
! client isn't redirected to a script on the other server which expects POST
! information (a redirect will lose the POST information.)<P>
!
! Here's how to redirect all requests to a script... In the server configuration
! file,
! <blockquote><code>ScriptAlias /
! /usr/local/httpd/cgi-bin/redirect_script</code></blockquote>
! and here's a simple perl script to redirect
! <blockquote><code>
! #!/usr/local/bin/perl <br>
! <br>
! print "Status: 302 Moved Temporarily\r <br>
! Location: http://www.some.where.else.com/\r\n\r\n"; <br>
! <br>
! </code></blockquote><p><hr>
  <H2><A name="logreset">How to reset your log files</A></H2>
! Sooner or later, you'll want to reset your log files (access_log and
  error_log) because they are too big, or full of old information you don't
! need.<p>
! <CODE>access.log</CODE> typically grows by 1Mb for each 10,000 requests.<p>
! Most people's first attempt at replacingthe logfile is to just move the
! logfile or remove the logfile. This doesn't work.<p>
! Apache will continue writing to the logfile at the same offset as before the
! logifile moved. This results in a new logfile being created which is just as
  big as the old one, but it now contains thousands (or millions) of null
! characters.<p>
! The correct procedure is to move the logfile, then signal Apache to tell it to
! reopen the logfiles.<p>
! Apache is signalled using the <B>SIGHUP</B> (-1) signal. e.g.
  <blockquote><code>
! mv access_log access_log.old ; kill -1 `cat httpd.pid`
  </code></blockquote>
! Note: <code>httpd.pid</code> is a file containing the <B>p</B>rocess <B>id</B>
  of the Apache httpd daemon, Apache saves this in the same directory as the log
! files.<P>
! Many people use this method to replace (and backup) their logfiles on a
! nightly basis.<p><hr>
! <H2><A name="stoprob">How to stop robots</A></H2>
! Ever wondered why so many clients are interested in a file called
! <code>robots.txt</code> which you don't have, and never did have?<p>
!
! These clients are called <B>robots</B> - special automated clients which
! wander around the web looking for interesting resources.<p>
!
! Most robots are used to generate some kind of <em>web index</em> which
! is then used by a <em>search engine</em> to help locate information.<P>
!
! <code>robots.txt</code> provides a means to request that robots limit their
! activities at the site, or more often than not, to leave the site alone.<P>
!
! When the first robots were developed, they had a bad reputation for
! sending hundreds of requests to each site, often resulting in the site
! being overloaded. Things have improved dramatically since then, thanks
! to <A HREF="http://web.nexor.co.uk/mak/doc/robots/guidelines.html"> Guidlines
! for Robot Writers</A>, but even so, some robots may exhibit unfriendly
! behaviour which the webmaster isn't willing to tolerate.<P>
!
! Another reason some webmasters want to block access to robots, results
! from the way in which the information collected by the robots is subsequently
! indexed. <B>There are currently no well used systems to annotate documents
! such that they can be indexed by wandering robots.</B> Hence, the index
! writer will often revert to unsatisfactory algorithms to determine what gets
! indexed.<p>
!
! Typically, indexes are built around text which appears in
! document titles (<TITLE>), or main headings (<H1>), and more
! often than not, the words it indexes on are completely irrelevant or
! misleading for the docuement subject. The worst index is one based on
! every word in the document. This inevitably leads to the search engines
! offering poor suggestions which waste both the users and the servers
! valuable time<P>
!
! So if you decide to exclude robots completely, or just limit the areas
! in which they can roam, set up a <CODE>robots.txt</CODE> file, and refer
! to the <A HREF="http://web.nexor.co.uk/mak/doc/robots/norobots.html">robot
! exclusion documentation</A>.<p>
!
! Much better systems exist to both index your site and publicise its
! resources, e.g.
! <A HREF="http://web.nexor.co.uk/public/aliweb/aliweb.html">ALIWEB</A>, which
! uses site defined index files.<p>
  <!--#include virtual="footer.html" -->
  </BODY>
--- 1,139 ----
  <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
  <HTML>
  <HEAD>
+ <META NAME="description" CONTENT="Some 'how to' tips for the Apache httpd server">
+ <META NAME="keywords" CONTENT="apache,redirect,robots,rotate,logfiles">
  <TITLE>Apache HOWTO documentation</TITLE>
  </HEAD>
  <BODY>
  <!--#include virtual="header.html" -->
! <H1>Apache HOWTO documentation</H1>
  How to:
  <ul>
! <li><A HREF="#redirect">redirect an entire server or directory to a single URL</A>
  <li><A HREF="#logreset">reset your log files</A>
! <li><A HREF="#stoprob">stop/restrict robots</A>
  </ul>
! <HR>
! <H2><A name="redirect">How to redirect an entire server or directory to a single URL</A></H2>
! <P>There are two chief ways to redirect all requests for an entire
! server to a single location: one which requires the use of
! <code>mod_rewrite</code>, and another which uses a CGI script.
! <P>First: if all you need to do is migrate a server from one name to
! another, simply use the <code>Redirect</code> directive, as supplied
! by <code>mod_alias</code>:
! <blockquote><pre>
! Redirect / http://www.apache.org/
! </pre></blockquote>
!
! <P>Since <code>Redirect</code> will forward along the complete path,
! however, it may not be appropriate - for example, when the directory
! structure has changed after the move, and you simply want to direct people
! to the home page.
!
! <P>The best option is to use the standard Apache module <code>mod_rewrite</code>.
! If that module is compiled in, add the following lines:
!
! <blockquote><pre>RewriteEngine On
! RewriteRule /.* http://www.apache.org/ [R]
! </pre></blockquote>
!
! This will send an HTTP 302 Redirect back to the client, and no matter
! what they gave in the original URL, they'll be sent to
! "http://www.apache.org".
!
! The second option is to set up a <CODE>ScriptAlias</Code> pointing to
! a <B>cgi script</B> which outputs a 301 or 302 status and the location
! of the other server.</P>
!
! <P>By using a <B>cgi-script</B> you can intercept various requests and
! treat them specially, e.g. you might want to intercept <B>POST</B>
! requests, so that the client isn't redirected to a script on the other
! server which expects POST information (a redirect will lose the POST
! information.) You might also want to use a CGI script if you don't
! want to compile mod_rewrite into your server.
!
! <P>Here's how to redirect all requests to a script... In the server
! configuration file,
! <blockquote><pre>ScriptAlias / /usr/local/httpd/cgi-bin/redirect_script</pre></blockquote>
!
! and here's a simple perl script to redirect requests:
!
! <blockquote><pre>
! #!/usr/local/bin/perl
!
! print "Status: 302 Moved Temporarily\r
! Location: http://www.some.where.else.com/\r\n\r\n";
!
! </pre></blockquote></P>
!
! <HR>
  <H2><A name="logreset">How to reset your log files</A></H2>
! <P>Sooner or later, you'll want to reset your log files (access_log and
  error_log) because they are too big, or full of old information you don't
! need.</P>
! <P><CODE>access.log</CODE> typically grows by 1Mb for each 10,000 requests.</P>
! <P>Most people's first attempt at replacing the logfile is to just move the
! logfile or remove the logfile. This doesn't work.</P>
! <P>Apache will continue writing to the logfile at the same offset as before the
! logfile moved. This results in a new logfile being created which is just as
  big as the old one, but it now contains thousands (or millions) of null
! characters.</P>
! <P>The correct procedure is to move the logfile, then signal Apache to tell it to reopen the logfiles.</P>
! <P>Apache is signaled using the <B>SIGHUP</B> (-1) signal. e.g.
  <blockquote><code>
! mv access_log access_log.old<BR>
! kill -1 `cat httpd.pid`
  </code></blockquote>
+ </P>
! <P>Note: <code>httpd.pid</code> is a file containing the <B>p</B>rocess <B>id</B>
  of the Apache httpd daemon, Apache saves this in the same directory as the log
! files.</P>
!
! <P>Many people use this method to replace (and backup) their logfiles on a
! nightly or weekly basis.</P>
! <HR>
!
! <H2><A name="stoprob">How to stop or restrict robots</A></H2>
!
! <P>Ever wondered why so many clients are interested in a file called
! <code>robots.txt</code> which you don't have, and never did have?</P>
!
! <P>These clients are called <B>robots</B> (also known as crawlers,
! spiders and other cute names) - special automated clients which
! wander around the web looking for interesting resources.</P>
!
! <P>Most robots are used to generate some kind of <em>web index</em> which
! is then used by a <em>search engine</em> to help locate information.</P>
!
! <P><code>robots.txt</code> provides a means to request that robots limit their
! activities at the site, or more often than not, to leave the site alone.</P>
! <P>When the first robots were developed, they had a bad reputation for sending hundreds/thousands of requests to each site, often resulting in the site being overloaded. Things have improved dramatically since then, thanks to <A HREF="http://info.webcrawler.com/mak/projects/robots/guidelines.html"> Guidelines for Robot Writers</A>, but even so, some robots may <A HREF="http://www.zyzzyva.com/robots/alert/">exhibit unfriendly behavior</A> which the webmaster isn't willing to tolerate, and will want to stop.</P>
! <P>Another reason some webmasters want to block access to robots is to
! stop them indexing dynamic information. Many search engines will use the
! data collected from your pages for months to come - not much use if you're
! serving stock quotes, news, weather reports or anything else that will be
! stale by the time people find it in a search engine.</P>
! <P>If you decide to exclude robots completely, or just limit the areas
! in which they can roam, create a <CODE>robots.txt</CODE> file; refer
! to the <A HREF="http://info.webcrawler.com/mak/projects/robots/robots.html">robot information pages</A> provided by Martijn Koster for the syntax.</P>
  <!--#include virtual="footer.html" -->
  </BODY>
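As a postscript to the robots section of the revised page: a minimal <code>robots.txt</code> in the exclusion syntax documented on the robot information pages linked above might look like the following. The directory names here are purely illustrative, not part of the committed page.

```
# Ask all robots to stay out of the CGI area and out of a hypothetical
# /quotes/ directory of rapidly-changing data.
User-agent: *
Disallow: /cgi-bin/
Disallow: /quotes/
```

An empty <code>Disallow:</code> line (or no <code>robots.txt</code> at all) means well-behaved robots may visit everything.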
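The log-reset recipe in the revised page (move the logfiles, then SIGHUP the daemon via <code>httpd.pid</code>) can be wrapped in a small script suitable for the nightly cron job the page mentions. This is a sketch, not part of the commit; the default <code>LOGDIR</code>/<code>PIDFILE</code> locations are assumptions to adjust for your installation.

```shell
#!/bin/sh
# Sketch of the rotate-then-SIGHUP procedure from the howto.
# The default LOGDIR/PIDFILE paths below are assumptions - adjust
# them for your installation.

rotate_logs() {
    logdir=$1
    pidfile=$2
    for log in access_log error_log; do
        # Move each logfile aside; Apache keeps writing to the moved
        # file at the old offset until it is told to reopen its logs.
        if [ -f "$logdir/$log" ]; then
            mv "$logdir/$log" "$logdir/$log.old"
        fi
    done
    # SIGHUP (-1) tells Apache to reopen access_log/error_log.
    if [ -f "$pidfile" ]; then
        kill -1 "$(cat "$pidfile")"
    fi
}

rotate_logs "${LOGDIR:-/usr/local/httpd/logs}" "${PIDFILE:-/usr/local/httpd/logs/httpd.pid}"
```

Run it from cron before compressing or archiving the <code>.old</code> files; both steps are guarded so the script is a no-op when the logfiles or pidfile are absent.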