Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by MiddleForkMaps:
http://wiki.apache.org/nutch/GettingNutchRunningWithDebian

------------------------------------------------------------------------------
  ''JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun-1.5.0.10''[[BR]]
  ''export JAVA_HOME''[[BR]]
- == Install Tomcat5.5 ==
+ == Install Tomcat5.5 and Verify that it is functioning ==
  ''# apt-get install tomcat5.5 libtomcat5.5-java tomcat5.5-admin tomcat5.5-web''[[BR]]
  Verify Tomcat is running:[[BR]]
  ''# /etc/init.d/tomcat5.5 status''[[BR]]
@@ -23, +23 @@
  ''# /etc/init.d/tomcat5.5 start''[[BR]]
  ''# /etc/init.d/tomcat5.5 stop''[[BR]]
  '''It is NOT necessary to run ''''~/local/tomcat/bin/catalina.sh start'''' as noted elsewhere in the WIKI, nor is it necessary to start tomcat/catalina from any particular location'''[[BR]]
+ Tomcat5.5 under Debian Etch listens on port 8180, not 8080, so pointing your browser to http://mysite:8180 will bring up the Tomcat home page if everything is functioning properly.[[BR]]
+ === Grant Yourself Tomcat Manager Permissions ===
+ Edit ''/usr/share/tomcat5.5/conf/tomcat-users.xml'' and include the following:[[BR]]
+ {{{<user username="myname" password="mypassword" roles="manager"/>}}}
+ === Enter the Tomcat Manager ===
+ Tomcat5.5 under Debian Etch comes pre-installed with a handful of simple webapps. Clicking on the ''Tomcat Manager'' link from the Tomcat home page will show you a list of these applications and their execution status. Later we will return to this page to verify that our nutch applications are running.
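The port check described above can also be done from a shell instead of a browser. The following is only a minimal sketch, assuming curl is installed. The variable names TOMCAT_HOST and TOMCAT_PORT and the helper functions tomcat_url and check_tomcat are illustrative and do not come from the wiki page.

```shell
#!/bin/sh
# Sketch: check that Tomcat 5.5 answers on the port used by the Debian Etch package.
# TOMCAT_HOST and TOMCAT_PORT are hypothetical names chosen for this example.

TOMCAT_HOST="${TOMCAT_HOST:-localhost}"
TOMCAT_PORT="${TOMCAT_PORT:-8180}"   # 8180 per the wiki page, not the usual 8080

# Print the URL of the Tomcat home page for a given host and port.
tomcat_url() {
    printf 'http://%s:%s/\n' "$1" "$2"
}

# Return 0 if the home page can be fetched within 5 seconds, non-zero otherwise.
check_tomcat() {
    curl --silent --fail --output /dev/null --max-time 5 "$(tomcat_url "$1" "$2")"
}

# Example call (not run automatically):
#   check_tomcat "$TOMCAT_HOST" "$TOMCAT_PORT" && echo "Tomcat is answering"
```

If check_tomcat returns non-zero, ''/etc/init.d/tomcat5.5 status'' is the first thing to look at, as the page itself suggests.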
-
- == Configure File and Webapp Paths ==
- Under Debian Etch, the Catalina configuration files are located under '''/etc/tomcat5.5/policy.d'''. At runtime they are combined into a single file, ''/usr/share/tomcat5.5/conf/catalina.policy''. Do not edit the latter, as it will be overwritten.[[BR]]
- At the end of /etc/tomcat5.5/policy.d/04webapps.policy include the following code:[[BR]]
-
- ''grant codeBase "file:/usr/share/tomcat5.5-webapps/-" {[[BR]]
- permission java.util.PropertyPermission "user.dir", "read";[[BR]]
- permission java.util.PropertyPermission "java.io.tmpdir", "read,write";[[BR]]
- permission java.util.PropertyPermission "org.apache.*", "read,execute";[[BR]]
- permission java.io.FilePermission "/usr/local/nutch/crawls/-" , "read";[[BR]]
- permission java.io.FilePermission "/var/lib/tomcat5.5/temp", "read";[[BR]]
- permission java.io.FilePermission "/var/lib/tomcat5.5/temp/-", "read,write,execute,delete";[[BR]]
- permission java.lang.RuntimePermission "createClassLoader", "";[[BR]]
- permission java.security.AllPermission;[[BR]]
- };[[BR]]
- '''Warning: The last line here was necessary in order to make things work for me. If anybody can supply a more restrictive permission set, please do so!!! The effects of this are unknown'''[[BR]]
  == Acquire, install and configure Nutch ==
  Acquire a copy of nutch and unpack it in a new directory location. I suggest using /usr/local/nutch as the top-level directory, but this is of course optional[[BR]]
  === Configure for multiple, independent site crawls and searches ===
- Follow the section '''Intranet:Configuration''' from the Nutch tutorial at http://lucene.apache.org/nutch/tutorial8.html. However, plan in advance for crawling and searching sites independently from one another:[[BR]]
+ Follow the section ''Intranet:Configuration'' from the Nutch tutorial at http://lucene.apache.org/nutch/tutorial8.html.
However, plan in advance for crawling and searching sites independently from one another:[[BR]]
  Given two sites, site1 and site2 which you wish to crawl/index (and later search) independently from each other, you may make multiple copies of the conf directory:[[BR]]
  ''#cd /usr/local/nutch''[[BR]]
  ''#cp -rp conf conf.site1''[[BR]]
@@ -55, +45 @@
  ''NUTCH_CONF_DIR=conf.site1''[[BR]]
  ''export NUTCH_CONF_DIR''[[BR]]
  ''bin/nutch crawl urls/site1 -dir crawls/site1 -depth 10 -topN 100000''[[BR]]
- and the same for site2.[[BR]]
+ and the same for site2.
- Crawl each site:[[BR]]
+
+ === Then proceed to crawl each site: ===
- ''sh crawl_site1.sh''[[BR]]
+ ''#sh crawl_site1.sh''[[BR]]
- ''sh crawl_site2.sh''[[BR]]
+ ''#sh crawl_site2.sh''[[BR]]
+ == Configure Tomcat's File and Webapp Paths ==
+ Under Debian Etch, the Catalina configuration files are located under '''/etc/tomcat5.5/policy.d'''. At runtime they are combined into a single file, ''/usr/share/tomcat5.5/conf/catalina.policy''. Do not edit the latter, as it will be overwritten.[[BR]]
+ At the end of /etc/tomcat5.5/policy.d/04webapps.policy include the following code:[[BR]]
+ {{{grant codeBase "file:/usr/share/tomcat5.5-webapps/-" {
+ permission java.util.PropertyPermission "user.dir", "read";
+ permission java.util.PropertyPermission "java.io.tmpdir", "read,write";
+ permission java.util.PropertyPermission "org.apache.*", "read,execute";
+ permission java.io.FilePermission "/usr/local/nutch/crawls/-" , "read";
+ permission java.io.FilePermission "/var/lib/tomcat5.5/temp", "read";
+ permission java.io.FilePermission "/var/lib/tomcat5.5/temp/-", "read,write,execute,delete";
+ permission java.lang.RuntimePermission "createClassLoader", "";
+ permission java.security.AllPermission;
+ };}}}
+ '''Warning: The last line here was necessary in order to make things work for me. If anybody can supply a more restrictive permission set, please do so!!!
The effects of this are unknown'''[[BR]]
+ == Install Multiple Copies of Nutch under Tomcat5.5 and Prepare for Searching ==
+ Under Debian Etch & Tomcat5.5 the webapps path is located at[[BR]]
+ ''/usr/share/tomcat5.5-webapps''[[BR]]
+ '''Contrary to the Nutch tutorial(s) it is NOT NECESSARY to remove the ROOT context
-

_______________________________________________
Nutch-cvs mailing list
Nutch-cvs@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-cvs