[Nutch Wiki] Update of "GettingNutchRunningWithWindows" by AledJones

Apache Wiki Wed, 02 Nov 2005 08:41:14 -0800

The following page has been changed by AledJones:
http://wiki.apache.org/nutch/GettingNutchRunningWithWindows

The comment on the change is:
Editors please fine tune as required

------------------------------------------------------------------------------
- Since Nutch is written in Java, it should be possible to get Nutch working in 
a Windows environment, provided the write software is installed.
+ Since Nutch is written in Java, it should be possible to get Nutch working in 
a Windows environment, provided that the correct software is installed.
+ 
+ The following documents how I got it working on Windows XP Pro running Tomcat 
5.28.  
  
  == Java ==
  
  You will need to have Java 1.4.2 or Java 1.5 installed.
  
+ == Cygwin ==
+ 
+ You'll need Cygwin to run the shell commands (it is also handy if you prefer 
using Linux commands).
+ 
+ 
  == Tomcat ==
  
- TODO
+ You'll need Tomcat 4.* or higher running on your machine.
  
+ == Crawling ==
+ 
+ Download the release and extract it anywhere on your hard disk, e.g. 
C:\nutch-0.7.1
+ 
+ Create a text file in your nutch directory, e.g. "urls", and add the 
urls of the sites you want to crawl as shown in the tutorial.
+ 
+ Load up cygwin and navigate to your nutch directory.  When cygwin launches 
you'll usually find yourself in your user folder (e.g. C:\Documents and 
Settings\username).
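+ 
+ Cygwin normally exposes Windows drives under /cygdrive, so C:\nutch-0.7.1 
becomes /cygdrive/c/nutch-0.7.1. The function below is only a sketch of that 
mapping: win_to_cygdrive is a made-up helper name, and it assumes the default 
/cygdrive prefix and a plain drive-letter path. Cygwin's own cygpath -u command 
does the same job more robustly.

```shell
# Hypothetical helper: convert a Windows drive path to its /cygdrive form.
# Assumes the default /cygdrive prefix and input of the form X:\some\path.
win_to_cygdrive() {
    # First character is the drive letter; lower-case it.
    drive=$(printf '%s' "$1" | cut -c1 | tr 'A-Z' 'a-z')
    # Everything after "X:" is the path; turn backslashes into slashes.
    rest=$(printf '%s' "$1" | cut -c3- | tr '\\' '/')
    printf '/cygdrive/%s%s\n' "$drive" "$rest"
}
```

+ For example, cd "$(win_to_cygdrive 'C:\nutch-0.7.1')" would then take you to 
the nutch directory.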
+ 
+ Follow the tutorial instructions to begin the crawl by entering commands in 
cygwin. Depending on the commands you enter, Nutch should create a crawl 
directory and a log file.
+ 
+ For example, if you enter the following command:
+ {{{
+ bin/nutch crawl urls -dir crawled -depth 3 >& crawl.log
+ }}}
+ then a folder called crawled is created in your nutch directory, along with 
the crawl.log file.  Use this log file to debug any errors you might have.  
In my experience you'll need to delete the crawl.log file before starting the 
crawl again.
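+ 
+ The crawl steps above can be wrapped in a small script. This is only a 
sketch: prepare_crawl and run_crawl are made-up function names, the seed URL in 
the usage note is a placeholder, and the portable redirection 
"> crawl.log 2>&1" is used in place of ">&" so that it also works in plain sh.

```shell
# Hypothetical helper: write the seed list and clear the old log.
# $1 = nutch directory, remaining arguments = URLs to crawl.
prepare_crawl() {
    dir="$1"; shift
    : > "$dir/urls"                      # start with an empty seed file
    for u in "$@"; do
        printf '%s\n' "$u" >> "$dir/urls"
    done
    rm -f "$dir/crawl.log"               # remove the log from a previous run
}

# Hypothetical helper: run the crawl command from the nutch directory.
# Returns non-zero if bin/nutch is not there.
run_crawl() {
    dir="$1"
    if [ -x "$dir/bin/nutch" ]; then
        (cd "$dir" && bin/nutch crawl urls -dir crawled -depth 3 > crawl.log 2>&1)
    else
        echo "bin/nutch not found in $dir" >&2
        return 1
    fi
}
```

+ Typical use from the nutch directory would be 
prepare_crawl . "http://www.example.com/" followed by run_crawl .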
+ 
+ 
+ == Serving ==
+ 
+ In your Environment Variables settings, add a new variable named 
NUTCH_JAVA_HOME whose value is the location of your JVM (e.g. 
C:\Sun\AppServer\jdk).
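+ 
+ A quick way to confirm the variable is visible from cygwin is a check like 
the one below. It is a sketch only: check_nutch_java_home is a made-up name, 
and it assumes the JVM keeps its launcher at bin/java or bin/java.exe under 
that directory. Open a new cygwin window after changing the variable so the 
new value is picked up.

```shell
# Hypothetical helper: verify NUTCH_JAVA_HOME is set and looks like a JVM.
check_nutch_java_home() {
    if [ -z "${NUTCH_JAVA_HOME:-}" ]; then
        echo "NUTCH_JAVA_HOME is not set" >&2
        return 1
    fi
    # Assumption: the launcher lives in bin/ as java or java.exe.
    if [ ! -e "$NUTCH_JAVA_HOME/bin/java" ] && [ ! -e "$NUTCH_JAVA_HOME/bin/java.exe" ]; then
        echo "no java launcher under $NUTCH_JAVA_HOME/bin" >&2
        return 1
    fi
    echo "Using JVM at $NUTCH_JAVA_HOME"
}
```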
+ 
+ Open up a web browser and navigate to the Tomcat webapps manager (e.g. 
http://localhost:8080/manager/html) and upload the WAR file to the context.
+ 
+ If a root context already exists, undeploy it.
+ 
+ You now need to create a context fragment file so that the root url points to 
your nutch webapp.
+ Create a new XML file (name it the same as the webapp?), e.g. nutch-0.7.1.xml, 
in [tomcat_home]/conf/Catalina/localhost/ and add something like the following 
line to it:
+ {{{
+ <Context path="" debug="5" privileged="true" 
docBase="[tomcat_home]\webapps\nutch-0.7.1"/>
+ }}}
+ 
+ Next, navigate to your nutch webapp folder then WEB-INF/classes.
+ Edit the nutch-site.xml file and add the following to it:
+ 
+ {{{
+ <nutch-conf>
+   <property>
+     <name>searcher.dir</name>
+     <value>your_crawled_folder_here</value>
+   </property>
+ </nutch-conf>
+ }}}
+ 
+ For example, if your nutch directory resides at C:\nutch-0.7.1 and you 
specified crawled as the directory after the -dir option, then enter 
C:\nutch-0.7.1\crawled\ instead of your_crawled_folder_here.
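+ 
+ If you would rather script this step from cygwin, something like the 
following could write the file for you. Treat it as a sketch: 
write_searcher_conf is a made-up name, and it replaces the whole 
nutch-site.xml with just the searcher.dir property (keeping a .bak copy of any 
existing file), so merge by hand if your file already holds other settings.

```shell
# Hypothetical helper: write a minimal nutch-site.xml with searcher.dir set.
# $1 = the webapp's WEB-INF/classes directory, $2 = the crawl directory.
write_searcher_conf() {
    classes_dir="$1"
    crawl_dir="$2"
    conf="$classes_dir/nutch-site.xml"
    # Keep a backup of whatever was there before.
    if [ -f "$conf" ]; then
        cp "$conf" "$conf.bak"
    fi
    cat > "$conf" <<EOF
<nutch-conf>
  <property>
    <name>searcher.dir</name>
    <value>$crawl_dir</value>
  </property>
</nutch-conf>
EOF
}
```

+ The second argument is written out unchanged, so pass the path in single 
quotes, e.g. 'C:\nutch-0.7.1\crawled\'.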
+ 
+ Restart Tomcat using the Windows Services tool, open up a browser and enter 
the url http://localhost:8080.  The nutch search page should appear.  As long 
as searcher.dir points to your crawl directory as shown above, clicking search 
should yield results.
+
