Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by AledJones:
http://wiki.apache.org/nutch/GettingNutchRunningWithWindows

The comment on the change is:
Editors please fine tune as required

------------------------------------------------------------------------------
- Since Nutch is written in Java, it should be possible to get Nutch working in a Windows environment, provided the write software is installed.
+ Since Nutch is written in Java, it should be possible to get Nutch working in a Windows environment, provided that the correct software is installed.
+
+ The following documents how I got it working on Windows XP Pro running Tomcat 5.28.

  == Java ==
  You will need to have Java 1.4.2 or Java 1.5 installed.

+ == Cygwin ==
+
+ You'll need Cygwin to run the shell commands (it also helps if you prefer using Linux commands).
+
+
  == Tomcat ==
- TODO
+ You'll need Tomcat 4.* or higher running on your machine.

+ == Crawling ==
+
+ Download the release and extract it anywhere on your hard disk, e.g. C:\nutch-0.7.1
+
+ Create an empty text file in your nutch directory, e.g. "urls", and add the urls of the sites you want to crawl as shown in the tutorial.
+
+ Load up Cygwin and navigate to your nutch directory. When Cygwin launches you'll usually find yourself in your user folder (e.g. C:\Documents and Settings\username).
+
+ Follow the tutorial instructions to begin the crawl by entering commands in Cygwin. Depending on the commands you enter, Nutch should create a crawl directory and a log file.
+
+ For example, if you enter the following command:
+ {{{
+ bin/nutch crawl urls -dir crawled -depth 3 >& crawl.log
+ }}}
+ then a folder called crawled is created in your nutch directory, along with the crawl.log file. Use this log file to debug any errors you might have. In my experience you'll need to delete the crawl.log file before starting the crawl again.
+
+
+ == Serving ==
+
+ In your Environment Variables settings, add a new environment variable named NUTCH_JAVA_HOME and set its value to the location of your JVM (e.g. C:\Sun\AppServer\jdk).
+
+ Open up a web browser, navigate to the Tomcat webapps manager (e.g. http://localhost:8080/manager/html) and upload the WAR file to the context.
+
+ If a root context already exists, undeploy it.
+
+ You now need to create a context fragment file so that the root url points to your nutch webapp.
+ Navigate to your [tomcat_home]/conf/Catalina/localhost/ folder and create a new xml file there (name it the same as the webapp?), e.g. nutch-0.7.1.xml. Add something like the following line to it:
+ {{{
+ <Context path="" debug="5" privileged="true" docBase="[tomcat_home]\webapps\nutch-0.7.1"/>
+ }}}
+
+ Next, navigate to your nutch webapp folder, then to WEB-INF/classes.
+ Edit the nutch-site.xml file and add the following to it:
+
+ {{{
+ <nutch-conf>
+   <property>
+     <name>searcher.dir</name>
+     <value>your_crawled_folder_here</value>
+   </property>
+ </nutch-conf>
+ }}}
+
+ For example, if your nutch directory resides at C:\nutch-0.7.1 and you specified crawled as the directory after the -dir option, then enter C:\nutch-0.7.1\crawled\ instead of your_crawled_folder_here.
+
+ Restart Tomcat using the Windows services tool, open up a browser and enter the url http://localhost:8080. The nutch search page should appear. As long as you've defined the correct location of your nutch index directory as shown above, clicking search should yield results.
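
The crawling steps described above can be collected into a small shell script to run from the Cygwin prompt. This is only a rough sketch, not part of the wiki page: the NUTCH_HOME variable, the /cygdrive/c/nutch-0.7.1 default, the function names and the example seed url are all assumptions for illustration, so adjust them to your own install.

```shell
#!/bin/sh
# Sketch of the crawl steps for a Cygwin prompt.
# NUTCH_HOME and its default path are assumptions, not something Nutch defines here.
NUTCH_HOME="${NUTCH_HOME:-/cygdrive/c/nutch-0.7.1}"

# prepare_crawl DIR URL...
# Moves into the nutch directory, writes the seed file and clears the old log.
prepare_crawl() {
    dir="$1"
    shift
    cd "$dir" || return 1
    # One url per line in the seed file called "urls".
    printf '%s\n' "$@" > urls
    # Delete the log left over from a previous crawl before starting again.
    rm -f crawl.log
}

# run_crawl
# Same crawl command as in the page; "> crawl.log 2>&1" is the portable
# spelling of ">& crawl.log" (stdout and stderr both go to the log file).
run_crawl() {
    bin/nutch crawl urls -dir crawled -depth 3 > crawl.log 2>&1
}

# Only start a crawl when the Nutch launcher is really there.
if [ -x "$NUTCH_HOME/bin/nutch" ]; then
    # http://www.example.com/ is a placeholder seed url.
    prepare_crawl "$NUTCH_HOME" "http://www.example.com/" && run_crawl
fi
```

If the launcher is not found under NUTCH_HOME the script does nothing, which makes it safe to try before the release has been extracted. After a run, crawl.log in the nutch directory is the place to look for errors.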
