Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by FelixJoachim:
http://wiki.apache.org/nutch/GettingNutchRunningWithWindows

------------------------------------------------------------------------------
  Create an empty text file in your nutch directory e.g. "urls" and add the urls of the sites you want to crawl as shown in the tutorial.

+ Add your urls to the crawl-urlfilter.txt (e.g. C:\nutch-0.7.1\conf\crawl-urlfilter.txt). An entry could look like this: +^http://([a-z0-9]*\.)*apache.org/
+
  Load up cygwin and navigate to your nutch directory. When cygwin launches you'll usually find yourself in your user folder (e.g. C:\Documents and Settings\username).

  If your workstation needs to go through a windows authentication proxy to get to the internet then you can use an application such as the NTLM Authorization Proxy Server: [http://www.geocities.com/rozmanov/ntlm/] to get through it. You'll then need to edit the nutch-site.xml file to point to the port opened by the app.

@@ -34, +36 @@

  {{{
  bin/nutch crawl urls -dir crawled -depth 3 >& crawl.log
  }}}
- then a folder called crawled is created in your nutch directory, along with the crawl.log file. Use this log file to debug any errors you might have. From my experience you'll need to delete the crawl.log file before starting the crawl off again.
+ then a folder called crawled is created in your nutch directory, along with the crawl.log file. Use this log file to debug any errors you might have. From my experience you'll need to delete the crawled directory before starting the crawl off again.

  == Serving ==
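
A quick way to sanity-check the crawl-urlfilter.txt entry quoted in the change is to try the pattern against a few URLs from the cygwin prompt. The sketch below is only an approximation: it uses grep -E rather than the Java regex engine Nutch itself uses (the two should agree for a pattern this simple), the helper name url_allowed is made up for illustration, and the two sample URLs are just examples. The leading "+" in the filter file marks an include rule and is not part of the regex, so it is left out here.

```shell
#!/bin/sh
# Regex from the crawl-urlfilter.txt entry, without the leading "+"
# (the "+" only tells Nutch that matching URLs are to be included).
pattern='^http://([a-z0-9]*\.)*apache.org/'

# url_allowed URL (hypothetical helper): exit status 0 when URL matches.
url_allowed() {
    printf '%s\n' "$1" | grep -Eq "$pattern"
}

# Try one URL inside apache.org and one outside it.
for url in "http://lucene.apache.org/nutch/" "http://www.example.com/"; do
    if url_allowed "$url"; then
        echo "accept $url"
    else
        echo "reject $url"
    fi
done
```

Note that the unescaped "." in "apache.org" matches any character, which is how the entry is written on the page; a stricter pattern would escape it.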

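
The corrected advice (delete the crawled directory, not crawl.log, before starting the crawl again) could be wrapped up along these lines. This is a minimal sketch, not something taken from the wiki page: the recrawl function and the NUTCH_CMD variable are assumptions introduced here, and the redirection is written in the portable "> file 2>&1" form, which should behave the same as the ">&" used on the page under cygwin's bash.

```shell
#!/bin/sh
# NUTCH_CMD is a hypothetical override so the launcher can be swapped out;
# by default it is the bin/nutch script in the current nutch directory.
NUTCH_CMD="${NUTCH_CMD:-bin/nutch}"

# recrawl DIR (hypothetical helper): remove the old output, then crawl again.
recrawl() {
    crawl_dir="$1"
    # Remove the output of the previous crawl, as the page recommends.
    rm -rf "$crawl_dir"
    # Same crawl command as on the page; stdout and stderr go to crawl.log.
    "$NUTCH_CMD" crawl urls -dir "$crawl_dir" -depth 3 > crawl.log 2>&1
}

# Only start a crawl when the launcher is actually present here.
if [ -x "$NUTCH_CMD" ]; then
    recrawl crawled
fi
```

Run it from the nutch directory so that the relative paths (bin/nutch, urls, crawled) resolve; crawl.log is simply overwritten on each run.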