[Nutch Wiki] Update of "GettingNutchRunningWithWindows" by FelixJoachim

Apache Wiki Thu, 17 Nov 2005 15:08:00 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The following page has been changed by FelixJoachim:
http://wiki.apache.org/nutch/GettingNutchRunningWithWindows

------------------------------------------------------------------------------
  
  Create an empty text file in your nutch directory (e.g. "urls") and add the 
URLs of the sites you want to crawl, as shown in the tutorial.
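
As a rough sketch of this step from the cygwin prompt, the seed file could be written as below. The helper name write_seed_urls and the example URL are only illustrations, and the one-URL-per-line layout is an assumption based on the tutorial's flat file of root URLs.

```shell
# Hypothetical helper: write each argument after the first as one line
# of the seed file named by the first argument (overwrites the file).
write_seed_urls() {
  out="$1"
  shift
  printf '%s\n' "$@" > "$out"
}

# Example usage (run from the nutch directory; the URL is only an example):
#   write_seed_urls urls 'http://lucene.apache.org/nutch/'
```

Any text editor works just as well; the point is only that the file is plain text with nothing but URLs in it.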
  
+ Add your urls to the crawl-urlfilter.txt (e.g. 
C:\nutch-0.7.1\conf\crawl-urlfilter.txt). An entry could look like this: 
+^http://([a-z0-9]*\.)*apache.org/
+ 
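
To get a feel for what such an entry accepts, the pattern can be tried against sample URLs with grep. This is only an approximation: the leading "+" in the filter file marks the line as an include rule and is not part of the regular expression, Nutch evaluates the pattern with Java's regex engine rather than grep, and the function name filter_verdict is made up for this sketch.

```shell
# Hypothetical check: print "accept" if the URL matches the example
# include pattern from crawl-urlfilter.txt, otherwise print "reject".
# The "+" prefix of the filter entry is dropped because it is not regex.
filter_verdict() {
  if printf '%s\n' "$1" | grep -Eq '^http://([a-z0-9]*\.)*apache.org/'; then
    echo accept
  else
    echo reject
  fi
}

# Example usage:
#   filter_verdict 'http://lucene.apache.org/nutch/'
#   filter_verdict 'http://www.example.com/'
```

Note that the dot in "apache.org" is unescaped in the entry, so it matches any single character, which is slightly looser than the literal host name.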
  Load up cygwin and navigate to your nutch directory.  When cygwin launches 
you'll usually find yourself in your user folder (e.g. C:\Documents and 
Settings\username).
  
  If your workstation has to go through a Windows authentication (NTLM) proxy to 
reach the internet, you can use an application such as the NTLM Authorization 
Proxy Server: [http://www.geocities.com/rozmanov/ntlm/] to get through it.  
You'll then need to edit the nutch-site.xml file to point to the port opened by 
that application.
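
A hedged sketch of what that edit might contain: the helper below prints two property entries that could be pasted inside the root element of conf/nutch-site.xml. The property names http.proxy.host and http.proxy.port come from Nutch's default configuration; the helper name proxy_properties is invented here, and the host and port in the usage line are assumptions that must be replaced with whatever your local proxy application actually listens on.

```shell
# Hypothetical helper: print proxy property entries for nutch-site.xml.
# $1 = host the local proxy app runs on, $2 = port it listens on.
proxy_properties() {
  host="$1"
  port="$2"
  cat <<EOF
<property>
  <name>http.proxy.host</name>
  <value>${host}</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>${port}</value>
</property>
EOF
}

# Example usage (localhost and 5865 are assumed values, check your proxy's config):
#   proxy_properties localhost 5865
```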
@@ -34, +36 @@

  {{{
  bin/nutch crawl urls -dir crawled -depth 3 >& crawl.log
  }}}
- then a folder called crawled is created in your nutch directory, along with 
the crawl.log file.  Use this log file to debug any errors you might have.  
From my experience you'll need to delete the crawl.log file before starting the 
crawl off again.
+ then a folder called crawled is created in your nutch directory, along with 
the crawl.log file.  Use this log file to debug any errors you might have.  
From my experience you'll need to delete the crawled directory before starting 
the crawl off again.
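
The clean-up before a re-run could be scripted roughly as follows. The function names are made up for this sketch, and the crawl command simply repeats the one shown above, with "> crawl.log 2>&1" used as the portable spelling of ">& crawl.log".

```shell
# Hypothetical helper: remove the output directory of a previous crawl,
# if it exists, so the next crawl can create it afresh.
reset_crawl_dir() {
  dir="$1"
  if [ -d "$dir" ]; then
    rm -rf "$dir"
  fi
}

# Hypothetical wrapper: clean up, then start the crawl again.
# Must be run from the nutch directory so that bin/nutch resolves.
recrawl() {
  reset_crawl_dir crawled
  bin/nutch crawl urls -dir crawled -depth 3 > crawl.log 2>&1
}

# Example usage:
#   recrawl
```

Because crawl.log is opened with ">", it is overwritten on each run, so only the crawled directory needs deleting by hand.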
  
  == Serving ==
