Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "NutchTutorial" page has been changed by RichardLloyd:
http://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=43&rev2=44

Comment:
Added clarifying note to 3.2.

  Typically one starts testing one's configuration by crawling at shallow depths, sharply limiting the number of pages fetched at each level (-topN), and watching the output to check that desired pages are fetched and undesirable pages are not. Once one is confident of the configuration, an appropriate depth for a full crawl is around 10. The number of pages per level (-topN) for a full crawl can range from tens of thousands to millions, depending on your resources.

  === 3.2 Using Individual Commands for Whole-web Crawling ===

+ '''NOTE''': If you previously modified the file conf/regex-urlfilter.txt as covered [[#A3._Crawl_your_first_website|here]], you will need to change it back.
+
  Whole-web crawling is designed to handle very large crawls, which may take weeks to complete and run on multiple machines. It also permits more control over the crawl process, as well as incremental crawling. Note that whole-web crawling does not necessarily mean crawling the entire World Wide Web: we can limit a whole-web crawl to just the list of URLs we want to crawl, using a filter just like the one we used with the crawl command (above).

  ==== Step-by-Step: Concepts ====
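As a sketch of how such a URL filter limits a crawl: Nutch's regex-urlfilter plugin reads conf/regex-urlfilter.txt one pattern per line, where a leading `+` accepts matching URLs, a leading `-` rejects them, and the first match wins. The specific patterns and seed URLs below are illustrative, not from the tutorial; you can preview which seeds a `+` pattern would accept using plain grep:

```shell
# Illustrative filter entry restricting a whole-web crawl to one site
# (regex-urlfilter.txt syntax: '+' accepts, '-' rejects, first match wins):
#
#   +^https?://([a-z0-9-]+\.)*nutch\.apache\.org/
#   -.
#
# Preview which seed URLs the '+' pattern would accept:
printf '%s\n' \
  "http://nutch.apache.org/tutorial.html" \
  "http://example.com/other.html" \
| grep -E '^https?://([a-z0-9-]+\.)*nutch\.apache\.org/'
```

Only the nutch.apache.org seed passes the pattern; the trailing `-.` line in the filter file rejects everything else.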

