Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "NutchTutorial" page has been changed by RichardLloyd:
http://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=43&rev2=44

Comment:
Added clarifying note to 3.2.

  Typically one starts testing one's configuration by crawling at shallow depths, sharply limiting the number of pages fetched at each level (-topN), and watching the output to check that desired pages are fetched and undesirable pages are not. Once one is confident of the configuration, an appropriate depth for a full crawl is around 10. The number of pages per level (-topN) for a full crawl can range from tens of thousands to millions, depending on your resources.

  === 3.2 Using Individual Commands for Whole-web Crawling ===

+ '''NOTE''': If you previously modified the file conf/regex-urlfilter.txt as covered [[#A3._Crawl_your_first_website|here]], you will need to change it back.
+
  Whole-web crawling is designed to handle very large crawls, which may take weeks to complete and run on multiple machines. It also permits more control over the crawl process, as well as incremental crawling. Note that whole-web crawling does not necessarily mean crawling the entire World Wide Web: we can limit a whole-web crawl to just the list of URLs we want to crawl, using a filter just like the one we used with the crawl command (above).

  ==== Step-by-Step: Concepts ====
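As a sketch of how such a URL filter limits a crawl: Nutch's regex-urlfilter plugin reads conf/regex-urlfilter.txt one pattern per line, where a leading `+` accepts matching URLs, a leading `-` rejects them, and the first match wins. The specific patterns and seed URLs below are illustrative, not from the tutorial; you can preview which seeds a `+` pattern would accept using plain grep:

```shell
# Illustrative filter entry restricting a whole-web crawl to one site
# (regex-urlfilter.txt syntax: '+' accepts, '-' rejects, first match wins):
#
#   +^https?://([a-z0-9-]+\.)*nutch\.apache\.org/
#   -.
#
# Preview which seed URLs the '+' pattern would accept:
printf '%s\n' \
  "http://nutch.apache.org/tutorial.html" \
  "http://example.com/other.html" \
| grep -E '^https?://([a-z0-9-]+\.)*nutch\.apache\.org/'
```

Only the nutch.apache.org seed passes the pattern; the trailing `-.` line in the filter file rejects everything else.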

