I am running a nutch crawl of 19 sites. I wish to let this crawl run for about two days and then stop it gracefully (I don't expect it to complete by then). Is there a way to do this? I want it to stop crawling and then build the Lucene index. Note that I used the simple nutch crawl command rather than the "whole web" crawling methodology:
nutch crawl urls.txt -dir /usr/tmp/19sites -depth 10

Or is it better to use the -topN option? Some documentation for topN:
http://www.mail-archive.com/[email protected]/msg03916.html

"You can limit the number of pages by using the -topN parameter. This limits the number of pages fetched in each round. Pages are prioritized by how well-linked they are. The maximum number of pages that can be fetched is topN*depth."

Or from the tutorial:

-topN N determines the maximum number of pages that will be retrieved at each level up to the depth. For example, a typical call might be:

bin/nutch crawl urls -dir crawl -depth 3 -topN 50

Typically one starts testing one's configuration by crawling at shallow depths, sharply limiting the number of pages fetched at each level (-topN), and watching the output to check that desired pages are fetched and undesirable pages are not. Once one is confident of the configuration, then an appropriate depth for a full crawl is around 10. The number of pages per level (-topN) for a full crawl can be from tens of thousands to millions, depending on your resources.
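Going by the quoted rule that the maximum number of fetched pages is topN*depth, a bounded crawl can be sketched like this. The TOPN value of 1000 is purely an illustrative assumption (not a recommendation), and the actual nutch invocation is shown commented out since it needs a working Nutch install and seed list:

```shell
#!/bin/sh
# -topN caps pages fetched per round, so (per the mail-archive quote
# above) total fetches are bounded by topN * depth.
DEPTH=10
TOPN=1000   # illustrative assumption; tune to whatever fits a two-day run
echo "upper bound on fetched pages: $((TOPN * DEPTH))"

# Hypothetical bounded crawl, same style as the simple command above:
# nutch crawl urls.txt -dir /usr/tmp/19sites -depth 10 -topN 1000
```

With a bound like this the crawl ends on its own instead of being killed mid-round, which avoids the graceful-stop problem entirely.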
