I am running a nutch crawl of 19 sites. I wish to let this crawl run for about two days and then stop it gracefully (I don't expect it to complete by then). Is there a way to do this? I want it to stop crawling and then build the Lucene index. Note that I used the simple nutch crawl command rather than the "whole web" crawling methodology:
nutch crawl urls.txt -dir /usr/tmp/19sites -depth 10

Or is it better to use the -topN option? Some documentation for topN:
http://www.mail-archive.com/[email protected]/msg03916.html

"You can limit the number of pages by using the -topN parameter. This limits the number of pages fetched in each round. Pages are prioritized by how well-linked they are. The maximum number of pages that can be fetched is topN*depth."

Or from the tutorial:

-topN N determines the maximum number of pages that will be retrieved at each level up to the depth. For example, a typical call might be:

bin/nutch crawl urls -dir crawl -depth 3 -topN 50

Typically one starts testing one's configuration by crawling at shallow depths, sharply limiting the number of pages fetched at each level (-topN), and watching the output to check that desired pages are fetched and undesirable pages are not. Once one is confident of the configuration, then an appropriate depth for a full crawl is around 10. The number of pages per level (-topN) for a full crawl can be from tens of thousands to millions, depending on your resources.
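Going by the quoted rule that the maximum number of fetched pages is topN*depth, a bounded crawl can be sketched like this. The TOPN value of 1000 is purely an illustrative assumption (not a recommendation), and the actual nutch invocation is shown commented out since it needs a working Nutch install and seed list:

```shell
#!/bin/sh
# -topN caps pages fetched per round, so (per the mail-archive quote
# above) total fetches are bounded by topN * depth.
DEPTH=10
TOPN=1000   # illustrative assumption; tune to whatever fits a two-day run
echo "upper bound on fetched pages: $((TOPN * DEPTH))"

# Hypothetical bounded crawl, same style as the simple command above:
# nutch crawl urls.txt -dir /usr/tmp/19sites -depth 10 -topN 1000
```

With a bound like this the crawl ends on its own instead of being killed mid-round, which avoids the graceful-stop problem entirely.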
