I am running a Nutch crawl of 19 sites. I would like to let this crawl run for
about two days and then gracefully stop it (I don't expect it to complete by
then). Is there a way to do this? I want it to stop crawling and then build the
Lucene index. Note that I used a simple nutch crawl command, rather than the
"whole web" crawling methodology:
nutch crawl urls.txt -dir /usr/tmp/19sites -depth 10
Or is it better to use the -topN option?
Some documentation for topN:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg03916.html
"You can limit the number of pages by using the -topN parameter. This
limits the number of pages fetched in each round. Pages are prioritized
by how well-linked they are. The maximum number of pages that can be
fetched is topN*depth."
Or from the tutorial:
-topN N determines the maximum number of pages that
will be retrieved at each level up to the depth.
For example, a typical call might be:
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
Typically one starts testing one's configuration by crawling at
shallow depths, sharply limiting the number of pages fetched at each
level (-topN), and watching the output to check that
desired pages are fetched and undesirable pages are not. Once one is
confident of the configuration, then an appropriate depth for a full
crawl is around 10. The number of pages per level
(-topN) for a full crawl can be from tens of thousands to
millions, depending on your resources.
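
To make the quoted upper bound (topN*depth) concrete, here is a rough sketch of
the arithmetic in shell. The function name max_pages is only an illustrative
helper, not part of Nutch, and the value 1000 in the second call is an assumed
topN, not something I have actually run:

```shell
#!/bin/sh
# Hypothetical helper (not a Nutch command): upper bound on the number of
# pages fetched, using the topN*depth rule quoted from the mailing list.
max_pages() {
  topn=$1
  depth=$2
  echo $((topn * depth))
}

# The tutorial's example call: -depth 3 -topN 50
max_pages 50 3

# My crawl's depth of 10 with an assumed -topN of 1000
max_pages 1000 10
```

If that bound is right, then choosing topN for a fixed depth caps the total
crawl size, which might be a way to keep the crawl within roughly two days.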
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general