Hello,

I'm having a bit of trouble controlling the fetchlist. I have 3
sites that I'd like to crawl fully. I inject the 3 base URLs and add
entries to the regex-urlfilter.txt file so that only URLs from those 3
sites are fetched. I would like to fetch new pages in moderately
sized chunks (each "round" taking less than 30 minutes). This makes
development easier, and I think it would be useful in production as
well. At first I had the "bin/nutch generate" command add only the top
1000 URLs, as is done in the example. I may be wrong, but it appears
that the rest of the URLs are then thrown out. That isn't necessarily
a problem, but sometimes 1 of my 3 sites occupies _all_ of the top
1000 slots, so after that the other 2 sites don't get crawled at all.
If I don't pass the -topN option to generate, by the 4th round the
fetchlist is very large and the round takes much longer than my
desired 30 minutes.
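For reference, this is roughly the round loop I'm running. It's a sketch, not my exact script: the directory names (db, segments) and option syntax are taken from the whole-web crawling tutorial and may differ slightly in 0.6.

```shell
# Hypothetical crawl loop -- "db" and "segments" paths and the exact
# generate/fetch/updatedb syntax are assumptions from the tutorial.
for round in 1 2 3 4 5; do
  # Build a fetchlist capped at the top 1000 URLs
  # (dropping -topN is what makes round 4 onward blow up for me)
  bin/nutch generate db segments -topN 1000
  segment=`ls -d segments/2* | tail -1`   # newest segment just generated
  bin/nutch fetch $segment                # fetch the pages in that segment
  bin/nutch updatedb db $segment          # fold newly discovered links back in
done
```

The problem shows up in the generate step: with -topN, one site can fill all 1000 slots; without it, the fetchlist grows unbounded.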

Thanks in advance,
Tim

btw I'm using nutch 0.6


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
