Hello, I'm having a bit of trouble controlling the fetch list. I have 3 sites that I'd like to fully crawl. I inject the 3 base URLs and add entries to the regex-urlfilter.txt file so that only URLs from those 3 sites are fetched.

I would like to fetch new pages in moderately sized chunks (each "round" taking less than 30 minutes). This makes development easier, and I would think it would be useful in production as well. At first I had the "bin/nutch generate" command add only the top 1000 URLs, as is done in the example. I may be wrong, but it appears that the rest of the URLs are then thrown out. That isn't necessarily a problem by itself, but sometimes 1 of my 3 sites takes up _all_ of the top 1000 slots, so the other 2 sites don't get crawled at all. If I don't use the -topN option to generate, then by the 4th round the fetch list is very large and the round takes much longer than my desired 30 minutes.
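For reference, my regex-urlfilter.txt entries look roughly like this (the hostnames below are placeholders, not my real sites). As I understand it, Nutch applies the +/- patterns in order and the first match wins, so the catch-all reject line goes last:

```
# accept anything under the three sites (hostnames are placeholders)
+^http://www\.site1\.example/
+^http://www\.site2\.example/
+^http://www\.site3\.example/
# reject everything else
-.
```

With that in place the filtering itself works fine; my question is only about how generate picks which of the filtered URLs go into each round's fetch list.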
Thanks in advance,
Tim

btw I'm using Nutch 0.6

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
