I first had the "bin/nutch generate" command add only the top 1000 URLs, as is done in the example. I may be wrong, but it appears that the rest of the URLs are then thrown out.
They are not thrown out, just delayed until the next generate.
This isn't necessarily a problem, but sometimes 1 of my 3 sites has _all_ the top 1000 URLs, so after that the other 2 sites don't get crawled at all. If I don't use the -topN option to generate, by the 4th round the fetchlist is very large and the round takes much longer than my desired 30 minutes.
Are you using link analysis? Perhaps it is doing you a disservice by prioritizing one site above the others. Try, in place of the analyze command, setting both fetchlist.score.by.link.count and indexer.boost.by.link.count to true. Please tell us how that works for you.
Doug
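For reference, a minimal sketch of how the two properties named above might be set, assuming they are overridden in conf/nutch-site.xml in the usual name/value property form. The file location and the <nutch-conf> root element are assumptions, not something stated in this thread; match whatever root element your own nutch-default.xml uses.

```xml
<?xml version="1.0"?>
<!-- Assumed location: conf/nutch-site.xml (site-specific overrides). -->
<!-- The root element name is an assumption; copy it from your nutch-default.xml. -->
<nutch-conf>

  <!-- Property name taken from the message above. -->
  <property>
    <name>fetchlist.score.by.link.count</name>
    <value>true</value>
  </property>

  <!-- Property name taken from the message above. -->
  <property>
    <name>indexer.boost.by.link.count</name>
    <value>true</value>
  </property>

</nutch-conf>
```

With these set, the analyze step would be skipped, as suggested above, and the generate step run as before.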
_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
