Tim Martin wrote:
I first had the "bin/nutch generate" command only add the top
1000 URLs as is done in the example. I may be wrong but it appears
that the rest of the URLs are then thrown out.

They are not thrown out, just delayed until the next generate.
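
To illustrate the deferral, here is a minimal Python sketch of a score-ranked top-N cut. It is a simplified model, not Nutch's actual generate code; the function name, URLs, and scores are made up for the example.

```python
def generate(pending, top_n):
    """Split (url, score) pairs into this round's fetchlist and the deferred rest."""
    # Rank by score, highest first; sorted() is stable, so ties keep input order.
    ranked = sorted(pending, key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n], ranked[top_n:]


# Hypothetical pending URLs: site "a" holds the two highest scores.
pending = [
    ("http://a.example/1", 0.9),
    ("http://a.example/2", 0.8),
    ("http://b.example/1", 0.5),
    ("http://c.example/1", 0.4),
]

# Round 1: only site "a" makes the cut; the others are deferred, not dropped.
fetchlist, deferred = generate(pending, top_n=2)

# Round 2: the deferred URLs are still candidates for the next generate.
next_fetchlist, still_deferred = generate(deferred, top_n=2)
```

In this model a site that holds all of the top scores fills the whole fetchlist, and the other sites only get a turn once those URLs have been fetched.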

This isn't necessarily
a problem but sometimes 1 of my 3 sites has _all_ the top 1000 URLs so
after that the other 2 sites don't get crawled at all. If I don't use
the -topN option to generate, by the 4th round the fetchlist is very
large and the round takes much longer than my desired 30 minutes.

Are you using link analysis? Perhaps it is doing you a disservice by prioritizing one site above the others. Try, in place of the analyze command, setting both fetchlist.score.by.link.count and indexer.boost.by.link.count to true. Please tell us how that works for you.
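
As a rough sketch, the two overrides might look like this. I am assuming they go in your local override file (typically conf/nutch-site.xml), inside whatever root element your release uses, so check the layout of your nutch-default.xml first.

```xml
<!-- Assumed location: conf/nutch-site.xml, inside the file's existing root element. -->
<property>
  <name>fetchlist.score.by.link.count</name>
  <value>true</value>
</property>
<property>
  <name>indexer.boost.by.link.count</name>
  <value>true</value>
</property>
```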


Doug
