Rod Taylor wrote:
With -numFetchers gone it appears I require a generate/update for each
fetch which serializes the process.

That's correct. It would be possible to implement something like the former behaviour by (as before) setting a page's nextFetch date a week out when it is added to a fetchlist. But in mapreduce, dbupdate and generate are much faster, both because the crawldb doesn't contain links (and is thus a lot smaller) and because the crawldb update is distributed. The downtime between fetcher cycles is therefore much shorter, and this technique may not be required. Previously dbupdate took nearly as long as fetching, so running the two in parallel made a big difference. But now, in my experience, the dbupdate/generate overhead is more like 10-20%. With mapreduce, what percentage of the time do you find that you're not fetching?
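
To illustrate the nextFetch idea, here is a toy sketch in Python. It is not Nutch code: the Page class, the generate function, and the hold parameter are all made up for illustration, and the crawldb is just an in-memory list. The point is only that pushing nextFetch out at generate time lets a second generate run before any update without selecting the same pages again.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Page:
    # Hypothetical stand-in for a crawldb entry.
    url: str
    next_fetch: datetime


def generate(crawldb, now, top_n, hold=timedelta(days=7)):
    """Pick up to top_n pages that are due, and mark them as held.

    Each selected page has its next_fetch pushed out by `hold`, so a
    later generate will skip it even if no update has run in between.
    """
    due = [p for p in crawldb if p.next_fetch <= now]
    fetchlist = due[:top_n]
    for page in fetchlist:
        page.next_fetch = now + hold
    return [page.url for page in fetchlist]


now = datetime(2005, 1, 1)
crawldb = [Page(f"http://example.com/{i}", now) for i in range(5)]

# Two generates back to back, with no update between them.
first = generate(crawldb, now, top_n=2)
second = generate(crawldb, now, top_n=2)
```

Here first and second contain different URLs, so the two fetchlists could be fetched in parallel. A page whose fetch fails would simply become due again once the hold period expires.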

Doug

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general