Re: prioritizing newly injected urls for fetching

Piotr Kosiorowski Wed, 27 Jul 2005 05:18:30 -0700

Hello Kamil,

Do you want to generate a fetchlist with urls that are present in WebDB but have not been fetched yet?

I am not sure what you are trying to achieve, but you can generate any fetchlist you want using the latest tool by Andrzej Bialecki (http://issues.apache.org/jira/browse/NUTCH-68) (I have not tried it myself). There was also a discussion on the nutch mailing list some time ago about the refetchonly param for the fetchlist generator; some ideas are still not implemented, but you can read how it works currently.

Regards
Piotr


Kamil Wnuk wrote:

Hi All,

I have recently started using nutch and I am looking for a method of
prioritizing urls injected during an ongoing crawl process (similar to
the "whole-web crawl" scenario described in the tutorial) so that they
are guaranteed to be included at the top of the next fetchlist
generated.  The purpose of this is so that I can give nutch the urls
of newly created web pages that I want indexed as quickly as possible.

I have looked through the nutch documentation and the mailing list
archives and have not been able to find a solution.  Does a good
method for doing this exist?

Thanks in advance,
Kamil
