Hello Kamil,

Do you want to generate a fetchlist with URLs that are present in the WebDB but have not been fetched yet?

I am not sure what you are trying to achieve, but you can generate any fetchlist you want using the latest tool by Andrzej Bialecki (http://issues.apache.org/jira/browse/NUTCH-68); I have not tried it myself. Some time ago there was also a discussion on the nutch mailing list about the refetchonly param for the fetchlist generator. Some of the ideas are still not implemented, but you can read how it works currently.
Regards,
Piotr


Kamil Wnuk wrote:
Hi All,

I have recently started using nutch and I am looking for a method of
prioritizing urls injected during an ongoing crawl process (similar to
the "whole-web crawl" scenario described in the tutorial) so that they
are guaranteed to be included at the top of the next fetchlist
generated.  The purpose of this is so that I can give nutch the urls
of newly created web pages that I want indexed as quickly as possible.

I have looked through the nutch documentation and the mailing list
archives and have not been able to find a solution.  Does a good
method for doing this exist?

Thanks in advance,
Kamil

