[Nutch-general] Re: prioritizing newly injected urls for fetching

Piotr Kosiorowski Wed, 27 Jul 2005 05:19:03 -0700

Hello Kamil,

Do you want to generate a fetchlist with urls that are present in WebDB but have not been fetched yet?

I am not sure what you are trying to achieve, but you can generate any fetchlist you want using the latest tool by Andrzej Bialecki (http://issues.apache.org/jira/browse/NUTCH-68) (I have not tried it myself). There was also a discussion on the nutch mailing list some time ago about the refetchonly param for the fetchlist generator - some of the ideas are still not implemented, but you can read how it works currently.

Regards
Piotr


Kamil Wnuk wrote:

Hi All,

I have recently started using nutch and I am looking for a method of
prioritizing urls injected during an ongoing crawl process (similar to
the "whole-web crawl" scenario described in the tutorial) so that they
are guaranteed to be included at the top of the next fetchlist
generated.  The purpose of this is so that I can give nutch the urls
of newly created web pages that I want indexed as quickly as possible.

I have looked through the nutch documentation and the mailing list
archives and have not been able to find a solution.  Does a good
method for doing this exist?

Thanks in advance,
Kamil




