Otis Gospodnetic wrote:
Hi,

Is there an existing method for generating a segment/fetchlist containing only 
URLs that have not yet been fetched?
I'm asking because I can imagine a situation where one has a large, "old" CrawlDb that 
"knows" about a lot of URLs (the ones with "db_unfetched" status if you run -stats). 
In such a situation one may prefer to fetch the not-yet-fetched URLs first, and only after that 
include URLs that need to be refetched in the newly generated segments.


I don't think there is currently a way to generate only unfetched URLs, but it does sound like an interesting bit of functionality.

One can write a custom Generator, or perhaps modify the existing one to add 
this option, but is there an existing mechanism for this?

Modifying the Generator would probably be best; let me look into what it would take to do this. Maybe we can get it into 1.0.

Dennis


If not, does this sound like something that should be added to the existing 
Generator and invoked via a command-line arg, say -unfetchedOnly?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
