Hi,

Is there an existing method for generating a segment/fetchlist containing only 
URLs that have not yet been fetched?
I'm asking because I can imagine a situation where one has a large, "old" 
CrawlDb that "knows" about a lot of URLs (the ones with "db_unfetched" status 
in the readdb -stats output). In that situation, a person may prefer to fetch 
only the not-yet-fetched URLs first, and only afterwards include URLs that 
need to be refetched in newly generated segments.

One can write a custom Generator, or perhaps modify the existing one to add 
this option, but is there an existing mechanism for this?

If not, does this sound like something that should be added to the existing 
Generator and invoked via a command-line arg, say -unfetchedOnly ?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
