Hi,

I have the same requirement. Can anybody offer a solution?
Do I need to recrawl it?

thanks
Ian

------------------------------------------
Hi,

Is there an existing method for generating a segment/fetchlist containing
only URLs that have not yet been fetched?
I'm asking because I can imagine a situation where one has a large and "old"
CrawlDb that "knows" about a lot of URLs (the ones with "db_unfetched"
status if you run -stats). In such a situation a person may prefer to
fetch only the yet-unfetched URLs first, and only after that include URLs
that need to be refetched in the newly generated segments.

One can write a custom Generator, or perhaps modify the existing one to add
this option, but is there an existing mechanism for this?

If not, does this sound like something that should be added to the existing
Generator and invoked via a command-line arg, say -unfetchedOnly ?
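
To make the idea concrete, here is a rough sketch of the selection order
described above. It is plain Python over an in-memory list, not Nutch code:
the function name select_fetchlist, the unfetched_only flag, the "due" field
and the record layout are all made up for illustration. Only the
"db_unfetched" status name comes from the -stats output.

```python
from typing import Dict, List


def select_fetchlist(entries: List[Dict], top_n: int,
                     unfetched_only: bool = False) -> List[str]:
    """Pick up to top_n URLs, never-fetched ones first.

    Hypothetical model: each entry is a dict with "url", "status" and
    "due" (True if a previously fetched URL is due for a refetch).
    """
    # URLs the db knows about but has never fetched.
    unfetched = [e["url"] for e in entries if e["status"] == "db_unfetched"]
    if unfetched_only:
        return unfetched[:top_n]

    # Otherwise append URLs that are due for a refetch after the unfetched ones.
    due = [e["url"] for e in entries
           if e["status"] != "db_unfetched" and e["due"]]
    return (unfetched + due)[:top_n]


# Toy stand-in for a CrawlDb (assumed data, for illustration only).
crawl_db = [
    {"url": "http://example.com/a", "status": "db_fetched", "due": True},
    {"url": "http://example.com/b", "status": "db_unfetched", "due": False},
    {"url": "http://example.com/c", "status": "db_fetched", "due": False},
    {"url": "http://example.com/d", "status": "db_unfetched", "due": False},
]

unfetched_first = select_fetchlist(crawl_db, top_n=3)
only_unfetched = select_fetchlist(crawl_db, top_n=3, unfetched_only=True)
```

With the flag set, only the two never-fetched URLs are selected. Without it,
they come first and the one URL due for a refetch fills the remaining slot.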

Thanks,
Otis
--

-- 
View this message in context: 
http://www.nabble.com/Fetching-only-unfetched-URLs-tp18058588p20831431.html
Sent from the Nutch - User mailing list archive at Nabble.com.
