I think this would be another good piece of functionality, as I would
like to continue using the generate-fetch-update methodology while
mimicking the behavior of Crawl, i.e., being able to grab every page
at a specific domain.
-John
On Dec 4, 2008, at 8:40 AM, Dennis Kubes wrote:
Otis Gospodnetic wrote:
Hi,
Is there an existing method for generating a segment/fetchlist
containing only URLs that have not yet been fetched?
I'm asking because I can imagine a situation where one has a large
and "old" CrawlDb that "knows" about a lot of URLs (the ones with
"db_unfetched" status if you run -stats) and in such a situation a
person may prefer to fetch only the yet-unfetched URLs first, and
only after that include URLs that need to be refetched in the newly
generated segments.
I don't think a current method exists to do only unfetched URLs, but
it does sound like an interesting bit of functionality.
One can write a custom Generator, or perhaps modify the existing
one to add this option, but is there an existing mechanism for this?
Generator would probably be best, let me look into what it would
take to do this. Maybe we can get it into 1.0.
Dennis
If not, does this sound like something that should be added to the
existing Generator and invoked via a command-line arg, say
-unfetchedOnly?
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
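To make the proposed -unfetchedOnly behavior concrete, here is a minimal, self-contained Java sketch of the filtering step such an option would perform during generation. It does not use the real Nutch classes; the Entry record and the status constants stand in for org.apache.nutch.crawl.CrawlDatum and its db status codes, and selectUnfetched stands in for what the Generator's selection phase would do when the hypothetical flag is set.

```java
import java.util.List;
import java.util.stream.Collectors;

public class UnfetchedFilter {
    // Simplified stand-ins for Nutch's CrawlDatum status codes
    // (the real constants live in org.apache.nutch.crawl.CrawlDatum).
    static final byte STATUS_DB_UNFETCHED = 1;
    static final byte STATUS_DB_FETCHED   = 2;

    // A CrawlDb entry: a URL plus its fetch status.
    record Entry(String url, byte status) {}

    // Keep only entries that have never been fetched, mirroring what a
    // hypothetical -unfetchedOnly switch in the Generator would do
    // before scoring and segment selection.
    static List<Entry> selectUnfetched(List<Entry> crawlDb) {
        return crawlDb.stream()
                .filter(e -> e.status() == STATUS_DB_UNFETCHED)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Entry> db = List.of(
                new Entry("http://example.com/a", STATUS_DB_FETCHED),
                new Entry("http://example.com/b", STATUS_DB_UNFETCHED));
        for (Entry e : selectUnfetched(db)) {
            System.out.println(e.url());
        }
    }
}
```

In a real patch the check would happen inside the Generator's map phase, so already-fetched URLs never reach the top-N selection at all.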