Hi John,
I am new to Nutch.
Can you tell me how to deal with unfetched URLs? If I run a recrawl script,
will the unfetched URLs be handled? What about the URLs that have already
been fetched; will they be updated or refetched as well?
Does the generate-fetch-update methodology mean running a new crawl and
merging it with the older one?
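For reference, my understanding is that one generate-fetch-update round looks
roughly like the following (the crawl/ directory layout and the -topN value
are just illustrative, not something from this thread):

```sh
# Generate a fetchlist from the crawldb into a new segment
bin/nutch generate crawl/crawldb crawl/segments -topN 1000

# Fetch the newest segment
s=$(ls -d crawl/segments/* | tail -1)
bin/nutch fetch $s

# Fold the fetch results back into the crawldb,
# so newly discovered links become db_unfetched entries
bin/nutch updatedb crawl/crawldb $s
```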
Thanks
ian
--------------------------------------------------
From: "John Martyniak" <[EMAIL PROTECTED]>
Sent: Thursday, December 04, 2008 2:01 PM
To: <[email protected]>
Subject: Re: Fetching only unfetched URLs
I think this would be another good piece of functionality. I would like to
continue using the generate-fetch-update methodology but mimic the
functionality of Crawl, so that I can grab every page at a specific domain.
-John
On Dec 4, 2008, at 8:40 AM, Dennis Kubes wrote:
Otis Gospodnetic wrote:
Hi,
Is there an existing method for generating a segment/fetchlist
containing only URLs that have not yet been fetched?
I'm asking because I can imagine a situation where one has a large and
"old" CrawlDb that "knows" about a lot of URLs (the ones with
"db_unfetched" status if you run -stats) and in such a situation a
person may prefer to fetch only the yet-unfetched URLs first, and only
after that include URLs that need to be refetched in the newly
generated segments.
I don't think a current method exists to do only unfetched URLs, but it
does sound like an interesting bit of functionality.
One can write a custom Generator, or perhaps modify the existing one to
add this option, but is there an existing mechanism for this?
Generator would probably be best, let me look into what it would take to
do this. Maybe we can get it into 1.0.
Dennis
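Conceptually, an unfetched-only option would just add a status filter when the
Generator selects entries from the CrawlDb. Below is a minimal, self-contained
sketch of that filtering idea; it does not use the actual Nutch classes (the
class name, status constants, and the Map-based "crawldb" here are all
illustrative). In real Nutch code the check would be against
CrawlDatum.getStatus() == CrawlDatum.STATUS_DB_UNFETCHED inside the
Generator's Selector.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class UnfetchedFilter {
    // Stand-ins for CrawlDatum status codes (values illustrative).
    static final byte DB_UNFETCHED = 1;
    static final byte DB_FETCHED = 2;

    // Keep only URLs whose status is db_unfetched, the way a
    // hypothetical -unfetchedOnly option in the Generator might.
    static List<String> selectUnfetched(Map<String, Byte> crawlDb) {
        return crawlDb.entrySet().stream()
                .filter(e -> e.getValue() == DB_UNFETCHED)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Byte> db = new LinkedHashMap<>();
        db.put("http://example.com/a", DB_FETCHED);
        db.put("http://example.com/b", DB_UNFETCHED);
        db.put("http://example.com/c", DB_UNFETCHED);
        // Only /b and /c would make it into the generated fetchlist.
        System.out.println(selectUnfetched(db));
    }
}
```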
If not, does this sound like something that should be added to the
existing Generator and invoked via a command-line arg, say -unfetchedOnly?
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch