Re: Fetching only unfetched URLs

John Martyniak Thu, 04 Dec 2008 10:19:21 -0800

Ian,

I am pretty new to Nutch myself. And I think that the unfetched URLs are what Dennis is going to look into.

There are two main ways of fetching URLs. The first is to use bin/nutch crawl, which handles all of the individual steps of getting URLs.

The second way is to run the bin/nutch generate, bin/nutch fetch, bin/nutch updatedb, bin/nutch index commands to do all of the steps by hand (or by program). I think that they all run the same stuff, but running the commands individually is better suited to whole-web crawling, whereas bin/nutch crawl is better suited for intranet or enterprise crawling.
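
For illustration, one round of the by-hand cycle might look like the sketch below. This is only a sketch under assumptions: the crawl/crawldb and crawl/segments paths, the -topN value of 1000, and the helper names run and crawl_round are made up for the example, and it assumes the segments live on the local filesystem (on HDFS the newest segment would have to be found differently). The link inversion and indexing steps are left out.

```shell
#!/bin/sh
# Sketch of one generate / fetch / updatedb round.
# All paths and defaults below are assumptions for illustration.
NUTCH=${NUTCH:-bin/nutch}
CRAWLDB=${CRAWLDB:-crawl/crawldb}
SEGMENTS=${SEGMENTS:-crawl/segments}

# Hypothetical helper: print the command, and run it unless DRY_RUN=1.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

# Hypothetical helper: one pass over the three steps.
crawl_round() {
    # 1. Generate a fetchlist (a new segment) from the CrawlDb.
    run "$NUTCH" generate "$CRAWLDB" "$SEGMENTS" -topN "${TOPN:-1000}" || return 1

    # 2. Pick the newest segment. Segment directories are named by
    #    timestamp, so the last name in sorted order is the newest.
    #    This assumes a local filesystem.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        segment="$SEGMENTS/<newest>"
    else
        segment="$SEGMENTS/$(ls "$SEGMENTS" | sort | tail -n 1)"
    fi

    # 3. Fetch that segment, then fold the results back into the CrawlDb.
    run "$NUTCH" fetch "$segment" || return 1
    run "$NUTCH" updatedb "$CRAWLDB" "$segment"
}
```

Calling crawl_round with DRY_RUN=1 set only prints the three commands, which is a cheap way to check the paths before running a real round. Repeating crawl_round several times is roughly what the depth setting of the one-shot crawl does for you.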


Hope this helps.

-John


On Dec 3, 2008, at 9:15 AM, Ian.huang wrote:

hi, John

I am a newbie to Nutch.
Can you tell me how to deal with un-fetched URLs? If I run a recrawl script, will un-fetched URLs be handled? How about other, already fetched URLs? Will they be updated or refetched as well?
Does the generate-fetch-update methodology mean running a new crawl and merging it with the older one?
Thanks
ian

--------------------------------------------------
From: "John Martyniak" <[EMAIL PROTECTED]>
Sent: Thursday, December 04, 2008 2:01 PM
To: <[email protected]>
Subject: Re: Fetching only unfetched URLs
I think that this would be another good piece of functionality. As I would like to continue to use the generate-fetch-update methodology but would like to mimic the functionality of Crawl, in that I can grab every page at a specific domain.
-John

On Dec 4, 2008, at 8:40 AM, Dennis Kubes wrote:
Otis Gospodnetic wrote:
Hi,
Is there an existing method for generating a segment/fetchlist containing only URLs that have not yet been fetched? I'm asking because I can imagine a situation where one has a large and "old" CrawlDb that "knows" about a lot of URLs (the ones with "db_unfetched" status if you run -stats) and in such a situation a person may prefer to fetch only the yet-unfetched URLs first, and only after that include URLs that need to be refetched in the newly generated segments.
I don't think a current method exists to do only unfetched URLs, but it does sound like an interesting bit of functionality.
One can write a custom Generator, or perhaps modify the existing one to add this option, but is there an existing mechanism for this?
Generator would probably be best, let me look into what it would take to do this. Maybe we can get it into 1.0.
Dennis
If not, does this sound like something that should be added to the existing Generator and invoked via a command-line arg, say -unfetchedOnly ?
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
