Thank you, Dennis and John
I successfully crawled a website and got a log as follows:
2008-12-02 21:31:13,853 INFO crawl.CrawlDbReader - TOTAL urls: 27274
....
2008-12-02 21:31:13,856 INFO crawl.CrawlDbReader - status 1 (db_unfetched): 15568
2008-12-02 21:31:13,856 INFO crawl.CrawlDbReader - status 2 (db_fetched): 4393
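For reference, those numbers come from the CrawlDb stats command, roughly like this (assuming the crawl lives under crawl/; substitute your own crawl directory):

  # print CrawlDb statistics: TOTAL urls, db_unfetched, db_fetched, ...
  bin/nutch readdb crawl/crawldb -stats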
Here I have two different goals: I need to re-crawl those URLs which were not
finished before (the unfetched URLs), and I also want to inject some new URLs
as unfetched URLs.
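For the second goal, I assume injecting works the same as for a fresh crawl, run against the existing CrawlDb (paths here are only an example; new_urls/ is a hypothetical directory of seed list files):

  # add new seed urls to the existing CrawlDb; they start out with db_unfetched status
  bin/nutch inject crawl/crawldb new_urls/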
So I used the recrawl.sh script from the wiki to do this re-crawling job. I
noticed the final step is merging the newly generated index with the old
index. The logs regarding the index merge are as follows:
2008-12-03 09:41:21,118 INFO indexer.IndexingFilters - Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
2008-12-03 09:41:21,371 INFO indexer.Indexer - Optimizing index.
2008-12-03 09:41:22,414 INFO indexer.Indexer - Indexer: done
2008-12-03 09:41:25,543 INFO indexer.DeleteDuplicates - Dedup: starting
2008-12-03 09:41:25,620 INFO indexer.DeleteDuplicates - Dedup: adding
indexes in: c3/newindexes
2008-12-03 09:41:31,462 INFO indexer.DeleteDuplicates - Dedup: done
2008-12-03 09:41:34,599 INFO indexer.IndexMerger - merging indexes to:
c3/index
2008-12-03 09:41:34,618 INFO indexer.IndexMerger - Adding
c3/newindexes/part-00000
2008-12-03 09:41:34,833 INFO indexer.IndexMerger - done merging
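From these lines, the merge appears to be the IndexMerger step of the script, which I assume is invoked roughly like this (a sketch only; the actual recrawl.sh may pass a -workingdir option or different paths):

  # merge the freshly built newindexes into the target Lucene index
  bin/nutch merge c3/index c3/newindexes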
I saw a merge-output folder added into the index, but the index data itself
did not change. In addition, I am sure that many previously unfetched URLs
were fetched and indexed.
Can you tell me what happened? Am I missing something?
In addition, can I set db.max.outlinks.per.page to -1 and end up with no
unfetched URLs during crawling? I do not want any pages to be missed :)
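For reference, I mean overriding the property in conf/nutch-site.xml, something like this (my understanding is that a negative value removes the per-page outlink limit):

  <!-- remove the per-page outlink cap (default 100); a negative value
       keeps every outlink found on a page -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>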
Thank you very much
Ian
--------------------------------------------------
From: "Dennis Kubes" <[EMAIL PROTECTED]>
Sent: Thursday, December 04, 2008 6:58 PM
To: <[email protected]>
Subject: Re: Fetching only unfetched URLs
It depends on what you mean by unfetched url. There are three basic types of
unfetched urls.
1) The new urls that we parse off a webpage during fetching/parsing and
that are added to the CrawlDb
2) Redirected urls that are not immediately fetched. If the
http.redirect.max config variable in nutch-*.xml is set to 0 (see the config
sketch after this list), then any redirect is queued to be fetched during the
next fetching round, similar to new urls we parse off of a webpage.
3) Urls that have crossed their fetching expiration date in crawldb and
will be queued for refetching.
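For example, case 2 is controlled by this property; a minimal nutch-site.xml override would look roughly like this (a sketch, assuming the default behavior described above):

  <!-- with 0, redirects are not followed during the fetch; the redirect
       target is recorded and queued for the next generate/fetch round -->
  <property>
    <name>http.redirect.max</name>
    <value>0</value>
  </property>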
In Nutch there really isn't a concept of re-crawling where you would
update *only* certain urls. There are the concepts of fetching, merging,
and queuing urls for fetching. When we talk about the generate-fetch-update
cycle we are talking about running multiple fetch (i.e. crawl) cycles. Each
of these produces a segment. URLs can be parsed from those segments and
inserted/updated into the CrawlDb. The CrawlDb is used to generate new lists
of urls to fetch, and the process starts all over again.
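As a concrete sketch, one such round looks roughly like this (paths are only an example, assuming a crawl/ layout):

  # one generate-fetch-update round; repeat as needed
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  segment=crawl/segments/`ls crawl/segments | tail -1`   # the newly generated segment
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment               # feed results back into the CrawlDb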
Segments can be merged (and then indexed) together. The CrawlDb is global
to all segments (although multiple crawldbs can be merged). URLs in the
CrawlDb that have been successfully fetched have a last-fetched time, and
different FetchSchedule implementations determine the correct time to
re-fetch those urls. URLs that have not been fetched should be available for
fetching immediately. URLs whose fetch attempts have failed are only retried
on an increasing back-off scale that determines when the next fetch attempt
should be made.
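For completeness, those merges map to commands roughly like these (a sketch; the output paths are hypothetical):

  # merge several segments into a single new segment
  bin/nutch mergesegs crawl/merged_segments -dir crawl/segments
  # merge two crawldbs into one
  bin/nutch mergedb crawl/merged_crawldb crawl/crawldb other_crawl/crawldb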
Dennis
Ian.huang wrote:
Hi John,
I am a newbie with Nutch.
Can you tell me how to deal with un-fetched urls? If I run a recrawl
script, will un-fetched urls be handled? What about the other, already
fetched urls? Will they be updated or refetched as well?
Does the generate-fetch-update methodology mean running a new crawl and
merging it with the older one?
Thanks
ian
--------------------------------------------------
From: "John Martyniak" <[EMAIL PROTECTED]>
Sent: Thursday, December 04, 2008 2:01 PM
To: <[email protected]>
Subject: Re: Fetching only unfetched URLs
I think that this would be another good piece of functionality, as I
would like to continue to use the generate-fetch-update methodology but
would also like to mimic the functionality of Crawl, in that I can grab
every page at a specific domain.
-John
On Dec 4, 2008, at 8:40 AM, Dennis Kubes wrote:
Otis Gospodnetic wrote:
Hi,
Is there an existing method for generating a segment/fetchlist
containing only URLs that have not yet been fetched?
I'm asking because I can imagine a situation where one has a large
and "old" CrawlDb that "knows" about a lot of URLs (the ones with
"db_unfetched" status if you run -stats). In such a situation a
person may prefer to fetch only the yet-unfetched URLs first, and
only after that include URLs that need to be refetched in the newly
generated segments.
I don't think a current method exists to do only unfetched URLs, but
it does sound like an interesting bit of functionality.
One can write a custom Generator, or perhaps modify the existing one
to add this option, but is there an existing mechanism for this?
Generator would probably be best, let me look into what it would take
to do this. Maybe we can get it into 1.0.
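In the meantime, one rough workaround is to dump the CrawlDb as text and pick out the db_unfetched entries by hand, something like this (a sketch only, not a built-in option; the exact dump format may vary between versions):

  # dump the CrawlDb and list urls whose status is db_unfetched
  bin/nutch readdb crawl/crawldb -dump crawldb_dump
  cat crawldb_dump/part-* | grep -B1 "db_unfetched" | grep "^http" | cut -f1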
Dennis
If not, does this sound like something that should be added to the
existing Generator and invoked via a command-line arg, say
-unfetchedOnly?
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch