Thank you, Dennis and John
I successfully crawled a website and got a log as follows:
2008-12-02 21:31:13,853 INFO crawl.CrawlDbReader - TOTAL urls: 27274
....
2008-12-02 21:31:13,856 INFO crawl.CrawlDbReader - status 1 (db_unfetched): 15568
2008-12-02 21:31:13,856 INFO crawl.CrawlDbReader - status 2 (db_fetched): 4393
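For reference, those numbers come from the CrawlDb stats command, roughly like this (assuming the crawl lives under crawl/; substitute your own crawl directory):

  # print CrawlDb statistics: TOTAL urls, db_unfetched, db_fetched, ...
  bin/nutch readdb crawl/crawldb -stats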
Here I have two different goals: I need to re-crawl those URLs which were not
finished before (the unfetched URLs), and I also want to inject some new URLs
as unfetched URLs.
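For the second goal, I assume injecting works the same as for a fresh crawl, run against the existing CrawlDb (paths here are only an example; new_urls/ is a hypothetical directory of seed list files):

  # add new seed urls to the existing CrawlDb; they start out with db_unfetched status
  bin/nutch inject crawl/crawldb new_urls/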
So I used the recrawl.sh script from the wiki to do this re-crawling job. I
noticed the final step is merging the newly generated index with the old
index. The logs regarding the index merge are as follows:
2008-12-03 09:41:21,118 INFO indexer.IndexingFilters - Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
2008-12-03 09:41:21,371 INFO indexer.Indexer - Optimizing index.
2008-12-03 09:41:22,414 INFO indexer.Indexer - Indexer: done
2008-12-03 09:41:25,543 INFO indexer.DeleteDuplicates - Dedup: starting
2008-12-03 09:41:25,620 INFO indexer.DeleteDuplicates - Dedup: adding
indexes in: c3/newindexes
2008-12-03 09:41:31,462 INFO indexer.DeleteDuplicates - Dedup: done
2008-12-03 09:41:34,599 INFO indexer.IndexMerger - merging indexes to:
c3/index
2008-12-03 09:41:34,618 INFO indexer.IndexMerger - Adding
c3/newindexes/part-00000
2008-12-03 09:41:34,833 INFO indexer.IndexMerger - done merging
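From these lines, the merge appears to be the IndexMerger step of the script, which I assume is invoked roughly like this (a sketch only; the actual recrawl.sh may pass a -workingdir option or different paths):

  # merge the freshly built newindexes into the target Lucene index
  bin/nutch merge c3/index c3/newindexes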
I saw a merge-output folder added into the index, but the index data itself
did not change. In addition, I am sure that many previously unfetched URLs
were fetched and indexed.
Can you tell me what happened? Am I missing something?
In addition, can I set db.max.outlinks.per.page to -1 and end up with no
unfetched URLs during crawling? I do not want any pages to be missed :)
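For reference, I mean overriding the property in conf/nutch-site.xml, something like this (my understanding is that a negative value removes the per-page outlink limit):

  <!-- remove the per-page outlink cap (default 100); a negative value
       keeps every outlink found on a page -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>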
Thank you very much
Ian
--------------------------------------------------
From: "Dennis Kubes" <[EMAIL PROTECTED]>
Sent: Thursday, December 04, 2008 6:58 PM
To: <[email protected]>
Subject: Re: Fetching only unfetched URLs
It depends on what you mean by unfetched url. There are three basic types of
unfetched urls.
1) The new urls that we parse off a webpage during fetching/parsing and
that are added to the CrawlDb
2) Redirected urls that are not immediately fetched. If the
http.redirect.max config variable in nutch-*.xml is set to 0 (see the config
sketch after this list), then any redirect is queued to be fetched during the
next fetching round, similar to new urls we parse off of a webpage.
3) Urls that have crossed their fetching expiration date in crawldb and
will be queued for refetching.
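For example, case 2 is controlled by this property; a minimal nutch-site.xml override would look roughly like this (a sketch, assuming the default behavior described above):

  <!-- with 0, redirects are not followed during the fetch; the redirect
       target is recorded and queued for the next generate/fetch round -->
  <property>
    <name>http.redirect.max</name>
    <value>0</value>
  </property>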
In Nutch there really isn't a concept of re-crawling where you would
update *only* certain urls. There are the concepts of fetching, merging,
and queuing urls for fetching. When we talk about the generate-fetch-update
cycle we are talking about running multiple fetch (i.e. crawl) cycles. Each
of these produces a segment. URLs can be parsed from those segments and
inserted/updated into the CrawlDb. The CrawlDb is used to generate new lists
of urls to fetch, and the process starts all over again.
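As a concrete sketch, one such round looks roughly like this (paths are only an example, assuming a crawl/ layout):

  # one generate-fetch-update round; repeat as needed
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  segment=crawl/segments/`ls crawl/segments | tail -1`   # the newly generated segment
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment               # feed results back into the CrawlDb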
Segments can be merged (and then indexed) together. The CrawlDb is global
to all segments (although multiple crawldbs can be merged). URLs in the
CrawlDb that have been successfully fetched have a last-fetched time, and
different FetchSchedule implementations determine the correct time to
re-fetch those urls. URLs that have not been fetched should be available for
fetching immediately. URLs whose fetch attempts have failed are only retried
on an increasing back-off scale that determines when the next fetch attempt
should be made.
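For completeness, those merges map to commands roughly like these (a sketch; the output paths are hypothetical):

  # merge several segments into a single new segment
  bin/nutch mergesegs crawl/merged_segments -dir crawl/segments
  # merge two crawldbs into one
  bin/nutch mergedb crawl/merged_crawldb crawl/crawldb other_crawl/crawldb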
Dennis
Ian.huang wrote:
Hi John,
I am a newbie with Nutch.
Can you tell me how to deal with un-fetched urls? If I run a recrawl
script, will un-fetched urls be handled? What about the other, already
fetched urls? Will they be updated or refetched as well?
Does the generate-fetch-update methodology mean running a new crawl and
merging it with the older one?
Thanks
ian
--------------------------------------------------
From: "John Martyniak" <[EMAIL PROTECTED]>
Sent: Thursday, December 04, 2008 2:01 PM
To: <[email protected]>
Subject: Re: Fetching only unfetched URLs
I think that this would be another good piece of functionality, as I
would like to continue to use the generate-fetch-update methodology but
would also like to mimic the functionality of Crawl, in that I can grab
every page at a specific domain.
-John
On Dec 4, 2008, at 8:40 AM, Dennis Kubes wrote:
Otis Gospodnetic wrote:
Hi,
Is there an existing method for generating a segment/fetchlist
containing only URLs that have not yet been fetched?
I'm asking because I can imagine a situation where one has a large
and "old" CrawlDb that "knows" about a lot of URLs (the ones with
"db_unfetched" status if you run -stats). In such a situation a
person may prefer to fetch only the yet-unfetched URLs first, and
only after that include URLs that need to be refetched in the newly
generated segments.
I don't think a current method exists to do only unfetched URLs, but
it does sound like an interesting bit of functionality.
One can write a custom Generator, or perhaps modify the existing one
to add this option, but is there an existing mechanism for this?
Generator would probably be best, let me look into what it would take
to do this. Maybe we can get it into 1.0.
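In the meantime, one rough workaround is to dump the CrawlDb as text and pick out the db_unfetched entries by hand, something like this (a sketch only, not a built-in option; the exact dump format may vary between versions):

  # dump the CrawlDb and list urls whose status is db_unfetched
  bin/nutch readdb crawl/crawldb -dump crawldb_dump
  cat crawldb_dump/part-* | grep -B1 "db_unfetched" | grep "^http" | cut -f1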
Dennis
If not, does this sound like something that should be added to the
existing Generator and invoked via a command-line arg, say
-unfetchedOnly?
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch