Re: recrawl question

Mathijs Homminga Tue, 12 Dec 2006 13:38:03 -0800

Hi Nancy,

Instead of recrawling, you are actually continuing the initial crawl.

I suspect that after your initial crawl you have 18 fetched urls in your crawldb. However, there are probably also a lot of unfetched urls (the outlinks from depth 2).

You can use 'nutch readdb' to inspect your crawldb.
At this moment, you'll have two segments (one for each depth).

With your recrawl command you are telling Nutch to fetch the 100 best-scoring unfetched urls from the crawldb. This might include the 18 urls which were fetched in the initial crawl, since you used -adddays 31, but it will also include a lot of the unfetched outlinks from the initial crawl. The scoring determines which url comes first. If you have not installed your own scoring plugin, then it uses the OPIC scoring filter. It is possible (even likely) that your start url is not on top.
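
To make the selection step concrete, here is a toy model in Python. It is only a sketch of the idea and not Nutch's actual Generator code: the Entry class, the generate function, the scores, the 30-day refetch interval, and the /a and /b urls are all made up for illustration.

```python
from dataclasses import dataclass

DAY = 24 * 60 * 60  # seconds in a day


@dataclass
class Entry:
    url: str
    score: float       # made-up score; real scores come from the scoring filter
    fetched: bool
    next_fetch: float  # time (in seconds) at which the url is due for fetching


def generate(crawldb, now, top_n, add_days=0):
    """Toy selection: take every url due at (now + add_days), best score first."""
    cutoff = now + add_days * DAY
    due = [e for e in crawldb if e.next_fetch <= cutoff]
    due.sort(key=lambda e: e.score, reverse=True)
    return [e.url for e in due[:top_n]]


now = 0.0
crawldb = [
    # Fetched in the initial crawl; assumed to be due again after 30 days.
    Entry("http://www.saic.com/", 1.0, True, now + 30 * DAY),
    # Hypothetical unfetched outlinks; they are due immediately.
    Entry("http://www.saic.com/a", 1.5, False, now),
    Entry("http://www.saic.com/b", 0.2, False, now),
]

without_adddays = generate(crawldb, now, top_n=100)
with_adddays = generate(crawldb, now, top_n=100, add_days=31)

print(without_adddays)  # only the unfetched outlinks are due
print(with_adddays)     # the start url is due as well, but is not first
```

In this model the already-fetched start url only becomes eligible once add_days pushes the cutoff past its next fetch time, and even then an outlink with a higher score is selected ahead of it.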

The second crawl does one cycle and results in one extra segment. Three in total.


Mathijs


Nancy Snyder wrote:

Hi
 I am using nutch-0.8.1 and copied the recrawl script from the web.
I did a simple crawl on url http://www.saic.com at depth 2 with -topN 100 and got 18 records. But when I do a recrawl with -topN 100 and -adddays 31 (forcing all pages to be refetched), I get 132 documents. The initial crawl is fast. And then I do a recrawl (just for testing purposes) and it takes a lot longer and I get lots more documents.

My initial crawl command was:
/opt/webdevel/pfna/nutch-0.8.1/bin/nutch crawl /opt/webdevel/pfna/lucene_crawls/nsnyder/saic -dir /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl -depth 2 -topN 100
and the url file is called nutch with
   http://www.saic.com
in it.

And the recrawl command:
/opt/webdevel/pfna/nutch-0.8.1/bin/nutch generate /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/crawldb /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments -topN 100 -adddays 31
I notice in the initial crawl log file, the fetching starts with the original url for the crawl.
>> Generator: starting <<
>> Generator: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211102806 <<
>> Generator: Selecting best-scoring urls due for fetch. <<
>> Generator: Partitioning selected urls by host, for politeness. <<
>> Generator: done. <<
>> Fetcher: starting <<
>> Fetcher: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211102806 <<
>> Fetcher: threads: 10 <<
>> fetching http://www.saic.com/ <<
>> Fetcher: done <<

But the recrawl starts with a different url.
>> Generator: starting <<
>> Generator: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211103350 <<
>> Generator: Selecting best-scoring urls due for fetch. <<
>> Generator: Partitioning selected urls by host, for politeness. <<
>> Generator: done. <<
>> ** segment = '/opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211103350' ** <<
>> Fetcher: starting <<
>> Fetcher: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211103350 <<
>> Fetcher: threads: 10 <<
>> fetching http://www.saic.com/employees/).join( <<

Shouldn't the first fetched url be the same to get the same results?
Plus when the crawls are done, the initial crawl had two segment directories under crawl/segments:
[EMAIL PROTECTED] segments]$ ls
20061211102806/  20061211102815/

But the recrawl had three:
[EMAIL PROTECTED] segments]$ ls
20061211102806/  20061211102815/  20061211105226/
And if I force a recrawl of everything (just to test it out), shouldn't it get the same number of documents and segment directories??
NANCY
