recrawl question

Nancy Snyder Mon, 11 Dec 2006 08:36:29 -0800

Hi
 I am using nutch-0.8.1 and copied the recrawl script from the web.

I did a simple crawl on the URL http://www.saic.com at depth 2 with -topN 100 and got 18 records. But when I do a recrawl with -topN 100 and -adddays 31 (forcing all pages to be refetched), I get 132 documents. The initial crawl is fast. Then I do a recrawl (just for testing purposes), and it takes a lot longer and I get lots more documents.

My initial crawl command was:

/opt/webdevel/pfna/nutch-0.8.1/bin/nutch crawl /opt/webdevel/pfna/lucene_crawls/nsnyder/saic -dir /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl -depth 2 -topN 100


The URL file is called nutch and contains:
   http://www.saic.com

And the recrawl command:

/opt/webdevel/pfna/nutch-0.8.1/bin/nutch generate /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/crawldb /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments -topN 100 -adddays 31
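
For context, generate is only the first step of a recrawl round. The sketch below shows how one round might be strung together, assuming the copied recrawl script follows the usual generate, fetch, updatedb sequence (the script itself is not shown here, so this is a guess at its shape). The run and newest_segment helpers and the DRY_RUN switch are hypothetical names added for illustration; by default the commands are only printed, not executed.

```shell
#!/bin/sh
# Paths copied from the commands above.
NUTCH=/opt/webdevel/pfna/nutch-0.8.1/bin/nutch
CRAWL=/opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl

# run CMD...: print the command unless DRY_RUN=0 (hypothetical switch).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# newest_segment DIR: print the most recent segment directory under DIR.
# Segment names are timestamps, so lexical order is chronological order.
newest_segment() {
  ls -d "$1"/* 2>/dev/null | sort | tail -n 1
}

# One recrawl round: select URLs, fetch them, fold the results into the crawldb.
recrawl_round() {
  run "$NUTCH" generate "$CRAWL/crawldb" "$CRAWL/segments" -topN 100 -adddays 31
  segment=$(newest_segment "$CRAWL/segments")
  run "$NUTCH" fetch "$segment"
  run "$NUTCH" updatedb "$CRAWL/crawldb" "$segment"
}
```

Calling recrawl_round prints the three commands in order. Note that in dry-run mode generate does not actually create a new segment, so the fetch line shows whichever segment already exists.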

I notice in the initial crawl log file that fetching starts with the original URL for the crawl:

>> Generator: starting <<
>> Generator: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211102806 <<
>> Generator: Selecting best-scoring urls due for fetch. <<
>> Generator: Partitioning selected urls by host, for politeness. <<
>> Generator: done. <<
>> Fetcher: starting <<
>> Fetcher: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211102806 <<
>> Fetcher: threads: 10 <<
>> fetching http://www.saic.com/ <<
>> Fetcher: done <<

But the recrawl starts with a different url.
>> Generator: starting <<
>> Generator: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211103350 <<
>> Generator: Selecting best-scoring urls due for fetch. <<
>> Generator: Partitioning selected urls by host, for politeness. <<
>> Generator: done. <<
>> ** segment = '/opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211103350' ** <<
>> Fetcher: starting <<
>> Fetcher: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211103350 <<
>> Fetcher: threads: 10 <<
>> fetching http://www.saic.com/employees/).join( <<

Shouldn't the first fetched url be the same to get the same results?

Plus, when the crawls are done, the initial crawl had two segment directories under crawl/segments:

[EMAIL PROTECTED] segments]$ ls
20061211102806/  20061211102815/

But the recrawl had three:
[EMAIL PROTECTED] segments]$ ls
20061211102806/  20061211102815/  20061211105226/

And if I force a recrawl of everything (just to test it out), shouldn't it get the same number of documents and segment directories?
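
One way to compare the two runs is to count the segment directories and look at the crawldb summary after each run. The following is a minimal sketch, assuming that readdb with -stats is available in nutch-0.8.1 and reports URL counts by fetch status; count_segments is a hypothetical helper, not part of Nutch.

```shell
#!/bin/sh
# Paths copied from the commands earlier in this message.
NUTCH=/opt/webdevel/pfna/nutch-0.8.1/bin/nutch
CRAWL=/opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl

# count_segments DIR: number of directories directly under DIR.
count_segments() {
  n=0
  for d in "$1"/*/; do
    # An unmatched glob stays literal and fails the -d test, so it is not counted.
    if [ -d "$d" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

# Only touch the real crawl if it exists on this machine.
if [ -x "$NUTCH" ] && [ -d "$CRAWL/crawldb" ]; then
  echo "segments: $(count_segments "$CRAWL/segments")"
  # Assumed to print a crawldb summary, including counts by fetch status.
  "$NUTCH" readdb "$CRAWL/crawldb" -stats
fi
```

Running this once after the initial crawl and once after the recrawl would show whether the crawldb already held more URLs than the 18 that were fetched the first time.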


NANCY
