Hi Nancy,

Instead of recrawling, you are actually continuing the initial crawl.

I suspect that after your initial crawl you have 18 fetched urls in your crawldb. However, there are probably also a lot of unfetched urls (the outlinks from depth 2).
You can use 'nutch readdb' to inspect your crawldb.
At this moment, you'll have two segments (one for each depth).
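
For example, a minimal sketch of checking the status counts in your crawldb. The paths are the ones from the crawl command quoted below; the guard around the call is only my addition so the snippet fails gently if nutch lives somewhere else:

```shell
# Paths taken from the crawl command quoted below; adjust if yours differ.
NUTCH=/opt/webdevel/pfna/nutch-0.8.1/bin/nutch
CRAWLDB=/opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/crawldb

if [ -x "$NUTCH" ]; then
    # -stats prints a summary of the crawldb, including url counts per status
    "$NUTCH" readdb "$CRAWLDB" -stats
else
    echo "nutch not found at $NUTCH" >&2
fi
```

The summary should show how many urls are fetched and how many are still unfetched. If you want to look at individual urls and their scores, 'readdb <crawldb> -dump <out_dir>' writes the entries out as text.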

With your recrawl command you are telling Nutch to pick the 100 best-scoring urls in the crawldb that are due for fetch. This might include the 18 urls which were fetched in the initial crawl, since you used -adddays 31, but it will also include a lot of the unfetched outlinks from the initial crawl. The scoring determines which url comes first. If you have not installed your own scoring plugin, Nutch uses the OPIC scoring filter. It is possible (even likely) that your start url is not on top.
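
To illustrate the selection step, here is a toy model in shell. It is not Nutch's actual code: the urls other than your start url and all of the scores are made up, and it assumes every entry is due for fetch. It simply sorts by score and keeps the top N:

```shell
# Toy crawldb, one "<url> <score>" pair per line.
# The scores and the /about/ and /news/ urls are invented for illustration.
TOPN=2
crawldb_entries() {
    cat <<'EOF'
http://www.saic.com/ 0.5
http://www.saic.com/about/ 1.2
http://www.saic.com/news/ 0.9
EOF
}

# Sort on the score (field 2), highest first, and keep the TOPN best-scoring urls.
SELECTED=$(crawldb_entries | LC_ALL=C sort -k2,2nr | head -n "$TOPN" | cut -d' ' -f1)
echo "$SELECTED"
```

With these made-up scores the start url does not make the cut, because two outlinks happen to score higher.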

The second crawl does one cycle and therefore adds one extra segment, which makes three in total.

Mathijs


Nancy Snyder wrote:
Hi
 I am using nutch-0.8.1 and copied the recrawl script from the web.

I did a simple crawl on url http://www.saic.com at depth 2 with -topN 100 and got 18 records. But when I do a recrawl with -topN 100 and -adddays 31 (forcing all pages to be refetched), I get 132 documents. The initial crawl is fast. And then I do a recrawl (just for testing purposes) and it takes a lot longer and I get lots more documents.

My initial crawl command was:
/opt/webdevel/pfna/nutch-0.8.1/bin/nutch crawl /opt/webdevel/pfna/lucene_crawls/nsnyder/saic -dir /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl -depth 2 -topN 100

and the url file is called nutch with
   http://www.saic.com
in it.

And the recrawl command:
/opt/webdevel/pfna/nutch-0.8.1/bin/nutch generate /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/crawldb /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments -topN 100 -adddays 31

I notice in the initial crawl log file, the fetching starts with the original url for the crawl.
>> Generator: starting <<
>> Generator: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211102806 <<
>> Generator: Selecting best-scoring urls due for fetch. <<
>> Generator: Partitioning selected urls by host, for politeness. <<
>> Generator: done. <<
>> Fetcher: starting <<
>> Fetcher: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211102806 <<
>> Fetcher: threads: 10 <<
>> fetching http://www.saic.com/ <<
>> Fetcher: done <<

But the recrawl starts with a different url.
>> Generator: starting <<
>> Generator: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211103350 <<
>> Generator: Selecting best-scoring urls due for fetch. <<
>> Generator: Partitioning selected urls by host, for politeness. <<
>> Generator: done. <<
>> ** segment = '/opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211103350' ** <<
>> Fetcher: starting <<
>> Fetcher: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211103350 <<
>> Fetcher: threads: 10 <<
>> fetching http://www.saic.com/employees/).join( <<

Shouldn't the first fetched url be the same to get the same results?

Plus when the crawls are done, the initial crawl had two segment directories under crawl/segments:
[EMAIL PROTECTED] segments]$ ls
20061211102806/  20061211102815/

But the recrawl had three:
[EMAIL PROTECTED] segments]$ ls
20061211102806/  20061211102815/  20061211105226/

And if I force a recrawl of everything (just to test it out), shouldn't it get the same number of documents and segment directories??

NANCY
