Hi,
I am using nutch-0.8.1 and copied the recrawl script from the web.

I did a simple crawl of http://www.saic.com at depth 2 with -topN 100 and got 18 records. But when I do a recrawl with -topN 100 and -adddays 31 (forcing all pages to be refetched), I get 132 documents. The initial crawl is fast, but the recrawl (which I am doing just for testing purposes) takes a lot longer and returns many more documents.

My initial crawl command was:
/opt/webdevel/pfna/nutch-0.8.1/bin/nutch crawl /opt/webdevel/pfna/lucene_crawls/nsnyder/saic -dir /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl -depth 2 -topN 100

and the url file is called nutch with
   http://www.saic.com
in it.

And the recrawl command:
/opt/webdevel/pfna/nutch-0.8.1/bin/nutch generate /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/crawldb /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments -topN 100 -adddays 31

I notice that in the initial crawl log file, fetching starts with the original url for the crawl:
>> Generator: starting <<
>> Generator: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211102806 <<
>> Generator: Selecting best-scoring urls due for fetch. <<
>> Generator: Partitioning selected urls by host, for politeness. <<
>> Generator: done. <<
>> Fetcher: starting <<
>> Fetcher: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211102806 <<
>> Fetcher: threads: 10 <<
>> fetching http://www.saic.com/ <<
>> Fetcher: done <<

But the recrawl starts with a different url.
>> Generator: starting <<
>> Generator: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211103350 <<
>> Generator: Selecting best-scoring urls due for fetch. <<
>> Generator: Partitioning selected urls by host, for politeness. <<
>> Generator: done. <<
>> ** segment = '/opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211103350' ** <<
>> Fetcher: starting <<
>> Fetcher: segment: /opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211103350 <<
>> Fetcher: threads: 10 <<
>> fetching http://www.saic.com/employees/).join( <<

Shouldn't the first fetched url be the same to get the same results?
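
One way to compare the two runs would be to pull the fetched URLs out of each log and diff the lists. This is only a rough sketch: the helper name fetched_urls and the log file names crawl.log and recrawl.log are hypothetical, and it assumes each fetch is logged on a line of the form "fetching <url>" as in the excerpts above.

```shell
#!/bin/sh
# Print every URL that appears on a "fetching <url>" line of a Nutch log.
# Works with or without the ">> ... <<" wrapping shown in this message.
# (fetched_urls is a hypothetical helper, not part of Nutch.)
fetched_urls() {
    sed -n 's/.*fetching \([^ ]*\).*/\1/p' "$1"
}

# Hypothetical usage, assuming the two runs were logged to crawl.log and recrawl.log:
#   fetched_urls crawl.log   | sort -u > crawl.urls
#   fetched_urls recrawl.log | sort -u > recrawl.urls
#   wc -l crawl.urls recrawl.urls       # how many distinct URLs each run fetched
#   comm -13 crawl.urls recrawl.urls    # URLs fetched only by the recrawl
```

The comm output would show which URLs appear only in the recrawl, which might help explain the jump from 18 to 132.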

Plus when the crawls are done, the initial crawl had two segment directories under crawl/segments:
[EMAIL PROTECTED] segments]$ ls
20061211102806/  20061211102815/

But the recrawl had three:
[EMAIL PROTECTED] segments]$ ls
20061211102806/  20061211102815/  20061211105226/

And if I force a recrawl of everything (just to test it out), shouldn't it get the same number of documents and segment directories?

NANCY
