Hi Nancy,
Instead of recrawling, you are actually continuing the initial crawl.
I suspect that after your initial crawl you have 18 fetched urls in your
crawldb. However, there are probably also a lot of unfetched urls (the
outlinks from depth 2).
You can use 'nutch readdb' to inspect your crawldb.
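For example, here is a minimal sketch using the paths from your own commands. I am assuming the -stats option of readdb is available in your version. NUTCH_HOME and CRAWLDB are just shell variable names I picked for illustration.

```shell
# Paths copied from the commands in the original message.
NUTCH_HOME=/opt/webdevel/pfna/nutch-0.8.1
CRAWLDB=/opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/crawldb

# Print summary statistics for the crawldb (assumes readdb supports -stats).
if [ -x "$NUTCH_HOME/bin/nutch" ]; then
  "$NUTCH_HOME/bin/nutch" readdb "$CRAWLDB" -stats
else
  echo "nutch not found at $NUTCH_HOME/bin/nutch"
fi
```

The statistics should show how many urls are fetched and how many are still unfetched, which is the number to compare against your 18 records.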
At this moment, you'll have two segments (one for each depth).
With your recrawl command you are telling Nutch to fetch the 100 best
scoring unfetched urls from the crawldb. This might include the 18 urls
which were fetched in the initial crawl since you used -adddays 31, but
it will also include a lot of the unfetched outlinks from the initial crawl.
The scoring determines which url comes first. If you have not installed
your own scoring plugin, then it uses the OPIC scoring filter. It is
possible (even likely) that your start url is not on top.
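To illustrate the idea only, here is a toy sketch of "take the topN best-scoring urls". This is not Nutch code. The page-a, page-b and page-c urls and all of the scores are invented for the example, and the real generator also looks at fetch times and partitions by host.

```shell
# Hypothetical crawldb entries as "url score" (scores are made up).
entries='http://www.saic.com/ 1.0
http://www.saic.com/page-a 1.4
http://www.saic.com/page-b 0.2
http://www.saic.com/page-c 1.1'

top_n=2

# Sort by score, highest first, and keep the first top_n urls.
selected=$(printf '%s\n' "$entries" | LC_ALL=C sort -k2,2nr | head -n "$top_n" | cut -d' ' -f1)
printf '%s\n' "$selected"
```

With these invented scores the start url is not selected at all, because two of the outlinks score higher.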
The second crawl does one cycle and results in one extra segment, which
makes three in total.
Mathijs
Nancy Snyder wrote:
Hi
I am using nutch-0.8.1 and copied the recrawl script from the web.
I did a simple crawl on url http://www.saic.com at depth 2 with -topN
100 and got 18 records.
But when I do a recrawl with -topN 100 and -adddays 31 (forcing all
pages to be refetched), I
get 132 documents. The initial crawl is fast. And then I do a
recrawl (just for testing purposes) and
it takes a lot longer and I get lots more documents.
My initial crawl command was:
/opt/webdevel/pfna/nutch-0.8.1/bin/nutch crawl
/opt/webdevel/pfna/lucene_crawls/nsnyder/saic -dir
/opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl -depth 2 -topN 100
and the url file is called nutch with
http://www.saic.com
in it.
And the recrawl command:
/opt/webdevel/pfna/nutch-0.8.1/bin/nutch generate
/opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/crawldb
/opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments -topN 100
-adddays 31
I notice in the initial crawl log file, the fetching starts with the
original url for the crawl.
>> Generator: starting <<
>> Generator: segment:
/opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211102806
<<
>> Generator: Selecting best-scoring urls due for fetch. <<
>> Generator: Partitioning selected urls by host, for politeness. <<
>> Generator: done. <<
>> Fetcher: starting <<
>> Fetcher: segment:
/opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211102806
<<
>> Fetcher: threads: 10 <<
>> fetching http://www.saic.com/ <<
>> Fetcher: done <<
But the recrawl starts with a different url.
>> Generator: starting <<
>> Generator: segment:
/opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211103350
<<
>> Generator: Selecting best-scoring urls due for fetch. <<
>> Generator: Partitioning selected urls by host, for politeness. <<
>> Generator: done. <<
>> ** segment =
'/opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211103350'
** <<
>> Fetcher: starting <<
>> Fetcher: segment:
/opt/webdevel/pfna/lucene_crawls/nsnyder/saic/crawl/segments/20061211103350
<<
>> Fetcher: threads: 10 <<
>> fetching http://www.saic.com/employees/).join( <<
Shouldn't the first fetched url be the same to get the same results?
Plus when the crawls are done, the initial crawl had two segment
directories under crawl/segments:
[EMAIL PROTECTED] segments]$ ls
20061211102806/ 20061211102815/
But the recrawl had three:
[EMAIL PROTECTED] segments]$ ls
20061211102806/ 20061211102815/ 20061211105226/
And if I force a recrawl of everything (just to test it out),
shouldn't it get the same number of documents and segment directories?
NANCY