Re: Nutch indexes less pages, then it fetches

caezar Wed, 28 Oct 2009 07:14:11 -0700

I've compared the segment data of a URL that has no redirect and was
indexed correctly with that of this "bad" URL, and there really is a
difference. The first one has a db record in the segment:
Crawl Generate::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Wed Oct 28 16:01:05 EET 2009
Modified time: Thu Jan 01 02:00:00 EET 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1256738472613
 
But the second one has no such record, which seems reasonable: it was not
added to the segment at the generate stage, it was added at the fetch stage.
Is this a bug in Nutch? Or am I missing some configuration option?
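
For anyone following along, here is a minimal Java sketch of the guard quoted
below. It is not the actual Nutch code: IndexGuardSketch, UrlParts and
wouldBeIndexed are hypothetical names, and plain strings stand in for the
CrawlDatum, ParseText and ParseData objects. It only illustrates why a URL
that has fetch and parse data but no db record is skipped.

```java
class IndexGuardSketch {

    // Hypothetical stand-in for the four per-URL values checked by the quoted guard.
    // Strings replace the real Nutch objects purely for brevity.
    static final class UrlParts {
        final String fetchDatum;
        final String dbDatum;
        final String parseText;
        final String parseData;

        UrlParts(String fetchDatum, String dbDatum, String parseText, String parseData) {
            this.fetchDatum = fetchDatum;
            this.dbDatum = dbDatum;
            this.parseText = parseText;
            this.parseData = parseData;
        }
    }

    // Same condition as the quoted guard, inverted:
    // true only when all four parts are present.
    static boolean wouldBeIndexed(UrlParts p) {
        return p.fetchDatum != null && p.dbDatum != null
            && p.parseText != null && p.parseData != null;
    }

    public static void main(String[] args) {
        // A URL with a complete set of records.
        UrlParts good = new UrlParts("fetch", "db", "text", "data");
        // The redirected URL from this thread: everything present except the db record.
        UrlParts redirected = new UrlParts("fetch", null, "text", "data");

        System.out.println("good URL indexed: " + wouldBeIndexed(good));
        System.out.println("redirected URL indexed: " + wouldBeIndexed(redirected));
    }
}
```

The point of the sketch is that a single missing part is enough to drop the
page, no matter how complete the fetch and parse output is.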


caezar wrote:
> 
> I'm pretty sure that I ran both commands before indexing
> 
> Andrzej Bialecki wrote:
>> 
>> caezar wrote:
>>> Some more information. While debugging the reduce method, I noticed that
>>> before the code
>>>     if (fetchDatum == null || dbDatum == null
>>>         || parseText == null || parseData == null) {
>>>       return;                                     // only have inlinks
>>>     }
>>> my page has fetchDatum, parseText and parseData not null, but dbDatum is
>>> null. That's why it's skipped :)
>>> Any ideas about the reason?
>> 
>> Yes - you should run updatedb with this segment, and also run 
>> invertlinks with this segment, _before_ trying to index. Otherwise the 
>> db status won't be updated properly.
>> 
>> 
>> -- 
>> Best regards,
>> Andrzej Bialecki     <><
>>   ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>> 
>> 
>> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Nutch-indexes-less-pages%2C-then-it-fetches-tp26078798p26095338.html
Sent from the Nutch - User mailing list archive at Nabble.com.
