Re: Nutch indexes less pages, then it fetches

caezar Wed, 28 Oct 2009 08:25:30 -0700

The crawl db shows:
Status: 5 (db_redir_perm) for the redirect source
and
Status: 2 (db_fetched) for the redirect target
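
For reference, the check from IndexerMapReduce quoted further down in this thread can be modelled in isolation. This is only a rough sketch: the class name IndexGateSketch and the method wouldBeIndexed are made up for illustration, and plain Objects stand in for the real Nutch datum types. It only shows that a single null dbDatum is enough to skip a page, even when the fetch and parse parts are all present.

```java
// Hypothetical stand-in for illustration; this is NOT the actual Nutch code.
final class IndexGateSketch {

    // Mirrors the quoted condition: a page is passed on to indexing only
    // when all four parts are non-null. Parameter names follow the snippet
    // in the thread; the Object types are an assumption.
    static boolean wouldBeIndexed(Object fetchDatum, Object dbDatum,
                                  Object parseText, Object parseData) {
        return !(fetchDatum == null || dbDatum == null
                || parseText == null || parseData == null);
    }

    public static void main(String[] args) {
        // Page with all four parts present.
        System.out.println("all parts present: "
                + wouldBeIndexed("fetch", "db", "text", "data"));
        // Page as described in the thread: fetched and parsed, but dbDatum is null.
        System.out.println("dbDatum missing:   "
                + wouldBeIndexed("fetch", null, "text", "data"));
    }
}
```

The sketch says nothing about why dbDatum ends up null for the redirect target; it only isolates the condition that causes the skip.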


reinhard schwab wrote:
> 
> what is in the crawl db?
> 
> reinh...@thord:>bin/nutch readdb  <crawldb> -url <url>
> 
> 
> caezar wrote:
>> No, the problem is not solved. Everything happens as you described, but the
>> page is not indexed because of this condition:
>>     if (fetchDatum == null || dbDatum == null
>>         || parseText == null || parseData == null) {
>>       return;                                     // only have inlinks
>>     }
>> in IndexerMapReduce code. For this page dbDatum is null, so it is not
>> indexed!
>>
>> reinhard schwab wrote:
>>   
>>> Is your problem solved now?
>>>
>>> This can be OK.
>>> Newly discovered URLs are added to a segment when fetched documents
>>> are parsed, provided these URLs pass the filters.
>>> They will not have a Generate crawl datum because they are unknown until
>>> they are extracted.
>>>
>>> regards
>>>
>>> caezar wrote:
>>>     
>>>> I've compared the segment data of a URL that has no redirect and was
>>>> indexed correctly with that of this "bad" URL, and there really is a
>>>> difference.
>>>> The first one has a db record in the segment:
>>>> Crawl Generate::
>>>> Version: 7
>>>> Status: 1 (db_unfetched)
>>>> Fetch time: Wed Oct 28 16:01:05 EET 2009
>>>> Modified time: Thu Jan 01 02:00:00 EET 1970
>>>> Retries since fetch: 0
>>>> Retry interval: 2592000 seconds (30 days)
>>>> Score: 1.0
>>>> Signature: null
>>>> Metadata: _ngt_: 1256738472613
>>>>  
>>>> But the second one has no such record, which seems reasonable: it was not
>>>> added to the segment at the generate stage, it was added at the fetch
>>>> stage.
>>>> Is this a bug in Nutch? Or am I missing some configuration option?
>>>>
>>>> caezar wrote:
>>>>   
>>>>       
>>>>> I'm pretty sure that I ran both commands before indexing
>>>>>
>>>>> Andrzej Bialecki wrote:
>>>>>     
>>>>>         
>>>>>> caezar wrote:
>>>>>>       
>>>>>>           
>>>>>>> Some more information. While debugging the reduce method I noticed
>>>>>>> that, just before this code:
>>>>>>>     if (fetchDatum == null || dbDatum == null
>>>>>>>         || parseText == null || parseData == null) {
>>>>>>>       return;                                     // only have inlinks
>>>>>>>     }
>>>>>>> my page has non-null fetchDatum, parseText and parseData, but dbDatum
>>>>>>> is null. That's why it's skipped :)
>>>>>>> Any ideas about the reason?
>>>>>>>         
>>>>>>>             
>>>>>> Yes - you should run updatedb with this segment, and also run
>>>>>> invertlinks with this segment, _before_ trying to index. Otherwise
>>>>>> the db status won't be updated properly.
>>>>>>
>>>>>>
>>>>>> -- 
>>>>>> Best regards,
>>>>>> Andrzej Bialecki     <><
>>>>>>   ___. ___ ___ ___ _ _   __________________________________
>>>>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>>>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>>>>> http://www.sigram.com  Contact: info at sigram dot com

-- 
View this message in context: 
http://www.nabble.com/Nutch-indexes-less-pages%2C-then-it-fetches-tp26078798p26096654.html
Sent from the Nutch - User mailing list archive at Nabble.com.
