Re: Nutch indexes fewer pages than it fetches

reinhard schwab Wed, 28 Oct 2009 08:10:53 -0700

What is in the crawl db for this URL? You can check with:

reinh...@thord:>bin/nutch readdb  <crawldb> -url <url>



caezar wrote:
> No, the problem is not solved. Everything happens as you described, but the
> page is not indexed because of this condition:
>     if (fetchDatum == null || dbDatum == null
>         || parseText == null || parseData == null) {
>       return;                                     // only have inlinks
>     }
> in the IndexerMapReduce code. For this page dbDatum is null, so it is not
> indexed!
>
> reinhard schwab wrote:
>   
>> Is your problem solved now?
>>
>> This can be OK.
>> Newly discovered URLs are added to a segment when fetched documents
>> are parsed, provided these URLs pass the filters.
>> They will not have a Generate crawl datum because they are unknown until
>> they are extracted.
>>
>> regards
>>
>> caezar wrote:
>>     
>>> I've compared the segment data of a URL that has no redirect and was
>>> indexed correctly with that of this "bad" URL, and there really is a
>>> difference. The first one has a db record in the segment:
>>> Crawl Generate::
>>> Version: 7
>>> Status: 1 (db_unfetched)
>>> Fetch time: Wed Oct 28 16:01:05 EET 2009
>>> Modified time: Thu Jan 01 02:00:00 EET 1970
>>> Retries since fetch: 0
>>> Retry interval: 2592000 seconds (30 days)
>>> Score: 1.0
>>> Signature: null
>>> Metadata: _ngt_: 1256738472613
>>>  
>>> But the second one has no such record, which seems reasonable: it was not
>>> added to the segment at the generate stage, it was added at the fetch stage.
>>> Is this a bug in Nutch? Or am I missing some configuration option?
>>>
>>> caezar wrote:
>>>   
>>>       
>>>> I'm pretty sure that I ran both commands before indexing
>>>>
>>>> Andrzej Bialecki wrote:
>>>>     
>>>>         
>>>>> caezar wrote:
>>>>>       
>>>>>           
>>>>>> Some more information. While debugging the reduce method, I noticed
>>>>>> that just before this code:
>>>>>>     if (fetchDatum == null || dbDatum == null
>>>>>>         || parseText == null || parseData == null) {
>>>>>>       return;                                     // only have inlinks
>>>>>>     }
>>>>>> my page has non-null fetchDatum, parseText and parseData, but dbDatum
>>>>>> is null. That's why it's skipped :)
>>>>>> Any ideas about the reason?
>>>>>>         
>>>>>>             
>>>>> Yes - you should run updatedb with this segment, and also run 
>>>>> invertlinks with this segment, _before_ trying to index. Otherwise the 
>>>>> db status won't be updated properly.
>>>>>
>>>>>
>>>>> -- 
>>>>> Best regards,
>>>>> Andrzej Bialecki     <><
>>>>>   ___. ___ ___ ___ _ _   __________________________________
>>>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>>>> http://www.sigram.com  Contact: info at sigram dot com
>>>>>
>>>>>
>>>>>
>>>>>       
>>>>>           
>>>>     
>>>>         
>>>   
>>>       
>>
>>     
>
>
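
The condition quoted in this thread can be illustrated with a small standalone sketch. This is not Nutch code: `IndexerCheckSketch`, `UrlRecord`, `isIndexable` and `missingParts` are hypothetical names, and plain `Object` values stand in for the real Nutch types. It only mirrors the quoted null check and reports which of the four parts is missing for a URL, which is how a page with a null dbDatum ends up skipped.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexerCheckSketch {

    /** Hypothetical holder for the four per-URL values tested by the quoted condition. */
    public static final class UrlRecord {
        final Object fetchDatum;
        final Object dbDatum;
        final Object parseText;
        final Object parseData;

        public UrlRecord(Object fetchDatum, Object dbDatum,
                         Object parseText, Object parseData) {
            this.fetchDatum = fetchDatum;
            this.dbDatum = dbDatum;
            this.parseText = parseText;
            this.parseData = parseData;
        }
    }

    /** Mirrors the quoted condition: a URL is indexed only if all four parts are present. */
    public static boolean isIndexable(UrlRecord r) {
        return r.fetchDatum != null && r.dbDatum != null
                && r.parseText != null && r.parseData != null;
    }

    /** Lists the parts that are null, to show why a URL would be skipped. */
    public static List<String> missingParts(UrlRecord r) {
        List<String> missing = new ArrayList<>();
        if (r.fetchDatum == null) missing.add("fetchDatum");
        if (r.dbDatum == null) missing.add("dbDatum");
        if (r.parseText == null) missing.add("parseText");
        if (r.parseData == null) missing.add("parseData");
        return missing;
    }

    public static void main(String[] args) {
        // A page that was generated, fetched and parsed normally.
        UrlRecord complete = new UrlRecord("fetch", "db", "text", "data");
        // The case from this thread: everything present except dbDatum.
        UrlRecord noDbDatum = new UrlRecord("fetch", null, "text", "data");

        System.out.println("complete: indexable=" + isIndexable(complete));
        System.out.println("no dbDatum: indexable=" + isIndexable(noDbDatum)
                + ", missing=" + missingParts(noDbDatum));
    }
}
```

When dbDatum is the only missing part, as in the second record, the readdb command shown at the top of this message is the natural next check, since it shows whether the crawl db has an entry for that URL at all.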
