Re: Nutch indexes less pages, then it fetches

reinhard schwab Wed, 28 Oct 2009 04:34:51 -0700

yes, it's permanently redirected.
you can also check the segment status of this url.
here is an example:


reinh...@thord:>bin/nutch  readseg -get crawl/segments/20091028122455
"http://www.krems.at/fotoalbum/fotoalbum.asp?albumid=37&big=1&seitenid=20";

it will show you whether it is parsed and the extracted outlinks.
it will show any data related to this url stored in the segment.
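
If you want to check many urls at once, the status lines can be pulled out of the tool output with a small script. This is only a rough sketch in Python, not part of nutch: it assumes the output contains lines of the form `Status: <code> (<name>)`, as in the readdb output quoted further down in this thread. The function names and the notes table are made up for illustration, the notes only paraphrase what is said in this thread, and the printed name `db_notmodified` for STATUS_DB_NOTMODIFIED is an assumption.

```python
import re

# Matches lines such as "Status: 5 (db_redir_perm)".
STATUS_RE = re.compile(r"^\s*Status:\s*(\d+)\s*\((\w+)\)", re.MULTILINE)

# Hypothetical notes, paraphrasing the explanations given in this thread.
# "db_notmodified" is an assumed printed name for STATUS_DB_NOTMODIFIED.
NOTES = {
    "db_redir_perm": "permanently redirected, check the redirect target",
    "db_notmodified": "recrawled and the signature did not change",
    "db_fetched": "fetched, check the segment and the indexing filters",
}


def parse_statuses(text):
    """Return a (code, name) pair for every Status line found in text."""
    return [(int(code), name) for code, name in STATUS_RE.findall(text)]


def explain(text):
    """Return one readable line per status found in text."""
    return [
        "%d %s: %s" % (code, name, NOTES.get(name, "no note for this status"))
        for code, name in parse_statuses(text)
    ]


if __name__ == "__main__":
    # Sample text copied from the output quoted in this thread.
    sample = (
        "Status: 5 (db_redir_perm)\n"
        "Metadata: _pst_: moved(12), lastModified=0:\n"
    )
    for line in explain(sample):
        print(line)
```

To use it, save the output of the readdb or readseg command to a file and pass the file contents to `explain`.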

regards

caezar schrieb:
> Thanks, that was really helpful. I've moved forward but still have not found the
> solution.
> So the status of the initial URL
> (http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm) is:
> Status: 5 (db_redir_perm)
> Metadata: _pst_: moved(12), lastModified=0:
> http://www.1stdirectory.com/Companies/1627406_Darwins_Catering_Limited.htm
>
> So that answers the question of why the initial page was not indexed: it
> was redirected.
> Now checking the status of the redirect target:
> Status: 2 (db_fetched)
>
> So it was successfully fetched. But according to the indexing log, it still was
> not sent to the indexer!
>
>
>
> reinhard schwab wrote:
>   
>> what is the db status of this url in your crawl db?
>> if it is STATUS_DB_NOTMODIFIED,
>> then that may be the reason.
>> you can check it by looking the url up in your crawl db with:
>> reinh...@thord:>bin/nutch readdb  <crawldb> -url <url>
>>
>> it has this status if it is recrawled and the signature does not change.
>> the signature is an MD5 hash of the content.
>>
>> another reason may be that you have some indexing filters.
>> i don't believe that's the reason here.
>>
>> regards
>>
>>
>> kevin chen schrieb:
>>     
>>> I have had a similar experience.
>>>
>>> Reinhard Schwab responded with a possible fix.  See the mail in this group from
>>> Reinhard Schwab at
>>> Sun, 25 Oct 2009 10:03:41 +0100  (05:03 EDT)
>>>
>>> I haven't had a chance to try it out yet.
>>>  
>>> On Tue, 2009-10-27 at 07:34 -0700, caezar wrote:
>>>   
>>>       
>>>> Hi All,
>>>>
>>>> I've got a strange problem: nutch indexes far fewer URLs than it
>>>> fetches. For example, this URL:
>>>> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm.
>>>> I assume that it was fetched successfully because it is mentioned only
>>>> once in the fetch logs:
>>>> 2009-10-26 10:01:46,502 INFO org.apache.nutch.fetcher.Fetcher: fetching
>>>> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm
>>>>
>>>> But it was not sent to the indexer in the indexing phase (I'm using a custom
>>>> NutchIndexWriter and it logs every page for which its write method is
>>>> executed). What could be the reason? Is there a way to browse the
>>>> crawldb
>>>> to make sure that the page was really fetched? What else could I check?
>>>>
>>>> Thanks
>>>>     
>>>>         
>>>   
>>>       
>>
>>     
>
>
