Re: Nutch indexes less pages, then it fetches

reinhard schwab Wed, 28 Oct 2009 02:29:37 -0700

what is the db status of this url in your crawl db?
if it is STATUS_DB_NOTMODIFIED,
then that may be the reason.
you can check it by looking up the url in your crawl db with
reinh...@thord:>bin/nutch readdb  <crawldb> -url <url>


a url gets this status if it is recrawled and its signature does not change.
the signature is an MD5 hash of the content.
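
to illustrate the idea, here is a minimal sketch in python. it is not nutch code: the function names and the status labels "not_modified" and "fetched" are made up for this example, and it only assumes that the signature is an MD5 hash over the raw page content.

```python
import hashlib


def signature(content: bytes) -> str:
    # MD5 over the raw page bytes, standing in for the crawl signature
    return hashlib.md5(content).hexdigest()


def recrawl_status(old_signature: str, new_content: bytes) -> str:
    # hypothetical status labels, not nutch constants
    if signature(new_content) == old_signature:
        return "not_modified"
    return "fetched"


first_fetch = b"<html><body>Catering Limited</body></html>"
stored = signature(first_fetch)

# same bytes on recrawl: the signature matches
print(recrawl_status(stored, first_fetch))
# a single changed byte gives a different signature
print(recrawl_status(stored, first_fetch + b" "))
```

so a page whose content is byte-for-byte the same on recrawl keeps the same signature, and that is the case in which the not modified status would apply.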

another reason may be that you have some indexing filters.
i don't believe that is the reason here.

regards


kevin chen schrieb:
> I have similar experience.
>
> Reinhard schwab responded with a possible fix.  See the mail in this group from
> Reinhard schwab at
> Sun, 25 Oct 2009 10:03:41 +0100  (05:03 EDT)
>
> I haven't had a chance to try it out yet.
>  
> On Tue, 2009-10-27 at 07:34 -0700, caezar wrote:
>   
>> Hi All,
>>
>> I've got a strange problem: nutch indexes far fewer URLs than it
>> fetches. For example, this URL:
>> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm.
>> I assume that it was fetched successfully because it is mentioned only
>> once in the fetch logs:
>> 2009-10-26 10:01:46,502 INFO org.apache.nutch.fetcher.Fetcher: fetching
>> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm
>>
>> But it was not sent to the indexer in the indexing phase (I'm using a custom
>> NutchIndexWriter and it logs every page for which its write method
>> is executed). What could be the reason? Is there a way to browse the crawldb
>> to ensure that the page was really fetched? What else could I check?
>>
>> Thanks
>>     
>
>
>
