What is in the crawl db?

reinh...@thord:>bin/nutch readdb <crawldb> -url <url>

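For readers following along, here is a minimal stand-alone sketch of the check quoted in this thread. It is not the Nutch source: the class `IndexSkipSketch` and the method `wouldIndex` are hypothetical names, and plain `Object` values stand in for the real datum types. It only illustrates the assumption behind the question above: if no crawl db entry for the URL reaches the reducer, dbDatum stays null and the page is skipped even though it was fetched and parsed.

```java
// Hypothetical stand-in for the condition quoted from IndexerMapReduce; not Nutch code.
public class IndexSkipSketch {

    /** Returns true only when all four inputs for a URL are present. */
    public static boolean wouldIndex(Object fetchDatum, Object dbDatum,
                                     Object parseText, Object parseData) {
        if (fetchDatum == null || dbDatum == null
                || parseText == null || parseData == null) {
            return false; // mirrors the early "return" in the quoted code
        }
        return true;
    }

    public static void main(String[] args) {
        Object present = new Object();
        // URL that was fetched and parsed, but has no crawl db record.
        System.out.println(wouldIndex(present, null, present, present));    // prints "false"
        // Same URL once a crawl db record is available.
        System.out.println(wouldIndex(present, present, present, present)); // prints "true"
    }
}
```

If the readdb command above reports no entry for the URL, that would be consistent with the first case.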
caezar wrote:
> No, problem is not solved. Everything happens as you described, but page is
> not indexed, because of condition:
>     if (fetchDatum == null || dbDatum == null
>         || parseText == null || parseData == null) {
>       return;                                     // only have inlinks
>     }
> in IndexerMapReduce code. For this page dbDatum is null, so it is not
> indexed!
>
> reinhard schwab wrote:
>
>> is your problem solved now???
>>
>> this can be ok.
>> new discovered urls will be added to a segment when fetched documents
>> are parsed and if these urls pass the filters.
>> they will not have a crawl datum Generate because they are unknown until
>> they are extracted.
>>
>> regards
>>
>> caezar wrote:
>>
>>> I've compared the segments data of the URL which have no redirect and was
>>> indexed correctly, with this "bad" URL, and there is really a difference.
>>> First one have db record in the segment:
>>> Crawl Generate::
>>> Version: 7
>>> Status: 1 (db_unfetched)
>>> Fetch time: Wed Oct 28 16:01:05 EET 2009
>>> Modified time: Thu Jan 01 02:00:00 EET 1970
>>> Retries since fetch: 0
>>> Retry interval: 2592000 seconds (30 days)
>>> Score: 1.0
>>> Signature: null
>>> Metadata: _ngt_: 1256738472613
>>>
>>> But second one have no such record, which seems pretty fine: it was not
>>> added to the segment on generate stage, it was added on the fetch stage.
>>> Is this a bug in Nutch? Or I'm missing some configuration option?
>>>
>>> caezar wrote:
>>>
>>>> I'm pretty sure that I ran both commands before indexing
>>>>
>>>> Andrzej Bialecki wrote:
>>>>
>>>>> caezar wrote:
>>>>>
>>>>>> Some more information. Debugging reduce method I've noticed, that
>>>>>> before code
>>>>>>     if (fetchDatum == null || dbDatum == null
>>>>>>         || parseText == null || parseData == null) {
>>>>>>       return;                                     // only have inlinks
>>>>>>     }
>>>>>> my page has fetchDatum, parseText and parseData not null, but dbDatum
>>>>>> is null. Thats why it's skipped :)
>>>>>> Any ideas about the reason?
>>>>>>
>>>>> Yes - you should run updatedb with this segment, and also run
>>>>> invertlinks with this segment, _before_ trying to index. Otherwise the
>>>>> db status won't be updated properly.
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Andrzej Bialecki     <><
>>>>>  ___. ___ ___ ___ _ _   __________________________________
>>>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>>>> http://www.sigram.com  Contact: info at sigram dot com