The crawl db shows Status: 5 (db_redir_perm) for the redirect source and Status: 2 (db_fetched) for the redirect target.
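To make the condition quoted further down in this thread easier to follow, here is a minimal sketch of it as a standalone method. The class name IndexGuardSketch, the method shouldIndex and the placeholder string values are made up for illustration; this is not the real IndexerMapReduce code, only the same null check applied to four plain objects.

```java
// Simplified stand-in for the guard quoted in this thread.
// IndexGuardSketch and shouldIndex are hypothetical names, not Nutch classes.
final class IndexGuardSketch {

    // Returns true only when all four per-URL records are present.
    // The quoted condition returns early (no indexing) if any one is null.
    static boolean shouldIndex(Object fetchDatum, Object dbDatum,
                               Object parseText, Object parseData) {
        if (fetchDatum == null || dbDatum == null
                || parseText == null || parseData == null) {
            return false; // the quoted code returns here without indexing
        }
        return true;
    }

    public static void main(String[] args) {
        // Placeholder values stand in for the real records (assumption).
        // Case from this thread: fetch and parse data exist, dbDatum is null.
        boolean redirectTarget = shouldIndex("fetch", null, "text", "data");
        // Case of a page where all four records are present.
        boolean normalPage = shouldIndex("fetch", "db", "text", "data");

        System.out.println("redirect target indexed: " + redirectTarget);
        System.out.println("normal page indexed: " + normalPage);
    }
}
```

With dbDatum missing the check fails no matter what the fetch and parse records contain, which matches what I see in the debugger for the redirect target.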

reinhard schwab wrote:
> what is in the crawl db?
>
> reinh...@thord:>bin/nutch readdb <crawldb> -url <url>
>
> caezar wrote:
>> No, problem is not solved. Everything happens as you described, but page is
>> not indexed, because of condition:
>>     if (fetchDatum == null || dbDatum == null
>>         || parseText == null || parseData == null) {
>>       return;    // only have inlinks
>>     }
>> in IndexerMapReduce code. For this page dbDatum is null, so it is not
>> indexed!
>>
>> reinhard schwab wrote:
>>> is your problem solved now???
>>>
>>> this can be ok.
>>> new discovered urls will be added to a segment when fetched documents
>>> are parsed and if these urls pass the filters.
>>> they will not have a crawl datum Generate because they are unknown until
>>> they are extracted.
>>>
>>> regards
>>>
>>> caezar wrote:
>>>> I've compared the segments data of the URL which have no redirect and was
>>>> indexed correctly, with this "bad" URL, and there is really a difference.
>>>> First one have db record in the segment:
>>>> Crawl Generate::
>>>> Version: 7
>>>> Status: 1 (db_unfetched)
>>>> Fetch time: Wed Oct 28 16:01:05 EET 2009
>>>> Modified time: Thu Jan 01 02:00:00 EET 1970
>>>> Retries since fetch: 0
>>>> Retry interval: 2592000 seconds (30 days)
>>>> Score: 1.0
>>>> Signature: null
>>>> Metadata: _ngt_: 1256738472613
>>>>
>>>> But second one have no such record, which seems pretty fine: it was not
>>>> added to the segment on generate stage, it was added on the fetch stage.
>>>> Is this a bug in Nutch? Or I'm missing some configuration option?
>>>>
>>>> caezar wrote:
>>>>> I'm pretty sure that I ran both commands before indexing
>>>>>
>>>>> Andrzej Bialecki wrote:
>>>>>> caezar wrote:
>>>>>>> Some more information. Debugging reduce method I've noticed, that
>>>>>>> before code
>>>>>>>     if (fetchDatum == null || dbDatum == null
>>>>>>>         || parseText == null || parseData == null) {
>>>>>>>       return;    // only have inlinks
>>>>>>>     }
>>>>>>> my page has fetchDatum, parseText and parseData not null, but dbDatum
>>>>>>> is null. Thats why it's skipped :)
>>>>>>> Any ideas about the reason?
>>>>>>
>>>>>> Yes - you should run updatedb with this segment, and also run
>>>>>> invertlinks with this segment, _before_ trying to index. Otherwise the
>>>>>> db status won't be updated properly.
>>>>>>
>>>>>> --
>>>>>> Best regards,
>>>>>> Andrzej Bialecki
>>>>>> http://www.sigram.com Contact: info at sigram dot com

--
View this message in context: http://www.nabble.com/Nutch-indexes-less-pages%2C-then-it-fetches-tp26078798p26096654.html
Sent from the Nutch - User mailing list archive at Nabble.com.