I've compared the segment data of a URL that has no redirect and was indexed correctly with that of this "bad" URL, and there really is a difference. The first one has a db record in the segment:

Crawl Generate::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Wed Oct 28 16:01:05 EET 2009
Modified time: Thu Jan 01 02:00:00 EET 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1256738472613

The second one has no such record, which makes sense: it was not added to the segment at the generate stage; it was added at the fetch stage. Is this a bug in Nutch, or am I missing some configuration option?
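To make the symptom concrete, here is a minimal sketch of the skip condition quoted further down from the reduce method. It is not Nutch code: the class name, the enum, and the helper method are all hypothetical stand-ins, and it only models the rule that a URL is indexed when all four records are present for it.

```java
import java.util.Arrays;
import java.util.List;

public class IndexSkipSketch {
    // Hypothetical labels for the four records the quoted check looks at.
    enum Kind { DB_DATUM, FETCH_DATUM, PARSE_TEXT, PARSE_DATA }

    // Mirrors the quoted condition: if any of the four is missing, the URL is skipped.
    static boolean wouldBeIndexed(List<Kind> recordsForUrl) {
        boolean hasDb = recordsForUrl.contains(Kind.DB_DATUM);
        boolean hasFetch = recordsForUrl.contains(Kind.FETCH_DATUM);
        boolean hasText = recordsForUrl.contains(Kind.PARSE_TEXT);
        boolean hasData = recordsForUrl.contains(Kind.PARSE_DATA);
        return hasDb && hasFetch && hasText && hasData;
    }

    public static void main(String[] args) {
        // A URL that went through generate, fetch and parse (assumed case).
        List<Kind> normalUrl = Arrays.asList(
            Kind.DB_DATUM, Kind.FETCH_DATUM, Kind.PARSE_TEXT, Kind.PARSE_DATA);
        // The "bad" URL as I observe it: fetched and parsed, but no db record.
        List<Kind> badUrl = Arrays.asList(
            Kind.FETCH_DATUM, Kind.PARSE_TEXT, Kind.PARSE_DATA);

        System.out.println("normal URL indexed: " + wouldBeIndexed(normalUrl));
        System.out.println("bad URL indexed: " + wouldBeIndexed(badUrl));
    }
}
```

In this model the "bad" URL is dropped only because the db record is absent, which matches what I see in the debugger.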

caezar wrote:
>
> I'm pretty sure that I ran both commands before indexing.
>
> Andrzej Bialecki wrote:
>>
>> caezar wrote:
>>> Some more information. Debugging the reduce method, I've noticed that
>>> before the code
>>>   if (fetchDatum == null || dbDatum == null
>>>       || parseText == null || parseData == null) {
>>>     return; // only have inlinks
>>>   }
>>> my page has fetchDatum, parseText and parseData not null, but dbDatum is
>>> null. That's why it's skipped :)
>>> Any ideas about the reason?
>>
>> Yes - you should run updatedb with this segment, and also run
>> invertlinks with this segment, _before_ trying to index. Otherwise the
>> db status won't be updated properly.
>>
>> --
>> Best regards,
>> Andrzej Bialecki
>> Information Retrieval, Semantic Web
>> Embedded Unix, System Integration
>> http://www.sigram.com Contact: info at sigram dot com

--
View this message in context: http://www.nabble.com/Nutch-indexes-less-pages%2C-then-it-fetches-tp26078798p26095338.html
Sent from the Nutch - User mailing list archive at Nabble.com.