I have had a similar experience. Reinhard Schwab responded with a possible fix. See the mail in this group from Reinhard Schwab at Sun, 25 Oct 2009 10:03:41 +0100 (05:03 EDT).
I haven't had a chance to try it out yet.

On Tue, 2009-10-27 at 07:34 -0700, caezar wrote:
> Hi All,
>
> I've got a strange problem: Nutch indexes far fewer URLs than it
> fetches. For example, the URL
> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm.
> I assume that it was fetched successfully, because it is mentioned only
> once in the fetch logs:
> 2009-10-26 10:01:46,502 INFO org.apache.nutch.fetcher.Fetcher: fetching
> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm
>
> But it was not sent to the indexer in the indexing phase (I'm using a
> custom NutchIndexWriter, and it logs every page for which its write
> method is executed). What could be the reason? Is there a way to browse
> the crawldb to ensure that the page was really fetched? What else could
> I check?
>
> Thanks