I have started to see this problem recently. With topN=200000 per crawl, the number of fetched pages is 150000 - 170000 and the number of error pages is 2000 - 5000, so more than 25000 pages are missing. This is reproducible with Nutch 0.7.1, with both protocol-http and protocol-httpclient included.
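To make the arithmetic explicit, here is a rough sketch of how the gap can be bounded from the numbers above. The function name `missing_range` is made up for illustration and is not part of Nutch; it also assumes the extremes of the fetched and error ranges can occur in the same crawl, which may not hold in practice.

```python
def missing_range(top_n, fetched_range, error_range):
    """Return (min_missing, max_missing): pages neither fetched nor reported as errors.

    Hypothetical helper for illustration only; not a Nutch API.
    """
    fetched_lo, fetched_hi = fetched_range
    error_lo, error_hi = error_range
    # Fewest missing pages: the most fetched plus the most errors.
    min_missing = top_n - fetched_hi - error_hi
    # Most missing pages: the fewest fetched plus the fewest errors.
    max_missing = top_n - fetched_lo - error_lo
    return min_missing, max_missing


low, high = missing_range(200000, (150000, 170000), (2000, 5000))
print(low, high)
```

With the figures quoted above, the unaccounted pages fall somewhere between the two printed bounds per crawl.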
I also see lots of "Response content length is not known" in the log, but I can't find where it comes from. Which class logs this message?

AJ

On 12/19/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
>
> Hi there,
>
> is there someone out there who can confirm a problem we discovered?
>
> We were wondering why not all pages of a generated segment were
> fetched. The strangest thing was that the sum of errors and
> successful pages was never the same as the topN we defined when
> generating a segment.
> First we discovered a problem with the segment size, but I cannot
> reproduce the problem anymore with the latest trunk code. :-/
> Very strange, since I don't think anything changed, but I
> was able to reproduce that the size of the segment is around 50%
> of the defined size (topN) on 2 different map reduce installations.
>
> Anyway, today we noted that when fetching with http-client the sum of
> errors and fetched pages is much less than the size defined when
> generating the segment.
> Changing to protocol-http solves the problem.
> Has anyone else noted this behavior?
>
> Thanks for comments.
> Stefan