I have started to see this problem recently. With topN=200000 per crawl,
fetched pages = 150000 - 170000 and error pages = 2000 - 5000, so at least
25000 pages are missing. This is reproducible with Nutch 0.7.1, with both
protocol-http and protocol-httpclient included.
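
To make the arithmetic explicit, here is a rough sketch in Python. The
function name and the (low, high) tuples are only for illustration and are
not part of Nutch; the numbers are the ones quoted above.

```python
def missing_range(top_n, fetched, errors):
    """Return (min_missing, max_missing) pages for one crawl.

    fetched and errors are (low, high) tuples of observed counts.
    """
    fetched_lo, fetched_hi = fetched
    errors_lo, errors_hi = errors
    # Fewest pages are missing when both fetched and errors are at their highest.
    min_missing = top_n - fetched_hi - errors_hi
    # Most pages are missing when both are at their lowest.
    max_missing = top_n - fetched_lo - errors_lo
    return min_missing, max_missing


lo, hi = missing_range(200000, (150000, 170000), (2000, 5000))
print(lo, hi)  # 25000 48000
```

So between 25000 and 48000 pages per crawl are unaccounted for.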

I also see lots of "Response content length is not known" in the log, but
I can't find where it comes from.  Which class logs this message?

AJ

On 12/19/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
>
> Hi there,
>
> Is there someone out there who can confirm a problem we discovered?
>
> We were wondering why not all pages of a generated segment were
> fetched. The strangest thing was that the sum of errors and
> successful pages was never the same as the topN we defined when
> generating a segment.
> First we discovered a problem with the segment size, but I cannot
> reproduce the problem anymore with the latest trunk code. :-/
> Very strange, since I don't think anything changed, but I
> was able to reproduce that the size of the segment was around 50%
> of the defined size (topN) on 2 different map reduce installations.
>
> Anyway, today we noticed that when fetching with http-client the sum of
> errors and fetched pages is much less than the size defined when
> generating the segment.
> Changing to protocol-http solves the problem.
> Has anyone else noticed this behavior?
>
> Thanks for comments.
> Stefan
