I have started to see this problem recently. With topN=200000 per crawl,
fetched pages = 150000 - 170000 and error pages = 2000 - 5000, so more
than 25000 pages are missing. This is reproducible with Nutch 0.7.1, with
both protocol-http and protocol-httpclient included.
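
To make the arithmetic behind the missing-page estimate explicit, here is a
minimal sketch. The class and method names are hypothetical and are not part
of Nutch; the numbers are the ranges reported above.

```java
final class FetchAccounting {
    /** Pages in the segment that were neither fetched nor reported as errors. */
    static int unaccounted(int topN, int fetched, int errors) {
        return topN - fetched - errors;
    }

    public static void main(String[] args) {
        int topN = 200000;
        // Best case from the report: highest fetched count, highest error count.
        System.out.println(unaccounted(topN, 170000, 5000)); // 25000
        // Worst case from the report: lowest fetched count, lowest error count.
        System.out.println(unaccounted(topN, 150000, 2000)); // 48000
    }
}
```

So with the reported ranges, roughly 25000 to 48000 pages per crawl are
neither fetched nor counted as errors.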
Depending on how you have Nutch configured, redirects can result in
pages getting skipped, if the redirect count exceeds the
(configurable) limit.
I don't know whether a "not found" HTTP status results in the page being
skipped (rather than reported as an error).
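
As an illustration of how a redirect limit can make pages silently drop out
of both the "fetched" and the "error" counts, here is a rough sketch. It is
not Nutch's fetcher code; the class, the method, and the in-memory redirect
map are all assumptions made for the example.

```java
import java.util.Map;
import java.util.Optional;

final class RedirectLimitSketch {
    /**
     * Follows redirects starting at 'start'. Returns the final URL, or
     * Optional.empty() if more than 'maxRedirects' hops would be needed,
     * in which case the page is skipped rather than counted as an error.
     * 'redirects' is a hypothetical stand-in for real HTTP 3xx responses.
     */
    static Optional<String> resolve(Map<String, String> redirects,
                                    String start, int maxRedirects) {
        String url = start;
        int followed = 0;
        while (redirects.containsKey(url)) {
            if (followed >= maxRedirects) {
                return Optional.empty(); // limit exceeded: page is skipped
            }
            url = redirects.get(url);
            followed++;
        }
        return Optional.of(url);
    }
}
```

With a chain a -> b -> c -> d, a limit of 3 reaches d, while a limit of 2
gives up at c and the page is never fetched. If the skipped case is not
logged as an error, it would show up only as a gap against topN.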
I also see lots of "Response content length is not known" messages in the
log, but I can't find where they come from. Which class logs this message?
This is coming from the Jakarta commons httpclient code:
/src/java/org/apache/commons/httpclient/HttpMethodBase.java
-- Ken
On 12/19/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
Hi there,
Is there someone out there who can confirm a problem we discovered?
We were wondering why not all pages of a generated segment were
fetched. The strangest thing was that the sum of errors and
successfully fetched pages was never the same as the topN we defined
when generating a segment.
First we discovered a problem with the segment size, but I cannot
reproduce the problem anymore with the latest trunk code. :-/
Very strange, since I don't think anything changed, but I was able
to reproduce that the size of the segment is around 50% of the
defined size (topN) on 2 different map reduce installations.
Anyway, today we noticed that when fetching with http-client the sum of
errors and fetched pages is much less than the size defined when
generating the segment.
Changing to protocol-http solves the problem.
Has anyone else noticed this behavior?
Thanks for comments.
Stefan
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200