I have started to see this problem recently. With topN=200000 per crawl,
fetched pages = 150000 - 170000 and error pages = 2000 - 5000, so more than
25000 pages are missing. This is reproducible with Nutch 0.7.1, with both
protocol-http and protocol-httpclient included.
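
To make the gap concrete, here is a small sketch of the arithmetic using the figures quoted above. The helper name `unaccounted` is made up for illustration and is not part of Nutch; the ranges are the approximate ones from the report.

```python
def unaccounted(top_n, fetched, errors):
    """Pages in the segment that were neither fetched nor reported as errors."""
    return top_n - fetched - errors


TOP_N = 200_000

# Smallest gap: the most fetched pages together with the most reported errors.
smallest_gap = unaccounted(TOP_N, fetched=170_000, errors=5_000)

# Largest gap: the fewest fetched pages together with the fewest reported errors.
largest_gap = unaccounted(TOP_N, fetched=150_000, errors=2_000)

print(smallest_gap, largest_gap)  # 25000 48000
```

So with these numbers, somewhere between roughly 25000 and 48000 pages per crawl show up neither as fetched nor as errors.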

Depending on how you have Nutch configured, redirects can result in pages getting skipped, if the redirect count exceeds the (configurable) limit.
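
As a rough illustration of how a redirect limit can make pages drop out of the counts, here is a minimal sketch. It is not Nutch's code: the function `resolve`, the `redirects` table, and the `max_redirects` parameter are all hypothetical names, and the exact behavior of the real fetcher is an assumption.

```python
def resolve(url, redirects, max_redirects):
    """Follow redirects starting at `url`.

    Returns the final URL, or None when the chain needs more hops than
    `max_redirects` allows (treated as "skipped" in this sketch).
    """
    hops = 0
    while url in redirects:
        if hops >= max_redirects:
            return None  # limit exceeded: the page is skipped
        url = redirects[url]
        hops += 1
    return url


# Hypothetical redirect chain: a -> b -> c -> d (three hops).
redirects = {"a": "b", "b": "c", "c": "d"}

print(resolve("a", redirects, max_redirects=3))  # d
print(resolve("a", redirects, max_redirects=2))  # None
```

With a limit of 2, the three-hop chain is abandoned, so under this sketch the page would count as neither fetched nor failed.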

I don't know whether a "not found" HTTP status results in the page being skipped (that is, not reported as an error).

I also see lots of "Response content length is not known" messages in the
log, but I can't find where they come from. Which class logs this message?

This is coming from the Jakarta commons httpclient code:

/src/java/org/apache/commons/httpclient/HttpMethodBase.java

-- Ken

On 12/19/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:

 Hi there,

 Is there someone out there who can confirm a problem we discovered?

 We were wondering why not all pages of a generated segment were
 fetched. The strangest thing was that the sum of errors and
 successfully fetched pages was never the same as the topN we defined
 when generating a segment.
 First we discovered a problem with the segment size, but I cannot
 reproduce the problem anymore with the latest trunk code. :-/
 Very strange, since I don't think anything changed, but I
 was able to reproduce that the size of the segment is around 50%
 of the defined size (topN) on 2 different map reduce installations.

 Anyway, today we noticed that when fetching with http-client the sum of
 errors and fetched pages is much less than the size defined when
 generating the segment.
 Changing to protocol-http solves the problem.
 Has anyone else noticed this behavior?

 Thanks for comments.
 Stefan


--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
