ASF GitHub Bot commented on NUTCH-2666:

sebastian-nagel commented on pull request #427: NUTCH-2666 Increase default 
value for http.content.limit / ftp.content.limit / file.content.limit
URL: https://github.com/apache/nutch/pull/427
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:

> Increase default value for http.content.limit / ftp.content.limit / 
> file.content.limit
> --------------------------------------------------------------------------------------
>                 Key: NUTCH-2666
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2666
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.15
>            Reporter: Marco Ebbinghaus
>            Priority: Minor
>             Fix For: 1.16
> The default value of http.content.limit in nutch-default.xml is 64 kB. The 
> property is described there as: "The length limit for downloaded content 
> using the http:// protocol, in bytes. If this value is nonnegative (>=0), 
> content longer than it will be truncated; otherwise, no truncation at all. 
> Do not confuse this setting with the file.content.limit setting." 
> Maybe this default should be increased, as many pages today are larger 
> than 64 kB.
> This hit me when trying to crawl a single website whose pages are much 
> larger than 64 kB. With every crawl cycle the count of db_unfetched URLs 
> decreased until it hit zero and the crawler became inactive, because the 
> first 64 kB of each page always contained the same set of navigation links.
> The description might also be updated, as it applies not only to the 
> http protocol but also to https.
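
The effect described in the report can be illustrated with a small sketch. This is not Nutch code: truncate_content and extract_links are hypothetical helpers, the sample page is made up, and the only thing taken from the issue is the rule quoted from nutch-default.xml (a nonnegative limit truncates, a negative one does not) and the 64 kB value.

```python
import re
from typing import List


def truncate_content(content: bytes, limit: int) -> bytes:
    """Hypothetical helper: apply a content limit as the property text describes.

    A nonnegative limit cuts the content to at most `limit` bytes;
    a negative limit means no truncation at all.
    """
    if limit >= 0:
        return content[:limit]
    return content


def extract_links(html: bytes) -> List[str]:
    """Hypothetical, very naive link extractor (href="..." only)."""
    return [m.decode("ascii") for m in re.findall(rb'href="([^"]+)"', html)]


# Made-up page: navigation links first, then more than 64 kB of filler,
# then the link that is unique to this page.
nav = b'<a href="/home">Home</a><a href="/about">About</a>'
filler = b"x" * 70000
body = b'<a href="/article/1">Article 1</a>'
page = nav + filler + body

# With a 64 kB limit only the navigation links survive truncation.
links_64k = extract_links(truncate_content(page, 65536))

# With a negative limit the whole page is kept and the article link is found.
links_unlimited = extract_links(truncate_content(page, -1))
```

If every page on a site starts with the same navigation block and exceeds the limit, each fetch yields only links that are already known, which would match the shrinking db_unfetched count reported above.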

This message was sent by Atlassian JIRA
