[ https://issues.apache.org/jira/browse/NUTCH-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13806424#comment-13806424 ]
Lewis John McGibbney commented on NUTCH-1643:
---------------------------------------------
[~talat] Thanks for the patch.
Two things:
* I've also looked into the other protocol plugins, and more could be covered
by this issue. protocol-httpclient is a definite candidate, as it seems to
suffer from the same problem. Would you like to have a look and see where else
improvements can be made? This is of course up to you.
* I am not entirely sure about storing the content as null. My reasoning is as
follows: say I had http.content.limit set, but also parser.skip.truncated set
to false. Then there would be no content at all to parse, because the value is
null (an NPE is in the back of my mind).
Is there some other solution that strikes a balance here?
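
To make the concern concrete, here is a minimal sketch of one possible
alternative: keep the content non-null (an empty byte array) and record
truncation in a separate flag. All names below (FetchedContent, parse, the
skipTruncated parameter) are hypothetical and are not taken from Nutch or from
the attached patch; this only illustrates the idea.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: none of these names come from Nutch or the patch.
public class TruncatedContentSketch {

    /** Holds fetched bytes; content is never null, truncation is a separate flag. */
    static final class FetchedContent {
        private static final byte[] EMPTY = new byte[0];
        private final byte[] bytes;
        private final boolean truncated;

        FetchedContent(byte[] bytes, boolean truncated) {
            // Normalise null to an empty array so callers never dereference null.
            this.bytes = (bytes == null) ? EMPTY : bytes;
            this.truncated = truncated;
        }

        byte[] bytes() { return bytes; }

        boolean isTruncated() { return truncated; }
    }

    /** Stand-in for a parser that honours a skip-truncated setting. */
    static String parse(FetchedContent content, boolean skipTruncated) {
        if (content.isTruncated() && skipTruncated) {
            return "skipped";
        }
        // Safe even when the fetcher stored nothing: the length is 0, no NPE.
        return "parsed " + content.bytes().length + " bytes";
    }

    public static void main(String[] args) {
        // The fetcher declined to download the body and marked it truncated.
        FetchedContent skippedFetch = new FetchedContent(null, true);
        System.out.println(parse(skippedFetch, true));
        System.out.println(parse(skippedFetch, false));

        FetchedContent page = new FetchedContent(
                "<html></html>".getBytes(StandardCharsets.UTF_8), false);
        System.out.println(parse(page, false));
    }
}
```

With this shape, the skip-truncated=false case still reaches the parser with
zero bytes rather than a null reference, so the parser can decide what to do.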
> Unnecessary fetching with http.content.limit when using protocol-http
> ---------------------------------------------------------------------
>
> Key: NUTCH-1643
> URL: https://issues.apache.org/jira/browse/NUTCH-1643
> Project: Nutch
> Issue Type: Bug
> Components: protocol
> Affects Versions: 2.1, 2.2, 2.2.1
> Reporter: Talat UYARER
> Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1643.patch
>
>
> In protocol-http, even if I have the http.content.limit value set,
> protocol-http fetches files of all sizes (larger files are fetched up to the
> limit). But when parsing, the parser skips incomplete files (if the
> parser.skip.truncated configuration is true). It seems like an unnecessary
> effort to partially fetch content larger than the limit if it is not going
> to be parsed.
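
The decision described above can be sketched as follows. This is only an
illustration under stated assumptions: the class, enum, and method names are
hypothetical, the declared length is assumed to come from the response's
Content-Length header, and a negative limit is assumed to mean "no limit". It
is not the code in NUTCH-1643.patch.

```java
import java.util.OptionalLong;

// Hypothetical sketch; names are illustrative, not taken from protocol-http.
public class ContentLimitSketch {

    enum Decision { FETCH, FETCH_UP_TO_LIMIT, SKIP }

    /**
     * @param declaredLength Content-Length reported by the server, if any
     * @param contentLimit   stand-in for http.content.limit (assumed: negative = no limit)
     * @param skipTruncated  stand-in for parser.skip.truncated
     */
    static Decision decide(OptionalLong declaredLength, long contentLimit,
                           boolean skipTruncated) {
        if (contentLimit < 0) {
            return Decision.FETCH; // no limit configured
        }
        if (!declaredLength.isPresent()) {
            // Size unknown up front: all we can do is read until the limit.
            return Decision.FETCH_UP_TO_LIMIT;
        }
        if (declaredLength.getAsLong() <= contentLimit) {
            return Decision.FETCH; // fits entirely within the limit
        }
        // Larger than the limit: a partial fetch only pays off if the
        // parser is willing to work on truncated content.
        return skipTruncated ? Decision.SKIP : Decision.FETCH_UP_TO_LIMIT;
    }

    public static void main(String[] args) {
        System.out.println(decide(OptionalLong.of(2000), 1000, true));
        System.out.println(decide(OptionalLong.of(2000), 1000, false));
        System.out.println(decide(OptionalLong.of(500), 1000, true));
        System.out.println(decide(OptionalLong.empty(), 1000, true));
    }
}
```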
--
This message was sent by Atlassian JIRA
(v6.1#6144)