[
https://issues.apache.org/jira/browse/NUTCH-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Talat UYARER updated NUTCH-1643:
--------------------------------
Attachment: NUTCH-1643v3.patch
Hi [~lewismc], you have been working very hard today :) I added the necessary
code for protocol-httpclient. For the other protocols we don't know the remote
content size, so we can't prevent the unnecessary fetching there :(. I think
this is acceptable for us.
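
To illustrate the idea (a minimal sketch only, not the code in the attached
patch): when the server announces a Content-Length larger than
http.content.limit and parser.skip.truncated is true, the body can be skipped
instead of partially fetched. The class and method names below are
hypothetical and are not part of Nutch; a negative limit is treated as
"no truncation", matching the documented meaning of http.content.limit.

```java
import java.util.OptionalLong;

// Hypothetical helper for illustration; not a Nutch class.
public class ContentLimitCheck {

    /** Parses a Content-Length header value; empty if absent or malformed. */
    public static OptionalLong parseContentLength(String headerValue) {
        if (headerValue == null) {
            return OptionalLong.empty();
        }
        try {
            long length = Long.parseLong(headerValue.trim());
            return length >= 0 ? OptionalLong.of(length) : OptionalLong.empty();
        } catch (NumberFormatException e) {
            return OptionalLong.empty();
        }
    }

    /**
     * Returns true when fetching the body would be wasted effort: the parser
     * skips truncated content, a limit is set, and the announced size exceeds
     * it. If the size is unknown, the fetch proceeds as before.
     */
    public static boolean shouldSkipFetch(String contentLengthHeader,
                                          int contentLimit,
                                          boolean skipTruncated) {
        if (!skipTruncated || contentLimit < 0) {
            return false; // truncated content is still parsed, or no limit set
        }
        OptionalLong length = parseContentLength(contentLengthHeader);
        return length.isPresent() && length.getAsLong() > contentLimit;
    }

    public static void main(String[] args) {
        System.out.println(shouldSkipFetch("131072", 65536, true)); // true
        System.out.println(shouldSkipFetch(null, 65536, true));     // false
        System.out.println(shouldSkipFetch("131072", -1, true));    // false
    }
}
```

When the header is missing, the sketch falls back to the old behaviour, which
is the same limitation as for the protocols where the remote size is unknown.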
> Unnecessary fetching with http.content.limit when using protocol-http
> ---------------------------------------------------------------------
>
> Key: NUTCH-1643
> URL: https://issues.apache.org/jira/browse/NUTCH-1643
> Project: Nutch
> Issue Type: Bug
> Components: protocol
> Affects Versions: 2.1, 2.2, 2.2.1
> Reporter: Talat UYARER
> Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1643.patch, NUTCH-1643v2.patch, NUTCH-1643v3.patch
>
>
> In protocol-http, even if the http.content.limit value is set, protocol-http
> fetches files of all sizes (larger files are fetched up to the limit).
> But when parsing, the parser skips incomplete files (if the
> parser.skip.truncated configuration is true). It seems like unnecessary
> effort to partially fetch content larger than the limit if it is not going
> to be parsed.
--
This message was sent by Atlassian JIRA
(v6.1#6144)