[ https://issues.apache.org/jira/browse/NUTCH-168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578975#action_12578975 ]

Andrzej Bialecki commented on NUTCH-168:
-----------------------------------------

This branch has an End Of Life status. I believe this issue is fixed in recent 
branches.

> setting http.content.limit to -1 seems to break text parsing on some files
> --------------------------------------------------------------------------
>
>                 Key: NUTCH-168
>                 URL: https://issues.apache.org/jira/browse/NUTCH-168
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.7
>         Environment: Windows 2000
> java version "1.4.2_05"
> Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_05-b04)
> Java HotSpot(TM) Client VM (build 1.4.2_05-b04, mixed mode)
>            Reporter: Jerry Russell
>
> Setting http.content.limit to -1 (which is supposed to mean no limit) causes 
> some pages not to be indexed. I have seen this with some PDFs and with this 
> one URL in particular. The steps to reproduce are below; an illustrative 
> sketch of one possible failure mode follows them.
> Reproduce:
>   1) install fresh nutch-0.7
>   2) configure urlfilters to allow any URL
>   3) create urllist with only the following URL: 
> http://www.circuitsonline.net/circuits/view/71
>   4) perform a crawl with a depth of 1
>   5) do segread and see that the content is there
>   6) change http.content.limit to -1 in nutch-default.xml (see the 
> property snippet after these steps)
>   7) repeat the crawl to a new directory 
>   8) do segread and see that the content is not there
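>
> For step 6, the entry in conf/nutch-default.xml ends up looking roughly 
> like this (a sketch based on the 0.7 default config; the description text 
> is paraphrased, not copied verbatim):
>
>   <property>
>     <name>http.content.limit</name>
>     <value>-1</value>
>     <description>The length limit for downloaded content, in bytes.
>     If this value is nonnegative (>=0), content longer than it will be
>     truncated; otherwise, no truncation at all.</description>
>   </property>
>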
> contact [EMAIL PROTECTED] for more information.
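>
> As a purely illustrative aside (this is not Nutch's actual fetcher code), 
> one plausible failure mode is code that sizes a buffer from the configured 
> limit, which breaks as soon as the limit is -1. A minimal Java sketch, with 
> hypothetical names throughout:
>
>   import java.io.ByteArrayOutputStream;
>   import java.io.IOException;
>   import java.io.InputStream;
>
>   public class ContentLimitSketch {
>     // Buggy pattern: "byte[] buf = new byte[limit];" throws
>     // NegativeArraySizeException when limit == -1, after which the
>     // page's content is silently dropped by the caller.
>
>     // Safer pattern: treat a negative limit as "unbounded" and grow
>     // the buffer as bytes arrive, truncating only when limit >= 0.
>     static byte[] readContent(InputStream in, int limit) throws IOException {
>       ByteArrayOutputStream out = new ByteArrayOutputStream();
>       byte[] chunk = new byte[4096];
>       int total = 0;
>       int n;
>       while ((n = in.read(chunk)) != -1) {
>         if (limit >= 0 && total + n > limit) {
>           out.write(chunk, 0, limit - total); // truncate at the limit
>           break;
>         }
>         out.write(chunk, 0, n);
>         total += n;
>       }
>       return out.toByteArray();
>     }
>   }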

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
