setting http.content.limit to -1 seems to break text parsing on some files
--------------------------------------------------------------------------
Key: NUTCH-168
URL: http://issues.apache.org/jira/browse/NUTCH-168
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.7
Environment: Windows 2000
java version "1.4.2_05"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_05-b04)
Java HotSpot(TM) Client VM (build 1.4.2_05-b04, mixed mode)
Reporter: Jerry Russell
Setting http.content.limit to -1 (which is supposed to mean no limit) causes
some pages not to be indexed. I have seen this with some PDFs and with one URL
in particular.
Steps to reproduce:
1) Install a fresh nutch-0.7.
2) Configure the URL filters to allow any URL.
3) Create a urllist containing only the following URL:
http://www.circuitsonline.net/circuits/view/71
4) Perform a crawl with a depth of 1.
5) Run segread and see that the content is there.
6) Change http.content.limit to -1 in nutch-default.xml.
7) Repeat the crawl into a new directory.
8) Run segread and see that the content is not there.
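
For reference, the change in step 6 would look roughly like the fragment
below. This is a sketch that assumes the usual property/name/value layout of
the Nutch configuration files. Only the property name and the -1 value come
from this report; the surrounding file contents are not shown.

```xml
<!-- Sketch of the step 6 edit in nutch-default.xml.
     Assumption: the property/name/value layout of the Nutch config files. -->
<property>
  <name>http.content.limit</name>
  <!-- -1 is supposed to mean "no limit" on downloaded content length -->
  <value>-1</value>
</property>
```

With this value in place, the second crawl (steps 7 and 8) is the one in which
segread no longer shows the content for the URL above.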
Contact [EMAIL PROTECTED] for more information.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira