setting http.content.limit to -1 seems to break text parsing on some files
--------------------------------------------------------------------------
Key: NUTCH-168
URL: http://issues.apache.org/jira/browse/NUTCH-168
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.7
Environment: Windows 2000
java version "1.4.2_05"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_05-b04)
Java HotSpot(TM) Client VM (build 1.4.2_05-b04, mixed mode)
Reporter: Jerry Russell
Setting http.content.limit to -1 (which is supposed to mean "no limit") causes
some pages not to be parsed or indexed. I have seen this with some PDFs, and
with the URL in step 3 below in particular. Steps to reproduce:
1) install a fresh nutch-0.7
2) configure the URL filters to allow any URL
3) create a url list containing only the following URL:
http://www.circuitsonline.net/circuits/view/71
4) perform a crawl with a depth of 1
5) run segread and verify that the content is there
6) change http.content.limit to -1 in nutch-default.xml (see the command sketch after these steps)
7) repeat the crawl into a new directory
8) run segread and see that the content is now missing
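For reference, steps 4 through 8 correspond roughly to the commands below in a
Nutch 0.7 checkout. The http.content.limit property itself comes from
conf/nutch-default.xml, but the crawl directory names and the exact segread
options shown here are illustrative assumptions and may need adjusting:

  # steps 3-5: crawl with the default content limit, then dump the segment
  bin/nutch crawl urls -dir crawl-default -depth 1
  bin/nutch segread -dump crawl-default/segments/*    # parsed content is present

  # step 6: disable truncation by overriding http.content.limit,
  # in conf/nutch-default.xml (or, preferably, conf/nutch-site.xml):
  #   <property>
  #     <name>http.content.limit</name>
  #     <value>-1</value>  <!-- a negative value is documented as "no limit" -->
  #   </property>

  # steps 7-8: repeat the crawl into a fresh directory and dump again
  bin/nutch crawl urls -dir crawl-nolimit -depth 1
  bin/nutch segread -dump crawl-nolimit/segments/*    # parsed text is now missing

With the default limit (65536 bytes in nutch-default.xml) the page fetches and
parses; after switching the value to -1, the same crawl leaves the segment
without parsed content.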
Contact [EMAIL PROTECTED] for more information.