setting http.content.limit to -1 seems to break text parsing on some files
--------------------------------------------------------------------------

         Key: NUTCH-168
         URL: http://issues.apache.org/jira/browse/NUTCH-168
     Project: Nutch
        Type: Bug
  Components: fetcher  
    Versions: 0.7    
 Environment: Windows 2000
java version "1.4.2_05"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_05-b04)
Java HotSpot(TM) Client VM (build 1.4.2_05-b04, mixed mode)
    Reporter: Jerry Russell


Setting http.content.limit to -1 (which is supposed to mean no limit) causes 
some pages not to be indexed. I have seen this with some PDFs and with this 
one URL in particular. The steps to reproduce are below; a sketch of the 
suspected read-loop failure follows them.
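
For reference, step 6 below edits this property in conf/nutch-default.xml. 
The entry looks roughly like the following, with the stock default of 65536 
replaced by -1 (the description text here is paraphrased, not verbatim):

    <property>
      <name>http.content.limit</name>
      <value>-1</value>
      <description>The length limit for downloaded content, in bytes.
      Content longer than this is truncated; a negative value is supposed
      to disable truncation entirely.
      </description>
    </property>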

Reproduce:

  1) install fresh nutch-0.7
  2) configure urlfilters to allow any URL
  3) create urllist with only the following URL: 
http://www.circuitsonline.net/circuits/view/71
  4) perform a crawl with a depth of 1
  5) do segread and see that the content is there
  6) change the http.content.limit to -1 in nutch-default.xml 
  7) repeat the crawl to a new directory 
  8) do segread and see that the content is not there
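
The symptom suggests the fetcher's read loop treats -1 as a literal byte 
budget instead of "no limit". The following is only a guess at the failure 
mode, not Nutch's actual code: a minimal Java sketch with a hypothetical 
readContent helper, showing how that arithmetic can empty the content:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;

    public class ContentLimitSketch {

      // Reads the response body, honoring contentLimit; -1 should mean
      // "no limit".
      static byte[] readContent(InputStream in, int contentLimit)
          throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int total = 0;
        int n;
        while ((n = in.read(buf)) > 0) {
          // Buggy variant: "if (total + n > contentLimit)". With
          // contentLimit == -1 that test is true on the first read, the
          // truncation below computes a negative length, and the loop
          // exits with zero bytes, matching the empty content in step 8.
          // Guarding on contentLimit >= 0 preserves the -1 meaning.
          if (contentLimit >= 0 && total + n > contentLimit) {
            n = contentLimit - total;  // truncate final read to the limit
            if (n <= 0) break;
          }
          out.write(buf, 0, n);
          total += n;
        }
        return out.toByteArray();
      }
    }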

Contact [EMAIL PROTECTED] for more information.

