Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by SebastienLeCallonnec:
http://wiki.apache.org/nutch/FAQ

The comment on the change is:
Added FAQ on how to increase size of downloaded docs.

------------------------------------------------------------------------------
  
  Now you can invoke the crawler and index all or part of your disk. The only 
remaining gotcha is that Mozilla will '''not''' load file: URLs from a web page 
fetched over http. So if you test with the Nutch web container running in 
Tomcat, nothing will happen when you click on results, because Mozilla by 
default does not load file: URLs. This is mentioned 
[http://www.mozilla.org/quality/networking/testing/filetests.html here], and 
the behavior may be disabled by a 
[http://www.mozilla.org/quality/networking/docs/netprefs.html preference] (see 
security.checkloaduri). IE5 does not have this problem.
  
+ '''While indexing documents, I get the following error:''' ''050529 011245 
fetch okay, but can't parse myfile, reason: Content truncated at 65536 bytes. 
Parser can't handle incomplete msword file.'' '''What is happening?'''
  
+ By default, Nutch limits the size of each downloaded document to 65536 
bytes and truncates anything larger, which is why the parser reports an 
incomplete file. To allow Nutch to download larger files (via HTTP), modify 
nutch-site.xml and add an entry like this:
+ 
+ <property>
+   <name>http.content.limit</name>
+   <value>'''150000'''</value>
+ </property> 
+ 
+ If you do not want to limit the size of downloaded documents, set 
http.content.limit to a negative value.
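+ 
+ As a minimal sketch of the unlimited case described above, the entry in 
nutch-site.xml could look like the following. The value -1 is only an 
illustrative choice of negative number, and the fragment is assumed to sit 
inside the file's existing root element:

```xml
<!-- Sketch only: a negative value turns off the size limit for documents
     fetched over HTTP. -1 is an illustrative choice of negative value. -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
```

With no limit, very large documents are downloaded in full, so a generous positive limit may be the safer choice on a broad crawl.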
  ----
  
  == Segment Handling ==


_______________________________________________
Nutch-cvs mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-cvs
