Guillaume,

You'll need to increase the value of the http.content.limit property
(or set it to unlimited). To do this, copy the property from
(nutch_install_dir)/conf/nutch-default.xml into nutch-site.xml, as shown below:

<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
</property>

You'll want to change <value>65536</value> to something larger or to "-1" for 
unlimited size.
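
For example, the override in nutch-site.xml might end up looking like the
sketch below. I'm assuming here that the entry goes inside the root element
your nutch-site.xml already has, and that the <description> element can be
left out of the override; if in doubt, keep the full property as copied from
nutch-default.xml and change only the value.

```xml
<!-- Sketch of an override in (nutch_install_dir)/conf/nutch-site.xml. -->
<!-- Assumption: place this inside the file's existing root element. -->
<property>
  <name>http.content.limit</name>
  <!-- A negative value means no truncation; a larger byte count also works. -->
  <value>-1</value>
</property>
```

With no limit, large files are downloaded in full, so a larger fixed value
may be the safer choice if the crawl can reach very big documents.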

Good luck,

bill

-----Original Message-----
From: guillaume lefebvre [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, April 06, 2005 10:08 AM
To: nutch-user
Subject: Re: PDF, XML, DOC, RTF Parsing

It's OK, I found it.

But now I get some strange errors:

050406 155957 fetch okay, but can't parse 
http://localhost:8080/testsIndex/file.doc, reason: Content truncated at 70954 
bytes. Parser can't handle incomplete msword file.
050406 155958 fetch okay, but can't parse 
http://localhost:8080/testsIndex/file.pdf, reason: Content truncated at 70957 
bytes. Parser can't handle incomplete pdf file.
050406 160001 fetch okay, but can't parse 
http://localhost:8080/testsIndex/file.rtf, reason: Exception parsing RTF 
document

Thank you for helping me.

Guillaume

> How can I enable these parsers? Which files must be modified, and
> how?
> 
> Thank you very much!
> 
> Guillaume
> 
> 
> > You have to enable these parsers in your plugin configuration. I know
> > pdf and doc work great myself; not sure about the others being
> > supported.
> > 
> > -byron
> > 
> > -----Original Message-----
> > From: "guillaume lefebvre" <[EMAIL PROTECTED]>
> > To: "nutch-user" <nutch-user@incubator.apache.org>
> > Date: Wed,  6 Apr 2005 13:41:43 +0200
> > Subject: PDF, XML, DOC, RTF Parsing
> > 
> > > Hi,
> > > 
> > > I'm a new user of Nutch.
> > > 
> > > I have some problems indexing PDF, XML, DOC, and RTF files. Is that
> > > normal? Does Nutch support PDF, XML, DOC, and RTF parsing?
> > > 
> > > Thank you !
> > > Guillaume
> > > 
> > > 
> > > Access La Poste e-mail: www.laposte.net;
> > > 3615 LAPOSTENET (0.34 EUR/min); tel: 08 92 68 13 50 (0.34 EUR/min)
> > > 
> > > 
> > > 
> > 
> > 
> 