RE: How to configure Nutch to parse PDF and Word docs?

Teruhiko Kurosaka Fri, 23 Dec 2005 11:08:39 -0800

Oh, never mind.  I guess I needed to change the value of the property
plugin.includes in nutch-site.xml to include parse-pdf.
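
A minimal sketch of what that override might look like in nutch-site.xml.
Only plugin.includes and parse-pdf come from this thread.  The rest of the
value is an assumption for illustration: start from whatever value your own
nutch-default.xml lists and add the parser names to it.  parse-msword is
assumed to be the plugin for Word documents and is not confirmed here.

```xml
<!-- Place inside the existing root element of nutch-site.xml. -->
<property>
  <name>plugin.includes</name>
  <!-- Assumed baseline value; copy the real one from your nutch-default.xml
       and add pdf (and, if available, msword) to the parse-(...) group. -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming the plugins to load.
  A parser plugin that does not match this expression is not used, even if
  it ships with the distribution.</description>
</property>
```

Documents fetched before the change were stored without parsed text, so they
would presumably need to be fetched again to be parsed.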


I'm sorry for the noise.

> -----Original Message-----
> From: Teruhiko Kurosaka [mailto:[EMAIL PROTECTED] 
> Sent: 2005-12-23 10:19
> To: [email protected]
> Subject: How to configure Nutch to parse PDF and Word docs?
> 
> 
> How can I enable parsing PDF and Word docs?
> 
> My crawling log has this and similar lines:
> 
> 051221 101613 fetching http://intranet/foo/bar.pdf
> 051221 101617 fetch okay, but can't parse http://intranet/foo/bar.pdf, reason: failed(2,203): Content-Type not text/html: application/pdf
> 
> 
> In nutch-site.xml, I have:
> <property>
>   <name>mime.type.magic</name>
>   <value>true</value>
>   <description>Defines if the mime content type detector uses magic
>   resolution.
>   </description>
> </property>
> 
> <property>
>   <name>mime.types.file</name>
>   <value>mime-types.xml</value>
>   <description>Name of file in CLASSPATH containing filename extension
>   and magic sequence to mime types mapping information</description>
> </property>
> 
> 
> I'm using the binary distribution, and since default.properties lists
> the PDF plugin, I am guessing the binary is linked with the PDF plugin.
> 
> -Kuro
>
