How to configure Nutch to parse PDF and Word docs?

Teruhiko Kurosaka Fri, 23 Dec 2005 10:19:07 -0800

How can I enable parsing PDF and Word docs?

My crawling log has this and similar lines:


051221 101613 fetching http://intranet/foo/bar.pdf
051221 101617 fetch okay, but can't parse http://intranet/foo/bar.pdf, reason: failed(2,203): Content-Type not text/html: application/pdf


In nutch-site.xml, I have:
<property>
  <name>mime.type.magic</name>
  <value>true</value>
  <description>Defines if the mime content type detector uses magic
  resolution.</description>
</property>

<property>
  <name>mime.types.file</name>
  <value>mime-types.xml</value>
  <description>Name of file in CLASSPATH containing filename extension and
  magic sequence to mime types mapping information</description>
</property>


I'm using the binary distribution, and since default.properties lists the
PDF plugin, I am guessing that the binary already includes it.
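
One setting that may be relevant here is plugin.includes. The
"Content-Type not text/html" failure suggests that only the HTML parser is
active, and Nutch decides which plugins to load from that property rather
than from what was compiled into the build. Below is a minimal sketch of an
override for nutch-site.xml. It assumes the parser plugins are named
parse-pdf and parse-msword, and the rest of the value is only illustrative;
it should be copied from the plugin.includes entry in your own
nutch-default.xml, with just the parse-(...) group extended.

```xml
<!-- Sketch only: the plugin names and the surrounding value are assumptions.
     Start from the plugin.includes value in your own nutch-default.xml. -->
<property>
  <name>plugin.includes</name>
  <!-- Assumed default, with pdf and msword added to the parse-(...) group -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to include.
  </description>
</property>
```

If the names match, the corresponding directories (for example parse-pdf)
should exist under the plugins directory of the distribution.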

-Kuro
