Oh, never mind. I guess I needed to change the value of the property plugin.includes in nutch-site.xml to include parse-pdf.
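
For anyone who finds this thread later, here is a rough sketch of what such an override might look like in nutch-site.xml. The surrounding plugin list is only an assumption on my part; copy the actual default value of plugin.includes from the nutch-default.xml shipped with your version and just add pdf to the parse-(...) group. Adding msword for Word documents assumes your distribution ships a parse-msword plugin.

```xml
<!-- Sketch only: the plugin list below is an assumed example, not the
     verified default. Start from the value in your own nutch-default.xml. -->
<property>
  <name>plugin.includes</name>
  <!-- "pdf" enables parse-pdf; "msword" assumes a parse-msword plugin exists -->
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include. Here it is extended to cover the PDF and Word parsers.
  </description>
</property>
```

The value is a regular expression matched against plugin directory names, so parse-(text|html|pdf|msword) selects parse-text, parse-html, parse-pdf and parse-msword. The pages that failed with "Content-Type not text/html" would presumably need to be fetched again before they get parsed.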
I'm sorry for the noise.

> -----Original Message-----
> From: Teruhiko Kurosaka [mailto:[EMAIL PROTECTED]
> Sent: 2005-12-23 10:19
> To: [email protected]
> Subject: How to configure Nutch to parse PDF and Word docs?
>
> How can I enable parsing PDF and Word docs?
>
> My crawling log has this and similar lines:
>
> 051221 101613 fetching http://intranet/foo/bar.pdf
> 051221 101617 fetch okay, but can't parse http://intranet/foo/bar.pdf, reason: failed(2,203): Content-Type not text/html: application/pdf
>
> In nutch-site.xml, I have:
> <property>
>   <name>mime.type.magic</name>
>   <value>true</value>
>   <description>Defines if the mime content type detector uses magic
>   resolution.
>   </description>
> </property>
>
> <property>
>   <name>mime.types.file</name>
>   <value>mime-types.xml</value>
>   <description>Name of file in CLASSPATH containing filename extension and
>   magic sequence to mime types mapping information</description>
> </property>
>
> I'm using the binary distribution and since default.properties has PDF
> plugin, I am guessing the binary is linked with PDF plugin.
>
> -Kuro
