How can I enable parsing PDF and Word docs? My crawling log has this and similar lines:

051221 101613 fetching http://intranet/foo/bar.pdf
051221 101617 fetch okay, but can't parse http://intranet/foo/bar.pdf, reason: failed(2,203): Content-Type not text/html: application/pdf

In nutch-site.xml, I have:

<property>
  <name>mime.type.magic</name>
  <value>true</value>
  <description>Defines if the mime content type detector uses magic resolution.</description>
</property>

<property>
  <name>mime.types.file</name>
  <value>mime-types.xml</value>
  <description>Name of file in CLASSPATH containing filename extension and magic sequence to mime types mapping information</description>
</property>

I'm using the binary distribution, and since default.properties has the PDF plugin, I am guessing the binary is linked with the PDF plugin.

-Kuro
