The nutch-default.xml file contains the configuration property
plugin.includes. Copy that property into nutch-site.xml and change
parse-(text|html|js) to parse-(text|html|js|pdf). This enables the
PDF parser plugin.
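
As a rough sketch, the override in nutch-site.xml might look something like the fragment below. Only the parse-(text|html|js|pdf) group comes from the advice above; the other plugin names in the value are an assumption standing in for whatever your own nutch-default.xml lists, so copy the real value from that file rather than from here.

```xml
<!-- conf/nutch-site.xml: properties here override conf/nutch-default.xml -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Assumption: every entry except the parse-(...) group is a
         placeholder. Copy the actual value from your nutch-default.xml
         and only add "|pdf" inside the parse group. -->
    <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)</value>
  </property>
</configuration>
```

Putting the change in nutch-site.xml instead of editing nutch-default.xml keeps the shipped defaults untouched, so the override is easy to spot and to revert.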
Dennis Kubes
Sævaldur Arnar Gunnarsson wrote:
Hi, I'm evaluating Nutch as a search platform for a large Icelandic
website.
The website has quite a large collection of Adobe Acrobat documents
(PDF) stored on a Lotus Domino server.
I run Nutch with:
./bin/nutch crawl example-domain/url-list.txt -dir example-domain/index/ -depth 9999 -topN 9999
Out of 3,072 PDF documents fetched by Nutch, 1,687 returned the
following error:
Error parsing:
http://notes.example-domain.is/vefur2.nsf/Files/fr377nr20.pdf/$file/fr377nr20.pdf:
failed(2,200): org.apache.nutch.parse.ParseException: parser not found for
contentType=application/pdf
url=http://notes.example-domain.is/vefur2.nsf/Files/fr377nr20.pdf/$file/fr377nr20.pdf
With best regards,