I'm using Nutch 0.9 and attempting to index PDFs. To enable PDF parsing,
I've added the following to my nutch-site.xml:
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(text|html|js|msword|pdf)|
index-basic|query-(basic|site|url)|summary-basic|scoring-opic|
urlnormalizer-(pass|regex|basic)</value>
</property>
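
As a sanity check on the value above, here is a minimal Python sketch. It assumes that Nutch treats plugin.includes as a regular expression that must match each plugin id in full. The helper `enabled_plugins` and the `candidates` list are hypothetical and exist only for illustration; they are not part of Nutch.

```python
import re

# Value of plugin.includes from the property above, joined onto one line.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(text|html|js|msword|pdf)|"
    "index-basic|query-(basic|site|url)|summary-basic|scoring-opic|"
    "urlnormalizer-(pass|regex|basic)"
)


def enabled_plugins(includes, plugin_ids):
    """Return the plugin ids that the regex accepts as a whole-string match.

    Assumption: this mirrors how Nutch filters plugin ids; it is not Nutch code.
    """
    pattern = re.compile(includes)
    return [pid for pid in plugin_ids if pattern.fullmatch(pid)]


# Hypothetical plugin ids to test against the expression.
candidates = [
    "parse-pdf",
    "parse-html",
    "parse-rss",
    "index-basic",
    "protocol-httpclient",
]
selected = enabled_plugins(PLUGIN_INCLUDES, candidates)
print(selected)

# Same value, but with a literal line break kept after one of the '|' characters.
wrapped = PLUGIN_INCLUDES.replace("|index-basic", "|\nindex-basic")
print(enabled_plugins(wrapped, ["index-basic"]))
```

Under that assumption, parse-pdf is selected by the expression. The second call shows that, in this sketch, a literal line break inside the value stops index-basic from matching. Whether Nutch 0.9 preserves or trims such whitespace is not something the sketch establishes.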
Without this change to nutch-site.xml, the crawl completes fine
(there are PDFs linked from the crawled sites). With the change, however,
I always end up with the following error:
LinkDb: done
Indexer: starting
Indexer: linkdb: testdir/linkdb
Indexer: adding segment: testdir/segments/20071120114708
Indexer: adding segment: testdir/segments/20071120114717
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
    at org.apache.nutch.indexer.Indexer.index(Indexer.java:273)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:134)
Is there any reason that Nutch wouldn't be able to index these PDF
files?
Thanks,
Chris