I'm using Nutch 0.9 and attempting to index PDFs, so I've added the
following to my nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|msword|pdf)|
index-basic|query-(basic|site|url)|summary-basic|scoring-opic|
urlnormalizer-(pass|regex|basic)</value>
</property>
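As a sanity check on the value above, here is a minimal Python sketch. It assumes that plugin.includes is treated as a regular expression that must match each plugin id in full; that is my understanding of Nutch, not something confirmed here. The helper name `included` and the candidate ids are only illustrative.

```python
import re

# The property value joined into one line, with no whitespace inside it
# (assumption: this is how the value is meant to be read).
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(text|html|js|msword|pdf)|"
    "index-basic|query-(basic|site|url)|summary-basic|scoring-opic|"
    "urlnormalizer-(pass|regex|basic)"
)


def included(plugin_ids, value=PLUGIN_INCLUDES):
    """Return the plugin ids that the value matches in full."""
    pattern = re.compile(value)
    return [pid for pid in plugin_ids if pattern.fullmatch(pid)]


if __name__ == "__main__":
    # Hypothetical list of plugin ids to test against the pattern.
    candidates = ["parse-pdf", "parse-html", "index-basic", "parse-rss"]
    print(included(candidates))
```

If the line breaks from the wrapped value are kept inside the string, ids such as index-basic no longer match in this sketch. Whether Nutch strips that whitespace when it reads the property is something I have not verified.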

 

Without this change to nutch-site.xml the crawl completes fine
(there are PDFs linked from the crawled sites). With the change, however,
the indexing step always fails with the following error:

LinkDb: done
Indexer: starting
Indexer: linkdb: testdir/linkdb
Indexer: adding segment: testdir/segments/20071120114708
Indexer: adding segment: testdir/segments/20071120114717
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:273)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:134)

 

Is there any reason that Nutch wouldn't be able to index these PDF
files?

 

Thanks,

 

Chris
