Fw: PDF support? Does crawl parse PDF files? How do I get it to work?

Diane Palla Wed, 31 Aug 2005 12:39:53 -0700

Does Nutch have a way to parse PDF files, that is, files with the 
"application/pdf" content type?


I noticed a plugin variable setting in default.properties:

plugin.pdf=org.apache.nutch.parse.pdf*

I never changed this file.

Is that the right value?

I am using Nutch 0.7.

What do I have to do to make it parse PDF files?

When I do the crawl, I get this error with application/pdf files:

050831 145126 fetch okay, but can't parse 
<mainurl>/research/126900/126969/126969.pdf, reason: failed(2,203): 
Content-Type not text/html: application/pdf


If it's not possible, in which future version of Nutch do the developers 
expect to support parsing of application/pdf files?


Diane Palla
Web Services Developer
Seton Hall University
973 313-6199
[EMAIL PROTECTED]
