Does Nutch have a way to parse PDF files, that is, files with the "application/pdf" content type?
I noticed a plugin variable setting in default.properties:

    plugin.pdf=org.apache.nutch.parse.pdf*

I never changed this file. Is that the right value? I am using Nutch 0.7. What do I have to do to make Nutch parse PDF files?

When I do the crawl, I get this error with application/pdf files:

    050831 145126 fetch okay, but can't parse <mainurl>/research/126900/126969/126969.pdf, reason: failed(2,203): Content-Type not text/html: application/pdf

If it's not possible, in which future version of Nutch do the developers expect to support the application/pdf type and make PDF parsing available?

Diane Palla
Web Services Developer
Seton Hall University
973 313-6199
[EMAIL PROTECTED]
