Hi, does anybody know what this means exactly:

8. NUTCH-338 - Remove the text parser as an option for parsing PDF files in parse-plugins.xml (Chris A. Mattmann via siren)

In my crawl log file it says:

Error parsing: http://www.site.com/quick%20reference%20guide%202/$FILE/Law_v2.4_02122006.pdf: failed(2,200): org.apache.nutch.parse.ParseException: parser not found for contentType=application/pdf url=http://www.site.com/quick%20reference%20guide%202/$FILE/Law_v2.4_02122006.pdf

This may be a stupid question, but does the Nutch crawler only retrieve and index links, i.e. URLs, and not PDFs? The .pdf extension isn't in the crawl-urlfilter.txt file either. And I can see the application/pdf mapping in the parse-plugins.xml file:

<mimeType name="application/pdf">
  <plugin id="parse-pdf" />
</mimeType>

Thanks,
Paul
