Hi,

Thanks for that. I had completely missed the part where you need to include the plugins in Nutch. I have now updated my nutch-site.xml to include the parse-pdf plugin, as well as the many others I have discovered you need to add! :P
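
For anyone hitting the same "parser not found" error, the nutch-site.xml change looks roughly like the sketch below. It is only a sketch under assumptions, not necessarily the exact configuration used here: the plugin list is illustrative (start from the plugin.includes value in your own nutch-default.xml and add pdf to the parse-(...) group), and the http.content.limit value is an arbitrary example for the case where a PDF is larger than the content limit.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumption: illustrative plugin list. Copy the default value from your
       own nutch-default.xml and add "pdf" to the parse-(...) alternatives. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
  </property>

  <!-- Assumption: example value in bytes (10 MB). Content longer than this
       limit is truncated, which can also make PDF parsing fail. -->
  <property>
    <name>http.content.limit</name>
    <value>10485760</value>
  </property>
</configuration>
```

Settings in nutch-site.xml override those in nutch-default.xml, so only the properties you want to change need to appear there.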

Paul

On 4/6/07, rubdabadub <[EMAIL PROTECTED]> wrote:
Could be ..

1. The parse-pdf plugin is not enabled in nutch-site.xml .. you need to enable it..
2. The PDF file is over the content limit .. you need to increase the content limit value in nutch-site.xml.
3. Something else that I don't know..

Regards

On 4/6/07, Paul Liddelow <[EMAIL PROTECTED]> wrote:
> Hi
>
> Does anybody know what this means exactly:
>
> 8. NUTCH-338 - Remove the text parser as an option for parsing PDF files
> in parse-plugins.xml (Chris A. Mattmann via siren)
>
> In my crawl log file it says:
>
> Error parsing: http://www.site.com/quick%20reference%20guide%202/$FILE/Law_v2.4_02122006.pdf:
> failed(2,200): org.apache.nutch.parse.ParseException: parser not found
> for contentType=application/pdf
> url=http://www.site.com/quick%20reference%20guide%202/$FILE/Law_v2.4_02122006.pdf
>
> This may be a stupid question, but does the Nutch crawler only retrieve
> and index links, i.e. URLs, and not PDFs? The .pdf isn't in the
> crawl-urlfilter.txt file either. And I can see it in the
> parse-plugins.xml file:
>
> <mimeType name="application/pdf">
>   <plugin id="parse-pdf" />
> </mimeType>
>
> Thanks
> Paul
