It could be one of the following:

1. The parse-pdf plugin is not enabled in nutch-site.xml. You
need to enable it.
2. The PDF file is over the content limit. You need to increase the
content limit value in nutch-site.xml.
3. Something else that I don't know about.
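
For the first two points, a rough sketch of what the overrides in
nutch-site.xml might look like. The plugin list below is only an
assumption for illustration: copy the plugin.includes value from your own
nutch-default.xml and add pdf to the parse-(...) group rather than
pasting this one. The limit shown is likewise just an example value.

```xml
<!-- Goes inside the <configuration> element of conf/nutch-site.xml -->

<!-- Point 1: enable parse-pdf. The list is illustrative; start from the
     plugin.includes value in your nutch-default.xml and add "pdf". -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
</property>

<!-- Point 2: raise the download limit (in bytes) so large PDFs are not
     truncated. A negative value such as -1 disables truncation. -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
```

If the PDFs are fetched over file: URLs rather than HTTP, the matching
property is file.content.limit.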

Regards

On 4/6/07, Paul Liddelow <[EMAIL PROTECTED]> wrote:
Hi

Does anybody know what this means exactly:

8. NUTCH-338 - Remove the text parser as an option for parsing PDF files
    in parse-plugins.xml (Chris A. Mattmann via siren)

In my crawl log file it says:

Error parsing: 
http://www.site.com/quick%20reference%20guide%202/$FILE/Law_v2.4_02122006.pdf:
failed(2,200): org.apache.nutch.parse.ParseException: parser not found
for contentType=application/pdf
url=http://www.site.com/quick%20reference%20guide%202/$FILE/Law_v2.4_02122006.pdf

This may be a stupid question, but does the Nutch crawler only retrieve
and index links, i.e. URLs, and not PDFs? The .pdf extension isn't in the
crawl-urlfilter.txt file either. And I can see it in the
parse-plugins.xml file:

<mimeType name="application/pdf">
    <plugin id="parse-pdf" />
</mimeType>

Thanks
Paul
