IIRC, Nutch (which uses Lucene for document indexing) actually parses each
data type via plugins. So if there is a plugin for PDF parsing (I believe
there is one), Nutch should be able to index the text in your PDF documents.


I have been tasked by my boss with finding out whether Nutch indexes content
in an image inside a PDF document, i.e. whether it runs OCR on the image and
recognizes it as text. In other words, if someone uploads a PDF document to
our site, and that document is just an image saved as PDF, will Nutch find
the text within the image and catalog it as part of that PDF document?

Please ask this type of question on the nutch-user list. nutch-agent is primarily for discussing the behavior of Nutch-based robots.

To answer your question: Nutch can extract plain text from PDFs that contain plain text. PDFs that contain only images (i.e. text as bitmap pictures) cannot be indexed without some sort of OCR. It is possible to integrate OCR into the Nutch workflow, but this is not implemented yet.
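
To make the distinction above concrete, here is a rough sketch of the decision an OCR fallback would have to make. This is not Nutch code and does not use any Nutch API: the function names, the `ocr` callable and the 20-character threshold are all assumptions made up for illustration. It only shows the idea that a PDF with a real text layer is indexed as-is, while an image-only PDF yields nothing unless some OCR step is plugged in.

```python
from typing import Callable, Optional

# Hypothetical threshold: fewer non-whitespace characters than this is
# treated as "no usable text layer". The value is an assumption.
MIN_TEXT_CHARS = 20


def has_text_layer(extracted_text: str, min_chars: int = MIN_TEXT_CHARS) -> bool:
    """Return True if the PDF parser's output looks like real text."""
    visible = "".join(extracted_text.split())  # drop all whitespace
    return len(visible) >= min_chars


def text_for_indexing(
    extracted_text: str,
    ocr: Optional[Callable[[], str]] = None,
) -> str:
    """Pick the text to index for one PDF.

    extracted_text: whatever the normal PDF parser returned.
    ocr: hypothetical callable that runs OCR on the same document;
         None models the current situation (no OCR integrated).
    """
    if has_text_layer(extracted_text):
        return extracted_text
    if ocr is not None:
        return ocr()
    return ""  # image-only PDF and no OCR: nothing to index
```

With `ocr=None`, a scanned PDF simply produces an empty string, which matches the behavior described above: the document may still be listed, but none of the text in the image is searchable.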

