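IIRC, Nutch handles the different document types via parse plugins, and
Lucene (which Nutch uses for document indexing) indexes whatever text they
extract. So if you have a plugin for PDF parsing (I believe there is one),
then you should be able to do what you want, provided the plugin can
extract the text from your PDFs.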
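
For reference, here is a minimal sketch of how parse plugins are usually
enabled. It assumes a Nutch 0.9/1.0-style setup where the PDF plugin is
named parse-pdf and the override goes in conf/nutch-site.xml. The rest of
the plugin list is illustrative and varies by version, so check your own
nutch-default.xml before copying it:

```xml
<!-- conf/nutch-site.xml (assumed location); goes inside <configuration> -->
<property>
  <name>plugin.includes</name>
  <!-- "pdf" added to the parse-(...) group; the other entries are illustrative -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

As far as I know, that plugin only extracts text that is already embedded
in the PDF. It does not run OCR, so a scanned page saved as an image would
need a separate OCR step before there is any text to index.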


On Thu, Feb 26, 2009 at 11:40 AM, Robert Edmiston wrote:

> I have been tasked by my boss with finding out whether Nutch indexes content
> in an image in a PDF document via OCR and then recognizes it as text. In other
> words, if someone uploads a PDF document to our site, and the PDF document
> is an image that was saved as a PDF, will Nutch search the text within the
> image and then catalog the text as part of that PDF document?
> *Does Nutch index the text content of an image-only PDF?*
