Greetings,

IIRC, Lucene (which Nutch uses for document indexing) actually indexes data types via plugins. So if there is a plugin for PDF parsing (I believe there is one), you should be able to do what you want with it.

Cheers,
Bradford

On Thu, Feb 26, 2009 at 11:40 AM, Robert Edmiston <robert.edmis...@gmail.com> wrote:
> I have been tasked by my boss of finding out if Nutch indexes content in an
> image in a pdf document via OCR and then recognize it as text. So in other
> words, if someone uploads a PDF document to our site, and the PDF document
> is of an image that is saved as PDF, will nutch search the text within the
> image and then catalog the text as part of that PDF document?
>
>
> *Does Nutch index content for .PDF image on text format?*
>