Re: Does Nutch index content for .PDF image on text format?

Bradford Stephens Thu, 26 Feb 2009 11:50:35 -0800

Greetings,

IIRC, Nutch (which uses Lucene for document indexing) handles each content
type through a parse plugin. So if you have the plugin for PDF parsing enabled
(I believe there is one), Nutch can index whatever text that plugin extracts
from the PDF.
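
To make the plugin idea concrete, here is a rough sketch in Python. It is not
Nutch or Lucene code: the registry, the function names, and the "text_layer"
field are all made up for illustration. It only shows the general shape, where
a parser is picked by content type and the indexer sees nothing but the text
that parser returns. Under that assumption, a PDF that is just a scanned image
yields no terms unless some separate OCR step supplies the text.

```python
from typing import Callable, Dict, Set, Tuple

# Hypothetical registry: MIME type -> function that extracts plain text.
PARSERS: Dict[str, Callable[[dict], str]] = {}


def register(mime_type: str):
    """Decorator that registers a parser function for one MIME type."""
    def wrap(fn: Callable[[dict], str]) -> Callable[[dict], str]:
        PARSERS[mime_type] = fn
        return fn
    return wrap


@register("application/pdf")
def parse_pdf(doc: dict) -> str:
    # Stand-in for a real PDF parser: it returns only an embedded text layer.
    # A page stored purely as an image has no such layer, so the result is "".
    return doc.get("text_layer", "")


@register("text/plain")
def parse_text(doc: dict) -> str:
    return doc.get("body", "")


def extract_text(mime_type: str, doc: dict) -> str:
    """Dispatch to the registered parser; no parser means nothing to index."""
    parser = PARSERS.get(mime_type)
    if parser is None:
        return ""
    return parser(doc)


def index(docs: Dict[str, Tuple[str, dict]]) -> Dict[str, Set[str]]:
    """Build a tiny inverted index: lowercase term -> set of document ids."""
    inverted: Dict[str, Set[str]] = {}
    for doc_id, (mime_type, doc) in docs.items():
        for term in extract_text(mime_type, doc).lower().split():
            inverted.setdefault(term, set()).add(doc_id)
    return inverted
```

With this sketch, a document registered as ("application/pdf",
{"text_layer": "Hello Nutch"}) becomes searchable under "hello" and "nutch",
while an image-only PDF contributes no terms at all.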


Cheers,
Bradford

On Thu, Feb 26, 2009 at 11:40 AM, Robert Edmiston <robert.edmis...@gmail.com> wrote:

> I have been tasked by my boss with finding out whether Nutch indexes content
> in an image in a PDF document via OCR and then recognizes it as text. In other
> words, if someone uploads a PDF document to our site, and that PDF is just an
> image saved as PDF, will Nutch search the text within the image and then
> catalog the text as part of that PDF document?
>
>
> *Does Nutch index content for .PDF image on text format?*
>
