Re: Does Nutch index content for .PDF image on text format?

Andrzej Bialecki Fri, 27 Feb 2009 09:08:59 -0800

Bradford Stephens wrote:

Greetings,


IIRC, Lucene (which Nutch uses for document indexing) actually indexes data
types via plugins. So if you have a plugin for PDF parsing (I believe there
is one), then you would be able to do what you wish for it.

Cheers,
Bradford

On Thu, Feb 26, 2009 at 11:40 AM, Robert Edmiston <robert.edmis...@gmail.com> wrote:

I have been tasked by my boss with finding out whether Nutch indexes content
in an image in a PDF document via OCR and then recognizes it as text. In other
words, if someone uploads a PDF document to our site, and the PDF document
is an image that was saved as PDF, will Nutch search the text within the
image and then catalog the text as part of that PDF document?

Please ask this type of question on the nutch-user list. nutch-agent is primarily for discussing the behavior of Nutch-based robots.

To answer your question: Nutch can extract plain text from PDFs that contain plain text. PDFs that contain only images (i.e. text as bitmap pictures) cannot be indexed without some sort of OCR. It's possible to integrate OCR into the Nutch workflow, but this is not yet implemented.
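
As a rough illustration of the distinction above, here is a minimal Python sketch of how a pipeline might decide whether a PDF needs OCR. It is not Nutch code and uses no Nutch API: the function names, the `ocr` callback, and the 20-character threshold are all assumptions made up for this example. The idea is only that a text-based PDF yields usable text from the normal parser, while an image-only PDF yields little or nothing and would have to be routed to an OCR step if one were plugged in.

```python
from typing import Callable, Optional


def has_usable_text(extracted: str, min_chars: int = 20) -> bool:
    """Return True if the parser output holds at least min_chars
    non-whitespace characters (the threshold is an arbitrary assumption)."""
    return sum(1 for ch in extracted if not ch.isspace()) >= min_chars


def text_for_indexing(
    extracted: str,
    ocr: Optional[Callable[[], str]] = None,
    min_chars: int = 20,
) -> str:
    """Pick the text to hand to the indexer.

    extracted: text the regular PDF parser produced (hypothetical input).
    ocr: optional zero-argument callback standing in for an OCR step;
         it is only called when the parser output looks empty.
    """
    if has_usable_text(extracted, min_chars):
        return extracted      # text-based PDF: index as usual
    if ocr is not None:
        return ocr()          # image-only PDF: fall back to OCR
    return ""                 # no OCR available: nothing to index
```

With no `ocr` callback supplied, an image-only PDF produces an empty string, which matches the current situation described above: the document is fetched but its scanned text is not searchable.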


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
