Hi all,

I would like to suggest an API for extracting text (including highlighted or
annotated text) from PDF files: iText (http://www.lowagie.com/iText/).
It is a Java API (with a C# port), and it helped me a lot when we worked
with unusual PDF files.

Solr uses Tika (http://lucene.apache.org/tika) to extract text from
documents, and Tika uses PDFBox (http://incubator.apache.org/pdfbox/)
to extract text from PDF files. PDFBox is a great tool for ordinary PDF files,
but it has (or at least had) some characteristics that I was not satisfied with:

- it consumed more memory than iText, and it couldn't
read files above a certain size (the limit was large, about 1 GB, but we
had even larger files)

- it couldn't correctly handle conditional (soft) hyphens at the end of
a line

- it had poorer documentation than iText, and its API was also
poorer (at that time Manning had published the book iText in Action).
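
To illustrate the hyphen problem: a post-processing step on the extracted text might look roughly like the sketch below. It assumes the extractor emits the conditional hyphen as the Unicode soft hyphen (U+00AD); PDFs that encode it as an ordinary hyphen-minus are not covered. The function name join_soft_hyphens is made up for this example and is not part of iText or PDFBox.

```python
import re

SOFT_HYPHEN = "\u00ad"  # assumption: the extractor emits U+00AD for conditional hyphens

# A soft hyphen followed by a line break marks a word split across two lines.
_BROKEN_WORD = re.compile(SOFT_HYPHEN + r"[ \t]*\r?\n[ \t]*")


def join_soft_hyphens(text: str) -> str:
    """Rejoin words split at a soft hyphen, then drop any remaining soft hyphens."""
    joined = _BROKEN_WORD.sub("", text)
    # Soft hyphens inside a line are invisible hints and only disturb indexing.
    return joined.replace(SOFT_HYPHEN, "")
```

Ordinary hyphens and ordinary line breaks are left untouched, so a compound word that ends a line is not glued to the next line.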

Our PDF files were double-layered (original hi-res image + OCR-ed text)
documents several thousand pages long (Hungarian scientific journals,
the diary of the Houses of Parliament from the 19th century, etc.). We indexed
the content with Lucene, and in the UI we showed one page per screen,
so the user didn't need to download the full PDF. We also extracted the
table of contents from the PDF and implemented it in the web UI,
so the user could browse pages according to the full file's TOC.
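
The page-to-TOC mapping can be sketched as follows. This is not our original code, only a minimal illustration that assumes the TOC has already been extracted as (first page, title) pairs sorted by page; the sample entries and the name section_for_page are hypothetical.

```python
from bisect import bisect_right

# Hypothetical TOC entries: (first page, title), sorted by first page.
TOC = [(1, "Section A"), (5, "Section B"), (42, "Section C")]


def section_for_page(toc, page):
    """Return the title of the TOC entry that contains `page`, or None."""
    starts = [start for start, _ in toc]
    # Index of the last entry whose first page is <= page.
    i = bisect_right(starts, page) - 1
    if i < 0:
        return None  # page lies before the first TOC entry
    return toc[i][1]
```

With a lookup like this, the UI can show which TOC entry the currently displayed page belongs to without loading the whole PDF.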

This project took place two years ago, so it is possible that a lot of things
have changed since then.

Király Péter
http://eXtensibleCatalog.org

----- Original Message -----
From: "Mark A. Matienzo" <m...@matienzo.org>
To: <CODE4LIB@LISTSERV.ND.EDU>
Sent: Tuesday, September 15, 2009 3:56 PM
Subject: Re: [CODE4LIB] indexing pdf files


Eric,

 5. Use pdftotext to extract the OCRed text
   from the PDF and index it along with
   the MyLibrary metadata using Solr. [3, 4]


Have you considered using Solr's ExtractingRequestHandler [1] for the
PDFs? We're using it at NYPL with pretty great success.

[1] http://wiki.apache.org/solr/ExtractingRequestHandler
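
For illustration, posting a PDF to the handler might look roughly like the sketch below. It assumes the handler is mapped to /update/extract and that the schema's unique key field is named "id" (hence the literal.id parameter), as in the wiki examples; the base URL and the function name build_extract_request are placeholders for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(solr_base, doc_id, pdf_bytes, commit=True):
    """Build (but do not send) a POST to Solr's ExtractingRequestHandler."""
    # literal.id sets the unique key of the document created from the PDF.
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    # Assumption: the handler is registered at /update/extract.
    url = solr_base.rstrip("/") + "/update/extract?" + urlencode(params)
    # The raw PDF bytes are sent as the request body.
    return Request(
        url,
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )
```

To actually send it, pass the result to urllib.request.urlopen(); Solr then runs the file through Tika and indexes the extracted text.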

Mark A. Matienzo
Applications Developer, Digital Experience Group
The New York Public Library
