Re: [CODE4LIB] indexing pdf files

Erik Hatcher Tue, 15 Sep 2009 11:14:24 -0700

Here's a post on how easy it is to send PDF documents to Solr from Java:

<http://www.lucidimagination.com/blog/2009/09/14/posting-rich-documents-to-apache-solr-using-solrj-and-solr-cell-apache-tika/>

Not only can you post PDF (and other rich content) files to Solr for indexing, you can also, as shown in that blog entry, extract the text from such files and have it returned to the client. This Solr capability makes the tool chain a bit simpler.
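
As a rough illustration of the two modes mentioned above (index the file, or only extract its text and return it), here is a minimal sketch that builds the request URL for Solr's extracting handler. It assumes the handler is mapped at /update/extract, as in the Solr example configuration; the base URL and document id are hypothetical, and sending the PDF itself as the POST body is left out.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class ExtractOnlyUrlSketch {

    // Builds the URL a client would POST a PDF to.
    // baseUrl is a hypothetical Solr location; "/update/extract" is assumed
    // to be where the extracting handler is registered in solrconfig.xml.
    static String buildExtractUrl(String baseUrl, String docId, boolean extractOnly) {
        Map<String, String> params = new LinkedHashMap<>();
        if (extractOnly) {
            // Return the extracted text to the client without indexing it.
            params.put("extractOnly", "true");
        } else {
            // Index the document: supply its id as a literal field and commit.
            params.put("literal.id", docId);
            params.put("commit", "true");
        }
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return baseUrl + "/update/extract?" + query;
    }

    public static void main(String[] args) {
        String base = "http://localhost:8983/solr"; // hypothetical
        System.out.println(buildExtractUrl(base, "doc1", true));
        System.out.println(buildExtractUrl(base, "doc1", false));
    }
}
```

The blog post linked above does the same thing through SolrJ rather than hand-built URLs, which is probably the more comfortable route from Java.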


        Erik


On Sep 15, 2009, at 10:31 AM, Peter Kiraly wrote:

Hi all,
I would like to suggest an API for extracting text (including highlighted or
annotated text) from PDF: iText (http://www.lowagie.com/iText/).
This is a Java API (with a C# port), and it helped me a lot when we worked
with extraordinary PDF files.
Solr uses Tika (http://lucene.apache.org/tika) for extracting text from
documents, and Tika uses PDFBox (http://incubator.apache.org/pdfbox/)
to extract text from PDF files. It is a great tool for normal PDF files,
but it has (or at least had) some features that I wasn't satisfied with:
- it consumed more memory than iText, and couldn't
read files above a given size (this was large, about 1 GB, but we
had even larger files)
- it couldn't correctly handle conditional hyphens at the end of
the line
- it had poorer documentation than iText, and its API was also
poorer (around that time Manning published the iText in Action book).
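
To make the conditional hyphen problem concrete, here is a minimal post-processing sketch. It is not what iText or PDFBox do internally; it assumes the extractor passes the conditional (soft) hyphen through as the character U+00AD, which will not be true for every PDF. The class and method names are made up for illustration.

```java
class SoftHyphenSketch {

    // Joins words that were split at a conditional (soft) hyphen, U+00AD.
    // Assumption: the extractor emits U+00AD rather than an ordinary "-".
    static String joinConditionalHyphens(String text) {
        // A soft hyphen followed by a line break (and any indentation on the
        // next line) marks a word split across lines: remove all of it.
        String joined = text.replaceAll("\u00AD\\r?\\n[ \\t]*", "");
        // Any remaining soft hyphens are only invisible break hints; drop them.
        return joined.replace("\u00AD", "");
    }

    public static void main(String[] args) {
        String extracted = "the diary of the Houses of Parlia\u00AD\nment";
        System.out.println(joinConditionalHyphens(extracted));
    }
}
```

Ordinary hyphens at a line end (as in "well-known" broken after the hyphen) are deliberately left alone, since removing them blindly would damage real compound words.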
Our PDF files were double layered (original hi-res image + OCR-ed text),
documents several thousand pages long (Hungarian scientific journals,
the diary of the Houses of Parliament from the 19th century, etc.). We indexed
the content with Lucene, and in the UI we showed one page per screen,
so the user didn't need to download the full PDF. We extracted the
table of contents from the PDF as well, and we implemented it in the web UI,
so the user can browse pages according to the full file's TOC.
This project happened two years ago, so it is possible that lots of things
have changed since that time.
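
The TOC-driven page browsing described above can be sketched roughly as follows. This is not the project's actual code, only one way to map a page number back to its TOC entry once the start page of each entry has been pulled out of the PDF; the class name, method names, and sample titles are hypothetical, and the Lucene side is left out entirely.

```java
import java.util.Map;
import java.util.TreeMap;

class TocLookupSketch {

    // Start page of each TOC entry -> entry title, kept sorted by page.
    private final TreeMap<Integer, String> startPageToTitle = new TreeMap<>();

    void addEntry(int startPage, String title) {
        startPageToTitle.put(startPage, title);
    }

    // Title of the TOC entry that contains the given page, or null if the
    // page lies before the first entry.
    String sectionForPage(int page) {
        Map.Entry<Integer, String> entry = startPageToTitle.floorEntry(page);
        return entry == null ? null : entry.getValue();
    }

    // First page of the TOC entry that follows the one containing the given
    // page, or null if there is none ("jump to next section" in a page viewer).
    Integer nextSectionStart(int page) {
        return startPageToTitle.higherKey(page);
    }

    public static void main(String[] args) {
        TocLookupSketch toc = new TocLookupSketch();
        toc.addEntry(1, "Front matter");      // hypothetical sample entries
        toc.addEntry(5, "First session");
        toc.addEntry(42, "Second session");
        System.out.println(toc.sectionForPage(41));
        System.out.println(toc.nextSectionStart(41));
    }
}
```

With one indexed document per page, a search hit carries a page number, and a lookup like this tells the UI which TOC entry to highlight for that page.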

Király Péter
http://eXtensibleCatalog.org
----- Original Message -----
From: "Mark A. Matienzo" <[email protected]>
To: <[email protected]>
Sent: Tuesday, September 15, 2009 3:56 PM
Subject: Re: [CODE4LIB] indexing pdf files
Eric,

5. Use pdftotext to extract the OCRed text
  from the PDF and index it along with
  the MyLibrary metadata using Solr. [3, 4]

Have you considered using Solr's ExtractingRequestHandler [1] for the
PDFs? We're using it at NYPL with pretty great success.

[1] http://wiki.apache.org/solr/ExtractingRequestHandler

Mark A. Matienzo
Applications Developer, Digital Experience Group
The New York Public Library
