Here's a post on how easy it is to send PDF documents to Solr from Java:
<http://www.lucidimagination.com/blog/2009/09/14/posting-rich-documents-to-apache-solr-using-solrj-and-solr-cell-apache-tika/>
Not only can you post PDF (and other rich content) files to Solr for
indexing; as that blog entry shows, you can also extract the text
from such files and have it returned to the client. This Solr
capability makes the tool chain a bit simpler.
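As a rough illustration of the two modes described above (index the
file, or only extract its text), here is a minimal sketch. It does not
use SolrJ as the blog post does; it posts the raw PDF over plain HTTP
with the JDK's built-in client instead. The base URL, the document id
and the class and method names are assumptions made for the example;
the /update/extract path and the extractOnly, literal.id and commit
parameters are those of Solr's ExtractingRequestHandler.

```java
import java.io.FileNotFoundException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

final class SolrExtractSketch {

    // Build the URL for the ExtractingRequestHandler.
    // extractOnly=true asks Solr to return the extracted text without indexing;
    // otherwise the document is indexed under the given id and committed.
    static URI extractUri(String solrBase, String docId, boolean extractOnly) {
        String query = extractOnly
                ? "extractOnly=true"
                : "literal.id=" + URLEncoder.encode(docId, StandardCharsets.UTF_8)
                        + "&commit=true";
        return URI.create(solrBase + "/update/extract?" + query);
    }

    // Wrap the PDF file as the raw body of a POST request.
    static HttpRequest extractRequest(URI uri, Path pdf) throws FileNotFoundException {
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/pdf")
                .POST(HttpRequest.BodyPublishers.ofFile(pdf))
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.out.println("usage: java SolrExtractSketch.java <file.pdf>");
            return;
        }
        // Hypothetical local Solr instance; adjust to your installation.
        URI uri = extractUri("http://localhost:8983/solr", "doc1", true);
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(extractRequest(uri, Path.of(args[0])),
                        HttpResponse.BodyHandlers.ofString());
        // With extractOnly=true the response body carries the extracted text.
        System.out.println(response.body());
    }
}
```

Passing false as the last argument of extractUri switches from
extract-only to indexing the file under the given id.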
Erik
On Sep 15, 2009, at 10:31 AM, Peter Kiraly wrote:
Hi all,
I would like to suggest an API for extracting text (including
highlighted or annotated text) from PDF: iText
(http://www.lowagie.com/iText/). It is a Java API (with a C# port),
and it helped me a lot when we worked with extraordinary PDF files.
Solr uses Tika (http://lucene.apache.org/tika) for extracting text
from documents, and Tika uses PDFBox (http://incubator.apache.org/pdfbox/)
to extract from PDF files. It is a great tool for normal PDF files,
but it has (or at least had) some characteristics I was not satisfied
with:
- it consumed more memory than iText, and couldn't read files above
  a given size (the limit was large, about 1 GB, but we had even
  larger files)
- it couldn't correctly handle the conditional hyphens at the end of
  a line
- it had poorer documentation than iText, and its API was also
  poorer (around that time Manning published the iText in Action
  book).
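To make the conditional-hyphen point concrete: if an extractor leaves
such hyphens in its output, words broken across lines are indexed as
two fragments. The sketch below is a post-processing step on extracted
text, not something iText or PDFBox is claimed to do internally. It
assumes the extractor emits the conditional hyphen as the Unicode soft
hyphen (U+00AD), which may not hold for every tool; the class and
method names are made up for the example.

```java
final class SoftHyphenSketch {

    // Rejoin words that were split at a soft hyphen (U+00AD) before a line
    // break, then drop any remaining soft hyphens, which are only hints.
    // Ordinary hyphens ('-') are left untouched.
    static String joinSoftHyphenBreaks(String text) {
        String joined = text.replaceAll("\u00AD[ \\t]*\\r?\\n[ \\t]*", "");
        return joined.replace("\u00AD", "");
    }

    public static void main(String[] args) {
        String extracted = "parlia\u00AD\nment diary";
        System.out.println(joinSoftHyphenBreaks(extracted));
    }
}
```

A real hyphen at the end of a line is ambiguous (it may or may not
belong to the word), so this sketch deliberately leaves it alone.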
Our PDF files were double-layered (original hi-res image + OCR-ed
text) documents several thousand pages long (Hungarian scientific
journals, the diary of the Houses of Parliament from the 19th century,
etc.). We indexed the content with Lucene, and in the UI we showed one
page per screen, so the user didn't need to download the full PDF. We
also extracted the table of contents from the PDF and implemented it
in the web UI, so the user could browse pages according to the full
file's TOC.
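One way to connect a page-per-screen view to a table of contents is
to keep the TOC as a sorted map from each section's first page to its
title, and look up the section for whatever page is being shown. This
is only a sketch of that idea, not the code from the project; the
class name, method names and sample titles are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

final class TocSketch {

    // First page of each TOC entry mapped to its title, kept in page order.
    private final TreeMap<Integer, String> startPageToTitle = new TreeMap<>();

    void addEntry(int startPage, String title) {
        startPageToTitle.put(startPage, title);
    }

    // The section containing a page is the entry with the greatest
    // start page that is less than or equal to that page.
    String sectionForPage(int page) {
        Map.Entry<Integer, String> entry = startPageToTitle.floorEntry(page);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        TocSketch toc = new TocSketch();
        toc.addEntry(1, "Section A");   // placeholder titles
        toc.addEntry(5, "Section B");
        System.out.println(toc.sectionForPage(7));
    }
}
```

Storing the section title alongside each per-page index record would
be another option; the map keeps the TOC in one place instead.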
This project took place two years ago, so it is possible that lots
of things have changed since then.
Király Péter
http://eXtensibleCatalog.org
----- Original Message ----- From: "Mark A. Matienzo" <m...@matienzo.org>
To: <CODE4LIB@LISTSERV.ND.EDU>
Sent: Tuesday, September 15, 2009 3:56 PM
Subject: Re: [CODE4LIB] indexing pdf files
Eric,
5. Use pdftotext to extract the OCRed text
from the PDF and index it along with
the MyLibrary metadata using Solr. [3, 4]
Have you considered using Solr's ExtractingRequestHandler [1] for the
PDFs? We're using it at NYPL with pretty great success.
[1] http://wiki.apache.org/solr/ExtractingRequestHandler
Mark A. Matienzo
Applications Developer, Digital Experience Group
The New York Public Library