Re: [CODE4LIB] PDF->text extraction

Demian Katz Tue, 21 Jun 2011 07:58:05 -0700

Have you tried Aperture (http://aperture.sourceforge.net/)?  It's a Java 
library for extracting content from various document formats including PDF.  It 
comes with command-line scripts that allow you to use it as a stand-alone 
utility.  If performance is your main concern, this may not be the best option, 
since it's a heavier-duty tool than a simple PDF-only text extractor.  But if 
you want to expand the number of formats you support, it's worth a look.


- Demian

> -----Original Message-----
> From: Code for Libraries [mailto:[email protected]] On Behalf Of
> Owen Stephens
> Sent: Tuesday, June 21, 2011 10:24 AM
> To: [email protected]
> Subject: [CODE4LIB] PDF->text extraction
> 
> The CORE project at The Open University in the UK is doing some work on
> finding similarity between papers in institutional repositories (see
> http://core-project.kmi.open.ac.uk/ for more info).  The first step in
> the process is extracting text from the (mainly) PDF documents
> harvested from repositories.
> 
> We've tried iText but had issues with quality.
> We moved to PDFBox but are having performance issues.
> 
> Any other suggestions/experience?
> 
> Thanks,
> 
> Owen
> 
> Owen Stephens
> Owen Stephens Consulting
> Web: http://www.ostephens.com
> Email: [email protected]
> Telephone: 0121 288 6936
