[CODE4LIB] PDF->text extraction

Owen Stephens Tue, 21 Jun 2011 07:28:57 -0700

The CORE project at The Open University in the UK is doing some work on finding 
similarity between papers in institutional repositories (see 
http://core-project.kmi.open.ac.uk/ for more info).  The first step in the 
process is extracting text from the (mainly) PDF documents harvested from 
repositories.


We've tried iText but had issues with quality.
We moved to PDFBox but are having performance issues.

Any other suggestions/experience?

Thanks,

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: [email protected]
Telephone: 0121 288 6936
