[Dspace-tech] PDF text extraction

Eric Luhrs Tue, 17 Feb 2009 15:32:42 -0800

I just created a collection of 72 PDFs, mostly from scanned image files, but
with several born-digital files too.  I was disappointed to learn that
PDFBox was unable to extract text from the scanned documents even though they
contain searchable text.  The files were created using a third-party OCR tool,
and I am able to copy and paste the text using Acrobat.


I understand that DSpace is limited by what PDFBox is able to process, so my
question is: are there any guidelines for PDF creation to help ensure that
PDFBox can read the files?  For instance, maybe it only understands certain
versions of the PDF format, or certain types of compression.

Any suggestions?  I figured I'd try here before contacting the PDFBox
community.

Eric Luhrs
Lafayette College
