It took some digging but this issue has been resolved.  I am reporting back
to this list because a few people have expressed interest.

At Larry Stone's suggestion, I verified that pdftotext (part of xpdf) was
able to extract text from my scanned PDF.  I also re-OCRed the PDFs using
Acrobat 8 Pro, and found that media-filter was able to extract the text with
no problem.  Realizing that the problem was with my OCR app (JRA Publish,
which is really great for creating batches of super-small PDF and DjVu files
with highly accurate OCR), I contacted the lead developer and learned that a
similar issue had been reported earlier.  It turns out that PDFbox was
looking for the "ToUnicode" flag in the OCRed text, and failing when it was
not found.  My copy of JRA Publish was a few years old, but the new version
included the flag that PDFbox needed to extract text from my files.
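
For anyone who wants to run the same two checks over a batch of files, here
is a rough Python sketch. It is only an illustration under stated
assumptions, not what DSpace or PDFbox do internally. The helper names
(mentions_tounicode, has_usable_text, pdftotext_output) are hypothetical, the
20-character threshold is arbitrary, and the "/ToUnicode" byte scan is a crude
heuristic that can miss the entry when font dictionaries sit inside
compressed object streams. It assumes pdftotext is on the PATH.

```python
import shutil
import subprocess
import sys


def mentions_tounicode(pdf_bytes: bytes) -> bool:
    """Crude check: does the raw file contain a /ToUnicode key anywhere?

    Assumption: font dictionaries are stored uncompressed. If they live in
    compressed object streams, this returns False even for a good file.
    """
    return b"/ToUnicode" in pdf_bytes


def has_usable_text(text: str, min_chars: int = 20) -> bool:
    """True if the text has at least min_chars non-whitespace characters.

    The threshold is an arbitrary illustration value, not a standard.
    """
    return sum(1 for ch in text if not ch.isspace()) >= min_chars


def pdftotext_output(path: str) -> str:
    """Run pdftotext with "-" as the output file (stdout) and return the text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        ["pdftotext", path, "-"], capture_output=True, check=True
    )
    # Output encoding varies between builds, so decode leniently.
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Usage: python check_pdfs.py file1.pdf file2.pdf ...
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            flagged = mentions_tounicode(fh.read())
        usable = has_usable_text(pdftotext_output(path))
        print(f"{path}: ToUnicode={flagged} pdftotext_ok={usable}")
```

A file that pdftotext can read but that shows ToUnicode=False would be a
candidate for the problem described above, although the byte scan alone is
not conclusive.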

Along the way, I also found a helpful document in Michigan's DeepBlue
repository that provides some best practices for scanned and born-digital
PDFs.  Anyone interested in creating better PDFs should take a look here:

http://deepblue.lib.umich.edu/handle/2027.42/58005

Eric Luhrs
Lafayette College

On Tue, Feb 17, 2009 at 5:05 PM, Eric Luhrs <elu...@gmail.com> wrote:

> I just created a collection of 72 PDFs, mostly from scanned image files,
> but with several born digital files too.  I was disappointed to learn that
> PDFbox was unable to process the scanned documents even though they contain
> searchable text.  The files were created using a third-party OCR tool, but I
> am able to copy and paste the text using Acrobat.
>
> I understand that DSpace is limited by what PDFbox is able to process, so
> my question is, are there any guidelines for PDF creation to help ensure that
> PDFbox can read them?  For instance, maybe it only understands certain
> versions of the PDF language, or certain types of compression.
>
> Any suggestions?  I figured I'd try here before contacting the PDFbox
> community.
>
> Eric Luhrs
> Lafayette College
>
_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
