Re: [Dspace-tech] PDF text extraction

2009-02-23 Thread Eric Luhrs
It took some digging, but this issue has been resolved. I am reporting back to this list because a few people have expressed interest. At Larry Stone's suggestion, I verified that pdftotext (part of xpdf) was able to extract text from my scanned PDF. I also re-OCRed the PDFs using Acrobat 8 Pro,

[Dspace-tech] PDF text extraction

2009-02-17 Thread Eric Luhrs
I just created a collection of 72 PDFs, mostly from scanned image files, but with several born-digital files too. I was disappointed to learn that PDFBox was unable to process the scanned documents even though they contain searchable text. The files were created using a third-party OCR tool, but