Re: [CODE4LIB] OCR PDFs

James Tuttle Fri, 17 Oct 2008 15:03:45 -0700

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Yes, I've tried tesseract and found it to be pretty accurate, but I
don't believe there is a way to integrate the text back into the PDF.
It's easy to pull text out of image-based PDFs, but not to put the text
back in.  Driving me crazy...
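
For reference, the "pull text out" half can be sketched roughly as below.
This is only an illustration under some assumptions, not a tested
workflow: it assumes Poppler's pdftoppm and a current tesseract build
(one that reads PNG input) are both on the PATH, and the helper names
rasterize_cmd, ocr_cmd and pdf_to_text are made up for the example. It
does nothing about getting the text back into the PDF.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def rasterize_cmd(pdf_path, out_prefix, dpi=300):
    """Command to render each PDF page to a PNG with Poppler's pdftoppm."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def ocr_cmd(image_path, out_base):
    """Command to OCR one image; tesseract writes the text to <out_base>.txt."""
    return ["tesseract", str(image_path), str(out_base)]


def page_number(image_path):
    """Page number from a pdftoppm output name such as page-07.png."""
    return int(Path(image_path).stem.rsplit("-", 1)[1])


def pdf_to_text(pdf_path, dpi=300):
    """Return the OCR'd text of an image-based PDF, one string per page."""
    # Assumption: both external tools are installed; fail early if not.
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    pages = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(rasterize_cmd(pdf_path, prefix, dpi), check=True)
        # Sort numerically so page 10 does not land before page 2.
        for image in sorted(Path(tmp).glob("page-*.png"), key=page_number):
            out_base = image.with_suffix("")
            subprocess.run(ocr_cmd(image, out_base), check=True)
            text_file = out_base.with_suffix(".txt")
            pages.append(text_file.read_text(encoding="utf-8"))
    return pages
```

Calling pdf_to_text("scan.pdf") (a placeholder file name) would return a
list with one string of recognized text per page.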


Thanks for the tips,
James

Bridger Dyson-Smith wrote:
> If you haven't already, take a look at tesseract (
> http://code.google.com/p/tesseract-ocr/). There's some discussion of using
> tesseract and shell scripting to work with tiffs to pdfs to ocr'd text,
> which isn't exactly what you're wanting to do, I know, but may prove helpful
> (http://www.groklaw.net/articlebasic.php?story=20061210115516438).
> Cheers!
> Bridger Dyson-Smith
> 
> 
> On Fri, Oct 17, 2008 at 8:28 AM, Terry Harrison <[EMAIL PROTECTED]> wrote:
> 
>> You might want to look at ABBYY FineReader 9.0 Professional, which can be
>> driven from the command line.  FineReader is used at the Library of
>> Congress.  Here is an info link to get you started (search "command"):
>>
>>
>> http://www.scanstore.com/Scanning/Document_Imaging/Software/OCR_Software/Nuance/omnipage_review.asp
>>
>> Regards,
>> Terry
>>
>> ------------------------------------
>> Terry Harrison
>> Project Manager
>> CACI
>> 5505 Robin Hood Road, Suite F
>> Norfolk, Va. 23508
>> Ph: 757.321.9120 x232
>> Fax: 757.321.8797
>> [EMAIL PROTECTED]
>>

- --
- -------------------------------
James Tuttle
Digital Repository Librarian

NCSU Libraries, Box 7111
North Carolina State University
Raleigh, NC 27695-7111
[EMAIL PROTECTED]

(919)513-0651 Phone
(919)515-3031  Fax

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFI+QuEKxpLzx+LOWMRAhSyAJ9+lQ/1J5SP/23XQrVrlsoNRZyKxQCfYTGw
qUBK6A9mkiLy88buUz7Wngg=
=DyZk
-----END PGP SIGNATURE-----
