Out of the box, it's hard to top ABBYY, but Tesseract is well worth
investigating, especially if you are dealing with a large quantity of
consistent images. The Tesseract community has created a very useful wiki [1],
with especially good advice on improving the quality of images that need to be
OCRed [2], and there is some new neural-network-based (LSTM) plumbing that has
great potential [3]. Tesseract also lets you do your own font training. I work
with a non-profit called OurDigitalWorld that needed Inuktitut support for a
publication called "Inuit Today", and we were able to create the supporting
files to do the processing. You can use the same approach for special symbols
in text (musical notation, etc.). If you combine Tesseract with other open
source tools like ImageMagick (to prep images), Olena (to segment column-heavy
media like newspapers), and Hadoop (if you are working with thousands or
millions of pages), it can do a lot of heavy lifting.
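
For anyone who wants to try the ImageMagick + Tesseract combination, here is a
minimal Python sketch of the idea, not a description of any production
pipeline. It assumes the `convert` and `tesseract` binaries are on your PATH;
the function names, the "_clean.tif" suffix, and the choice of prep steps
(grayscale plus deskew) are illustrative assumptions, and the right prep
depends heavily on your scans.

```python
import subprocess
from pathlib import Path


def prep_command(src: Path, dst: Path) -> list[str]:
    """Build an ImageMagick command that grayscales and deskews a scan."""
    return [
        "convert",              # assumption: ImageMagick 7 may expose this as "magick"
        str(src),
        "-colorspace", "Gray",  # drop colour information
        "-deskew", "40%",       # straighten slightly rotated pages
        str(dst),
    ]


def ocr_command(image: Path, out_base: Path, lang: str = "eng") -> list[str]:
    """Build a Tesseract command; Tesseract itself appends ".txt" to out_base."""
    return ["tesseract", str(image), str(out_base), "-l", lang]


def ocr_page(src: Path, workdir: Path, lang: str = "eng") -> Path:
    """Clean one page image, OCR it, and return the path of the text file."""
    workdir.mkdir(parents=True, exist_ok=True)
    cleaned = workdir / (src.stem + "_clean.tif")  # hypothetical naming scheme
    out_base = workdir / src.stem
    subprocess.run(prep_command(src, cleaned), check=True)
    subprocess.run(ocr_command(cleaned, out_base, lang), check=True)
    return Path(str(out_base) + ".txt")
```

Calling `ocr_page(Path("page1.jpg"), Path("out"))` would run both tools on one
page. If you have done your own training, `lang` would be the name of your
traineddata file.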

art
---
1. https://github.com/tesseract-ocr/tesseract/wiki
2. https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
3. https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM

-----Original Message-----
From: Code for Libraries [mailto:[email protected]] On Behalf Of Will 
Martin
Sent: Wednesday, July 19, 2017 1:14 PM
To: [email protected]
Subject: [CODE4LIB] OCR software

All,

What are you all using for OCR software?  How well does it work for you?
Do you find that you need to scan at a particular resolution to get optimal OCR
results, or do you find yourself doing post-processing on the images before 
OCR'ing them?  What have your experiences been like?

In the past, we've just used the built-in OCR in Adobe Acrobat Pro.  But we're 
looking at doing a bunch more digitization than we have before, and I just want 
to take stock of what's out there and see if that's an acceptable solution or 
if there's something else we should consider.

Thanks!

Will Martin

Head of Digital Initiatives, Systems & Services
Chester Fritz Library
University of North Dakota
