Re: [CODE4LIB] web-based ocr

Jason Best Wed, 13 Mar 2013 16:10:50 -0700

Ben beat me to the punch in mentioning the iDigBio hackathon OCR project and 
his own project for handwriting transcription. So I'll add a few other things. 
First, I'll soon be prototyping a RESTful API for OCR using Tesseract so anyone 
who is interested in providing input or contributing code, please ping me. I'll 
be creating this in Python but have not determined what, if any, API framework 
I'll use so if anyone has suggestions about this, please let me know. The 
"short" list that needs to get shorter is Flask, CherryPy, and (on the heavier 
side) various RESTful solutions within Django such as Piston. I'll be starting 
on this when my plate gets a little more clear - hopefully in a month or less.
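
As a rough illustration of the idea (not the planned prototype itself), here 
is a minimal sketch of an OCR web service in Python. It deliberately uses only 
the standard library's wsgiref rather than any of the frameworks on the short 
list above, and it assumes the `tesseract` command-line tool is installed and 
on the PATH. The `/ocr` path, the port, and the names `run_tesseract` and 
`make_app` are made up for this example.

```python
import os
import subprocess
import tempfile
from wsgiref.simple_server import make_server


def run_tesseract(image_bytes):
    """OCR raw image bytes by shelling out to the `tesseract` CLI (assumed installed)."""
    with tempfile.TemporaryDirectory() as workdir:
        image_path = os.path.join(workdir, "input")
        output_base = os.path.join(workdir, "output")
        with open(image_path, "wb") as handle:
            handle.write(image_bytes)
        # `tesseract <image> <outputbase>` writes its text to <outputbase>.txt
        subprocess.run(
            ["tesseract", image_path, output_base],
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        with open(output_base + ".txt", encoding="utf-8") as handle:
            return handle.read()


def _respond(start_response, status, text):
    """Send a plain-text response with the given status line."""
    body = text.encode("utf-8")
    start_response(status, [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def make_app(ocr=run_tesseract):
    """Build a WSGI app; `ocr` is injectable so the app can be tried without Tesseract."""
    def app(environ, start_response):
        if environ.get("PATH_INFO") != "/ocr":  # hypothetical endpoint name
            return _respond(start_response, "404 Not Found", "Try POST /ocr\n")
        if environ.get("REQUEST_METHOD") != "POST":
            return _respond(start_response, "405 Method Not Allowed", "Use POST\n")
        length = int(environ.get("CONTENT_LENGTH") or 0)
        image_bytes = environ["wsgi.input"].read(length)
        if not image_bytes:
            return _respond(start_response, "400 Bad Request", "Empty request body\n")
        try:
            text = ocr(image_bytes)
        except (OSError, subprocess.CalledProcessError):
            # tesseract is missing or could not read the image
            return _respond(start_response, "500 Internal Server Error", "OCR failed\n")
        return _respond(start_response, "200 OK", text)
    return app


if __name__ == "__main__":
    # Development server only; the port is an arbitrary choice.
    make_server("localhost", 8000, make_app()).serve_forever()
```

With the server running, something like 
`curl --data-binary @page.png http://localhost:8000/ocr` should return the 
recognized text as plain text. A real service would also want upload size 
limits, some form of authentication, and a job queue for large batches.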


Michael Giddens has written a simple web service for Tesseract (see 
http://www.silverbiology.com/blog/2011/03/10/amazon-ec2-tesseract-ocr-thank-you/).
You'd have to provide the hardware, but he's provided the code. I have not 
used this myself, but it looks very straightforward.

Lastly, I'd like to plug iDigBio (https://www.idigbio.org) and the Augmenting 
OCR working group 
(https://www.idigbio.org/wiki/index.php/IDigBio_Working_Groups) a bit more. The 
biocollections community is up against this text transcription/OCR bottleneck 
and we are hoping to develop stronger ties with other communities with similar 
problems. This is one reason why we scheduled the first iDigBio hackathon 
during the 2013 iConference here in Fort Worth - so we could try to introduce 
our challenges to the information and library science communities. So I look 
forward to continuing the discussion and hopefully we'll collaborate/converge 
on solutions that have broad impacts.

Jason

On Mar 12, 2013, at 10:00 PM, CODE4LIB automatic digest system wrote:

Date:    Tue, 12 Mar 2013 11:57:06 -0400
From:    Eric Lease Morgan <[email protected]>
Subject: web-based ocr

Does anybody here know of a Web-based OCR program or Web service?

Many people want to do OCR against digitized texts. We all know of various OCR 
applications (Adobe Acrobat, ABBYY FineReader, Google's Tesseract, etc.), but 
they are not necessarily Web-based. As a service to my university, I thought it 
might be cool (or "kewl") to support an image-to-text application. Go to a Web 
form. Submit one or more image files. Have OCR done against them no matter how 
dirty the output. Return plain text. As a bonus, the application would support 
a RESTful API.

Does anybody know of something like this that exists already?


Jason Best
Biodiversity Informatician
Botanical Research Institute of Texas
1700 University Drive
Fort Worth, Texas 76107

817-332-4441 ext. 230
http://www.brit.org
