Open source OCR (was Re: [openhealth] Re: Hi folks..)

Tim Churches Sun, 18 Feb 2007 17:20:16 -0800

Tim Churches wrote:
> Karsten Hilbert wrote:
>> Well, the path of least resistance here is to scan it and
>> use it as a background image in some text editor or other so
>> that what you type appears to be written into the fields
>> while it is (technically) written on top of the background
>> image. We then save the result as any other old document
>> tied into the medical record.
> 
> No, we need the data in computable form for epidemiological (aggregate)
> analysis - images of numbers and characters must be converted to ASCII or
> Unicode bytes. There is a commercial product, Teleform, which does this
> reasonably well - see
> http://www.cardiff.com/products/teleform/index.html - and we may just
> provide an interface which can load data which has been scanned off
> hand-written forms using that, but gee, an open source solution would be
> nice. Suggestions very welcome.


A few months ago Google released Tesseract OCR, an OCR engine developed
in the 1990s by Hewlett-Packard. Apparently it was state-of-the-art in
1995, but that's over a decade ago, and it has not been developed since.
There don't seem to be any other open source OCR engines around that are
being actively developed or which are anything more than demos or
proofs-of-concept. And Teleform seems to have the OCR-from-paper-forms
market almost to itself. I think we'll have to build a batch input
interface that Teleform can be plugged into - I think it exports to XML,
or at the very least CSV files.
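
As a rough illustration of what such a batch input interface might look
like for the CSV case, here is a minimal Python sketch. The column names
(form_id, patient_id, age) are purely hypothetical, since the actual
layout of a Teleform export is not known here; the idea is only that
rows which fail basic checks (for example a digit misread as a letter)
are set aside for manual review rather than loaded.

```python
import csv
import io

# Hypothetical column names; a real export would define its own layout.
REQUIRED_FIELDS = ("form_id", "patient_id", "age")


def load_batch(csv_file):
    """Read scanned-form records from an open CSV file object.

    Returns (records, errors). Each record is a dict of validated values;
    each error is a (line_number, message) tuple for manual review.
    """
    records, errors = [], []
    reader = csv.DictReader(csv_file)
    for row in reader:
        line = reader.line_num  # line of the source file just read
        # Treat absent or blank cells as missing values.
        missing = [f for f in REQUIRED_FIELDS
                   if not (row.get(f) or "").strip()]
        if missing:
            errors.append((line, "missing: " + ", ".join(missing)))
            continue
        try:
            age = int(row["age"].strip())
        except ValueError:
            # Typical OCR slip: letter O read in place of digit 0.
            errors.append((line, "age is not a whole number: %r" % row["age"]))
            continue
        records.append({
            "form_id": row["form_id"].strip(),
            "patient_id": row["patient_id"].strip(),
            "age": age,
        })
    return records, errors


if __name__ == "__main__":
    # Made-up sample data standing in for an exported batch file.
    sample = io.StringIO(
        "form_id,patient_id,age\n"
        "F1,P1,42\n"
        "F2,,30\n"
        "F3,P3,4O\n"
    )
    good, bad = load_batch(sample)
    print(len(good), "records loaded;", len(bad), "rows need review")
```

An XML export could be handled along the same lines, with only the
parsing step swapped out and the validation kept as is.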

But if anyone can suggest an alternative for turning data recorded on
paper forms into data (as opposed to raster image) files, we'd love to
hear of it.

Tim C
