Re: [OT] FOSS OCR solutions?

Yedidyah Bar-David Wed, 20 Apr 2005 11:46:02 -0700

On Wed, Apr 20, 2005 at 01:07:28AM +0300, Amit Aronovitch wrote:
> Hi list,
> 
> I'm looking for a free software tool for solving a specific OCR task 
> (details below).
> I've googled up several projects, but because of time limits I'll not be 
> able to give them a fair try.
> If you have some experience with such tools, I'll be grateful for any 
> tips. Specifically - how hard it would be to use them for solving the 
> specific task at hand.
> 
> Note that since the task is not a "standard" one, I don't expect any 
> tool to give good enough results out-of-the-box.
> Instead, I expect it to be modular, configurable, and scriptable enough 
> for writing a task-customised solution in short time.


Naturally, the only one I know well is the one I wrote myself, hebocr.
I am not sure it fits your needs, but I _am_ sure it's the smallest one
in lines of code. So if you need to change it, it will probably be the
easiest one to modify.

I played a bit with clara and it seems quite good.

The latest news about hebocr is that someone (not me) invested some
time in it, and it now works again (even on cygwin) and has a tutorial.
I still consider it dead, unless someone wants to take over.

> 
> Description:
> -------------------
> We have a large database of scans consisting of numerical data, which 
> need to be accurately converted to text, with minimal human intervention.
> The scans are low quality (~200dpi, some with "dirt" (random dots), some 
> misaligned, some upside-down etc.).

Are they fixed-font?

> However, the data is numbers only, and comes in ordered tables with a 
> fixed number of columns (no border/cell lines at all). Numbers are of 
> fixed width.
> Most of the cells are full + there are certain rules which can be 
> applied (i.e.: if a row (entry) has a number in column A, it must 
> contain data in column B too).

I think no OCR has a language for expressing such rules, but some are
script-friendly enough to make this easy. hebocr has a "log" file in
which it reports almost every important thing (including the position in
pixels of every char), so you can use that instead of the raw text output.
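
To make the scripting idea concrete, here is a rough Python sketch of
placing chars into columns by pixel position and then checking a rule
like yours. The (x, char) records and the column edges are assumptions
made up for illustration; this is not hebocr's real log format, so the
parsing would have to be adapted to whatever the OCR actually writes.

```python
# Hypothetical input: one (x_pixel, char) pair per recognized character
# on a row. NOT hebocr's actual log format; adapt to the real log.

def assign_columns(chars, column_starts):
    """Group recognized characters into columns by their x position.

    column_starts: sorted left edges (in pixels) of each column.
    Returns one string per column ('' if the cell is empty).
    """
    cells = [[] for _ in column_starts]
    for x, ch in sorted(chars):
        # Pick the last column whose left edge is at or before x.
        idx = 0
        for i, start in enumerate(column_starts):
            if x >= start:
                idx = i
        cells[idx].append(ch)
    return ["".join(cell) for cell in cells]


def violates_rule(row, a=0, b=1):
    """True if column a has data but column b is empty."""
    return bool(row[a]) and not row[b]


# Three columns starting at x = 0, 100 and 200 pixels (made-up values).
row = assign_columns([(12, "4"), (20, "2"), (115, "7")], [0, 100, 200])
needs_review = violates_rule(row, a=1, b=2)
```

Rows that break a rule can then be queued for a second pass or for a
human, instead of being trusted as-is.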

> 
> The data must be placed in the right columns: we prefer having a known 
> number of unknown digits (even a few incorrect ones) than 'missing' or 
> 'added' numbers.
> Preferably, we should be able to get a "certainty" measure for each 
> digit.

I think every OCR can give you that. I know mine does.
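
A small Python sketch of how such a measure could be used to keep a
known number of digits while marking the doubtful ones. The
(digit, confidence) pairs, the 0..1 scale and the 0.8 cutoff are all
assumptions for illustration; each OCR reports certainty in its own way.

```python
def mask_uncertain(digits, min_conf=0.8, placeholder="?"):
    """Replace low-confidence digits with a placeholder.

    digits: list of (char, confidence) pairs, confidence assumed in 0..1.
    The output always has the same length as the input, so columns keep
    their fixed width even when some digits are unknown.
    """
    return "".join(d if conf >= min_conf else placeholder
                   for d, conf in digits)


# Made-up example values, not real OCR output.
masked = mask_uncertain([("3", 0.97), ("8", 0.41), ("1", 0.92)])
```

Pages with many placeholders are natural candidates for reprocessing.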

> The solution should have some adjustable configuration parameters, such 
> that we'll be able to reprocess selected 'important' pages (based on 
> initial results) with some human adjustments.

What parameters do you have in mind?

clara has lots of such params, but I find it a bit uncomfortable to use,
because you need the UI to play with them etc. Look at it, though.
hebocr has very few such params. The main way to affect it is the font
you provide (or build, using its output). Of course there are many
things you can also do outside the OCR - e.g. if the scans are gray, you
can change the white/black threshold and get a dramatic change in
the accuracy. A few OCRs also read gray directly, but it's not common.
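
The threshold idea, as a minimal Python sketch. It assumes the scan is
already loaded as rows of gray values, 0 (black) to 255 (white); the
tiny "page" and the threshold values are made up, and loading a real
image is left to whatever tool you prefer.

```python
def binarize(gray, threshold=128):
    """Turn a grayscale page into black/white.

    gray: list of rows, each pixel 0 (black) .. 255 (white).
    Returns rows of 0/1 where 1 means ink (pixel darker than threshold).
    """
    return [[1 if p < threshold else 0 for p in row] for row in gray]


def ink_count(binary):
    """Number of ink pixels, a crude way to compare thresholds."""
    return sum(sum(row) for row in binary)


# Made-up 2x3 "page": light background, gray smudges, dark strokes.
page = [[250, 180, 30],
        [240, 90, 20]]
counts = {t: ink_count(binarize(page, t)) for t in (64, 128, 200)}
```

A low threshold drops faint strokes, a high one turns dirt into ink, so
it is worth trying a few values on sample pages before a full run.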

How large is the database? If it would take only a few hours (maybe
even a day or two) to type it manually, I think no free OCR will do
better. To beat that you need one that needs no training and gives very
good results, and these two requirements are hard to satisfy together.
Numbers are also very fast to type, with very few mistakes, compared to
text. It's a specific niche in the job market - I know because I know
people who worked in this, wanted to move to OCR, and didn't because it
wasn't good enough.

> 
> ----
> Sorry for the somewhat OT issue, but I know of no other forum I expect 
> to answer this in short time.
> (besides, the preferred target platform IS linux :-) ).

There was a long thread about OCR in whatsup around a year ago. You can
also ask there.

> 
>   Amit A.
-- 
Didi


=================================================================
To unsubscribe, send mail to [EMAIL PROTECTED] with
the word "unsubscribe" in the message body, e.g., run the command
echo unsubscribe | mail [EMAIL PROTECTED]
