On Thu, Apr 21, 2005 at 01:15:41AM +0300, Amit Aronovitch wrote:
> Thanks a lot for the reply. I'll try having a look at both tools.
>
> Yedidyah Bar-David wrote:
> >> Description:
> >> -------------------
> >> We have a large database of scans consisting of numerical data, which
> >> need to be accurately converted to text, with minimal human
> >> intervention. The scans are low quality (~200 dpi; some with "dirt"
> >> (random dots), some misaligned, some upside-down, etc.).
> >
> > Are they fixed-font?
>
> Yes. At least the pages I saw (the data came from several sources).
So it'll be easier. hebocr and clara handle only fixed fonts.

> >> However, the data is numbers only, and comes in ordered tables with a
> >> fixed number of columns (no border/cell lines at all). Numbers are of
> >> fixed width. Most of the cells are full, and there are certain rules
> >> which can be applied (e.g., if a row (entry) has a number in column A,
> >> it must contain data in column B too).
> >
> > I think no OCR has a language to express these rules, but some are
> > script-friendly enough to allow this easily. hebocr has a "log" file,
> > in which it reports almost every important thing (including the
> > position, in pixels, of every char), so you can use it instead of the
> > raw text output.
>
> I was thinking of writing some script/program that calls some
> lower-level API (e.g., call some algorithm for finding cell positions,
> then use the above-mentioned rules to improve its suggestion; then, once
> you know where the cells are, and you know each one should contain a
> number with a known number of digits, call the actual digit-recognition
> function for each cell).
>
> Do you think it would be simple enough to write such an algorithm with
> hebocr? (e.g., does it have separately callable "region-finding" and
> "char-recognition" functions?)

It does have specific sets of functions (it's C - no classes, packages,
etc.) for dealing with different tasks. It has a short text file
describing how it works internally (written by me), and, as I said,
someone wrote a much longer tutorial; I don't know whether it is
available online - I'll ask his permission.

> >> The data must be placed in the right columns: we prefer having a
> >> known number of unknown digits (even a few incorrect ones) to
> >> 'missing' or 'added' numbers. Preferably, we should be able to get a
> >> "certainty" measure for each digit.
> >
> > I think every OCR can give you that. I know mine does.
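The glue layer Amit describes (find cells, apply the column rules, then recognize each cell) could be sketched roughly as below. This is a minimal illustration, not hebocr's API: hebocr exposes no such Python interface, so `validate_table` and the data shapes here are hypothetical, and in practice you would drive it by parsing the log file mentioned above.

```python
# Sketch of the rule-checking step: given a grid of detected cell
# contents (None = cell not found), enforce the thread's example rule
# "if a row has a number in column A, it must have data in column B".
# Violations are marked with "?" for re-detection or manual review.
# All names here are illustrative, not part of any real OCR API.

def apply_row_rule(row, col_a, col_b):
    """Return the row with column col_b flagged when the rule is broken."""
    flagged = list(row)
    if row[col_a] is not None and row[col_b] is None:
        flagged[col_b] = "?"  # column A is full but B came back empty
    return flagged

def validate_table(rows, col_a=0, col_b=1):
    return [apply_row_rule(r, col_a, col_b) for r in rows]

rows = [
    ["123", "456"],
    ["789", None],   # violates the rule: A full, B empty
    [None, None],    # fine: A empty, so the rule does not apply
]
print(validate_table(rows))
```

The same loop is where you would also check the "known number of digits per cell" constraint before calling the per-cell recognition function.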
> Well, I saw some commercial tools that just don't have this (probably
> they have it at some lower level but don't export it to the user; with
> closed source, this is a dead end...).
>
> >> The solution should have some adjustable configuration parameters,
> >> so that we'll be able to reprocess selected 'important' pages (based
> >> on initial results) with some human adjustments.
> >
> > What parameters do you have in mind?
> > clara has lots of such params, but I find it a bit uncomfortable to
> > use, because you need the UI to play with them, etc. Look at it,
> > though. hebocr has very few such params; the main way to affect it is
> > the font you provide (or build, using its output). Of course, there
> > are many things you can also do outside the OCR - e.g., if the scans
> > are gray, you can change the white/black threshold and get a dramatic
> > change in accuracy. A few OCRs also read gray directly, but it's not
> > common.
>
> They're b/w. I was thinking of stuff like thresholds for some filters
> and preprocessing algorithms, e.g. despeckle parameters (which you
> could change if a page was noisy), or thresholds for cell/area-detection
> algorithms (if for some reason you see that the algorithm missed some
> cells, you can tweak it).

I don't think FOSS OCRs deal with such things. There are other tools for
this, and you can glue them together with a small script. I personally
scanned in grayscale and only played with the white/black threshold. In
my case there were pages with different blackness, so eventually I wrote
a little script that computes the threshold so that they all have
more-or-less the same coverage (17% in my case).

> Yes - it would probably be uncomfortable to play with (a UI could
> help), but you'd only do this for 'important' pages. It should still be
> faster than typing these pages manually.

> > How large is the database?
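The coverage-equalizing threshold script mentioned above could look roughly like this. It is a sketch only, assuming a flat list of 0-255 gray values; a real script would pull the pixels from an image library, and the original script's details are not in the thread.

```python
# Pick the white/black cutoff so that a fixed fraction of the page
# (e.g. the 17% coverage mentioned in the thread) comes out black.
# Pixels are plain 0-255 grayscale values; a pixel counts as black
# when its value is at or below the returned threshold.

def coverage_threshold(pixels, target=0.17):
    """Return the gray level t for which the fraction of pixels
    at or below t is as close as possible to `target`."""
    n = len(pixels)
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    best_t, best_err, darker = 0, 1.0, 0
    for t in range(256):
        darker += hist[t]
        err = abs(darker / n - target)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

# Toy page: 17 dark pixels out of 100, so the threshold lands on the
# boundary between the dark and light populations.
page = [10] * 17 + [200] * 83
print(coverage_threshold(page))
```

Running each page through this before binarizing gives all pages roughly the same blackness, which is the effect described above.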
> > If it will take only a few hours (maybe even a day or two) to
> > manually type it, I think no free OCR will do better. To beat that
> > you'd need one that needs no training and has very good results, and
> > these two requirements are hard to satisfy together. And numbers are
> > very fast to type, with very few mistakes, compared to text. It's a
> > specific niche in the job market - I know because I know people who
> > worked in this, wanted to move to OCR, and didn't, because it wasn't
> > good enough.
>
> As a very rough guess - 1000 pages (and they're pretty dense with
> digits) - typing it would simply require too many typists for too much
> time. We'd probably need some initial scan anyway - to prioritize the
> pages, and to speed up the typing by providing an initial suggestion to
> start with (e.g., the scan could leave 'unknowns' in places it's not
> sure about, and the typist would fill them in).

1000 pages sounds reasonable for trying an OCR.

--
Didi
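The 'leave unknowns' idea above can be sketched as follows. The (char, certainty) pair format is an assumption for illustration; the real shape would depend on the OCR's output (e.g. hebocr's log file).

```python
# Turn per-digit OCR output into a typist-friendly suggestion: keep
# digits the OCR is confident about, and replace low-confidence ones
# with '?' so the typist only has to fill in the gaps.

def mask_uncertain(digits, min_certainty=0.9):
    """digits: list of (char, certainty) pairs for one cell."""
    return "".join(ch if c >= min_certainty else "?"
                   for ch, c in digits)

cell = [("1", 0.99), ("7", 0.42), ("3", 0.95)]
print(mask_uncertain(cell))  # the typist sees "1?3" and fills in one digit
```

Combined with a per-page count of '?' marks, the same output could also drive the prioritization of pages mentioned above.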