Thanks a lot for the reply.
I'll have a look at both tools.

Yedidyah Bar-David wrote:

Description:
-------------------
We have a large database of scans consisting of numerical data, which need to be accurately converted to text, with minimal human intervention.
The scans are low quality (~200 dpi; some have "dirt" (random dots), some are misaligned, some are upside down, etc.).



Are they fixed-font?



Yes.  At least the pages I saw (data came from several sources).

However, the data is numbers only, and comes in ordered tables with a fixed number of columns (no border/cell lines at all). Numbers are of fixed width.
Most of the cells are full, and there are certain rules that can be applied (e.g. if a row (entry) has a number in column A, it must contain data in column B too).
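
To make the kind of rule I mean concrete, here is a minimal sketch in Python. It assumes the OCR output has already been parsed into rows with one entry per column (None for an empty cell); the function names and column indices are made up for illustration and are not part of any OCR tool.

```python
from typing import List, Optional

Row = List[Optional[str]]  # one entry per column; None means the cell is empty


def violates_a_implies_b(row: Row, col_a: int = 0, col_b: int = 1) -> bool:
    """True if column A holds a number while column B is empty."""
    return row[col_a] is not None and row[col_b] is None


def suspicious_rows(table: List[Row], col_a: int = 0, col_b: int = 1) -> List[int]:
    """Indices of rows that break the rule and deserve a second look."""
    return [i for i, row in enumerate(table)
            if violates_a_implies_b(row, col_a, col_b)]


# Hypothetical three-column table, for illustration only.
table = [
    ["0123", "4567", None],
    ["0042", None, "9001"],   # A is filled but B is empty: rule broken
    [None, None, "1234"],
]
print(suspicious_rows(table))
```

Rows flagged this way could be re-examined with different cell boundaries, or simply handed to a typist.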



I think no OCR has a language for expressing these rules, but some are script-friendly enough to allow this easily. hebocr has a "log" file in which it reports almost every important thing (including the position in pixels of every char), so you can use it instead of the raw text output.



I was thinking of writing some script/program that calls a lower-level API: call some algorithm for finding cell positions, then use the above-mentioned rules to improve this suggestion; then, once you know where the cells are and that each one should contain a number with a known number of digits, call the actual digit recognition function for each cell.
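
A rough sketch of the cell-finding step, assuming the page is available as a b/w bitmap (rows of 0/1 pixels, 1 = ink). It uses a plain projection profile, which is only one possible technique, and none of these function names come from hebocr or any other tool. The digit recognition itself is left out; this only produces the pixel slots it would be called on.

```python
from typing import List, Tuple

Bitmap = List[List[int]]  # rows of pixels, 1 = ink, 0 = background
Span = Tuple[int, int]    # (start, end) in pixels, end exclusive


def column_profile(page: Bitmap) -> List[int]:
    """Count ink pixels in every pixel column."""
    return [sum(col) for col in zip(*page)]


def find_runs(profile: List[int], min_ink: int = 1, min_gap: int = 2) -> List[Span]:
    """Spans where the profile has ink, merging spans that are
    separated by fewer than min_gap blank positions."""
    runs: List[Span] = []
    start = None
    for x, value in enumerate(profile):
        if value >= min_ink:
            if start is None:
                start = x
        elif start is not None:
            runs.append((start, x))
            start = None
    if start is not None:
        runs.append((start, len(profile)))

    merged: List[Span] = []
    for s, e in runs:
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)  # gap too small: same table column
        else:
            merged.append((s, e))
    return merged


def split_cell(span: Span, digits: int) -> List[Span]:
    """Cut a table column into `digits` equal-width slots (fixed-width font)."""
    s, e = span
    width = (e - s) / digits
    return [(round(s + i * width), round(s + (i + 1) * width))
            for i in range(digits)]


# Tiny made-up page: two table columns of two digits each.
page = [
    [1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1],
]
columns = find_runs(column_profile(page), min_gap=2)
print(columns)
```

The same find_runs applied to a row profile (ink counts per pixel row) would give the table rows. min_ink and min_gap are the kind of thresholds one would want to tweak for pages where cells get missed.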

Do you think it would be simple enough to write such an algorithm with hebocr? (e.g. does it have separately callable "region-finding" and "char-recognition" functions?)

The data must be placed in the right columns: we would rather have a known number of unknown digits (even a few incorrect ones) than 'missing' or 'added' numbers.
Preferably, we should be able to get a "certainty" measure for each digit.



I think every OCR can give you that. I know mine does.


Well, I saw some commercial tools that just don't have this (they probably have it at some lower level but don't expose it to the user). With closed source, this is a dead end...



The solution should have some adjustable configuration parameters, so that we'll be able to reprocess selected 'important' pages (based on initial results) with some human adjustments.



What parameters are you thinking of? clara has lots of such params, but I find it a bit uncomfortable to use, because you need the UI to play with them etc. Look at it, though. hebocr has very few such params. The main way to affect it is the font you provide (or build, using its output). Of course there are many things you can also do outside the OCR - e.g. if the scans are gray, you can change the white/black threshold and get a dramatic change in accuracy. A few OCRs also read gray directly, but it's not common.
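
As an illustration of the white/black threshold idea, here is a small sketch in Python. It assumes a gray scan is given as rows of 0-255 values (0 = black); the function names and the default threshold of 128 are arbitrary choices for the example, not taken from any OCR.

```python
from typing import List

Gray = List[List[int]]    # rows of pixels, 0 = black .. 255 = white
Bitmap = List[List[int]]  # rows of pixels, 1 = ink, 0 = background


def binarize(gray: Gray, threshold: int = 128) -> Bitmap:
    """Mark every pixel darker than `threshold` as ink."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def ink_fraction(bitmap: Bitmap) -> float:
    """Share of pixels that ended up as ink; a quick sanity check on a threshold."""
    total = sum(len(row) for row in bitmap)
    return sum(map(sum, bitmap)) / total if total else 0.0


# Made-up 2x3 gray patch, for illustration only.
gray = [
    [250, 140, 30],
    [200, 100, 20],
]
for t in (128, 150):
    bw = binarize(gray, t)
    print(t, bw, ink_fraction(bw))
```

Running the OCR on the output of a few different thresholds and comparing the results is one cheap way to pick a value per page.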



They're b/w. I was thinking of things like thresholds for filters and preprocessing algorithms, e.g. despeckle parameters (which you could change if a page was noisy) and thresholds for cell/area detection algorithms (which you could tweak if you see that the algorithm missed some cells).
Yes - it would probably be uncomfortable to play with (a UI could help), but you'd only do this for 'important' pages. It should still be faster than typing these pages manually.
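
For example, a despeckle filter with one adjustable parameter could look roughly like the sketch below. It assumes a b/w bitmap as rows of 0/1 pixels (1 = ink) and removes connected blobs smaller than min_size pixels; this is just one simple way to despeckle, and the name and default value are made up for illustration.

```python
from collections import deque
from typing import List

Bitmap = List[List[int]]  # rows of pixels, 1 = ink, 0 = background


def despeckle(bitmap: Bitmap, min_size: int = 3) -> Bitmap:
    """Return a copy with every 8-connected ink blob of fewer than
    min_size pixels erased."""
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    out = [row[:] for row in bitmap]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search collects one connected blob.
            blob = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and bitmap[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(blob) < min_size:
                for cy, cx in blob:
                    out[cy][cx] = 0
    return out


# Made-up patch: a 2x2 blob plus two isolated specks.
noisy = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [1, 0, 0, 0, 0],
]
print(despeckle(noisy, min_size=3))
```

Setting min_size too high starts eating real strokes (at ~200 dpi a decimal point is not much bigger than a speck), which is exactly why it would be worth adjusting per page.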


How large is the database? If it will take only a few hours (maybe
even a day or two) to manually type it, I think no free OCR will do
better. To beat that you need one that needs no training and has very
good results. These two needs are hard to provide together. And numbers
are very fast to type, with very few mistakes, compared to text. It's a
specific niche in the job market - I know because I know people who
worked in this, wanted to move to OCR, and didn't because it wasn't good
enough.



As a very rough guess - 1000 pages (and they're pretty dense with digits) - which would simply require too many typists for too much time. We'd probably need some initial OCR pass anyway, to prioritize the pages and to speed up the typing by providing an initial suggestion to start with (e.g. the OCR could leave 'unknowns' in places it's not sure about, and the typist would fill them in).
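
A sketch of what I mean by leaving 'unknowns', assuming the OCR gives a (digit, certainty) pair for each recognized character, with certainty between 0 and 1. That input format, the 0.9 cutoff and the '?' placeholder are all assumptions for the example. Note that the number of digits per cell never changes; only uncertain digits are masked.

```python
from typing import List, Tuple

Digit = Tuple[str, float]  # (recognized character, certainty in 0..1)


def mask_uncertain(cell: List[Digit], min_certainty: float = 0.9,
                   unknown: str = "?") -> str:
    """Keep confident digits, replace the rest with a placeholder.
    The output always has exactly one character per input digit."""
    return "".join(ch if cert >= min_certainty else unknown
                   for ch, cert in cell)


def unknown_fraction(cells: List[List[Digit]], min_certainty: float = 0.9) -> float:
    """Share of digits on a page that a typist would have to fill in;
    usable for sorting pages by how much manual work they need."""
    digits = [d for cell in cells for d in cell]
    if not digits:
        return 0.0
    low = sum(1 for _, cert in digits if cert < min_certainty)
    return low / len(digits)


# Made-up recognition results for two cells.
page_cells = [
    [("4", 0.99), ("7", 0.55), ("1", 0.93)],
    [("0", 0.97), ("0", 0.98), ("8", 0.91)],
]
print([mask_uncertain(cell) for cell in page_cells])
print(unknown_fraction(page_cells))
```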


