On Thu, Apr 21, 2005 at 01:15:41AM +0300, Amit Aronovitch wrote:
> Thanks a lot for the reply. I'll try having a look at both tools.
>
> Yedidyah Bar-David wrote:
> >> Description:
> >> -------------------
> >> We have a large database of scans consisting of numerical data, which
> >> need to be accurately converted to text, with minimal human
> >> intervention. The scans are low quality (~200 dpi; some with "dirt"
> >> (random dots), some misaligned, some upside-down, etc.).
> >
> > Are they fixed-font?
>
> Yes. At least the pages I saw (the data came from several sources).
So it'll be easier. hebocr and clara handle only fixed fonts.

> >> However, the data is numbers only, and comes in ordered tables with a
> >> fixed number of columns (no border/cell lines at all). Numbers are of
> >> fixed width. Most of the cells are full, and there are certain rules
> >> which can be applied (e.g., if a row (entry) has a number in column A,
> >> it must contain data in column B too).
> >
> > I think no OCR has a language to express these rules, but some are
> > script-friendly enough to allow this easily. hebocr has a "log" file,
> > in which it reports almost every important thing (including the
> > position, in pixels, of every char), so you can use it instead of the
> > raw text output.
>
> I was thinking of writing some script/program that calls some
> lower-level API (e.g., call some algorithm for finding cell positions,
> then use the above-mentioned rules to improve its suggestion; then, once
> you know where the cells are, and you know each one should contain a
> number with a known number of digits, call the actual digit-recognition
> function for each cell).
>
> Do you think it would be simple enough to write such an algorithm with
> hebocr? (e.g., does it have separately callable "region-finding" and
> "char-recognition" functions?)

It does have specific sets of functions (it's C - no classes, packages,
etc.) for dealing with different tasks. It has a short text file
describing how it works internally (written by me), and, as I said,
someone wrote a much longer tutorial; I don't know whether it is
available online - I'll ask his permission.

> >> The data must be placed in the right columns: we prefer having a
> >> known number of unknown digits (even a few incorrect ones) to
> >> 'missing' or 'added' numbers. Preferably, we should be able to get a
> >> "certainty" measure for each digit.
> >
> > I think every OCR can give you that. I know mine does.
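The glue layer Amit describes (find cells, apply the column rules, then recognize each cell) could be sketched roughly as below. This is a minimal illustration, not hebocr's API: hebocr exposes no such Python interface, so `validate_table` and the data shapes here are hypothetical, and in practice you would drive it by parsing the log file mentioned above.

```python
# Sketch of the rule-checking step: given a grid of detected cell
# contents (None = cell not found), enforce the thread's example rule
# "if a row has a number in column A, it must have data in column B".
# Violations are marked with "?" for re-detection or manual review.
# All names here are illustrative, not part of any real OCR API.

def apply_row_rule(row, col_a, col_b):
    """Return the row with column col_b flagged when the rule is broken."""
    flagged = list(row)
    if row[col_a] is not None and row[col_b] is None:
        flagged[col_b] = "?"  # column A is full but B came back empty
    return flagged

def validate_table(rows, col_a=0, col_b=1):
    return [apply_row_rule(r, col_a, col_b) for r in rows]

rows = [
    ["123", "456"],
    ["789", None],   # violates the rule: A full, B empty
    [None, None],    # fine: A empty, so the rule does not apply
]
print(validate_table(rows))
```

The same loop is where you would also check the "known number of digits per cell" constraint before calling the per-cell recognition function.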
> Well, I saw some commercial tools that just don't have this (probably
> they have it at some lower level but don't export it to the user; with
> closed source, this is a dead end...).
>
> >> The solution should have some adjustable configuration parameters,
> >> so that we'll be able to reprocess selected 'important' pages (based
> >> on initial results) with some human adjustments.
> >
> > What parameters do you have in mind?
> > clara has lots of such params, but I find it a bit uncomfortable to
> > use, because you need the UI to play with them, etc. Look at it,
> > though. hebocr has very few such params; the main way to affect it is
> > the font you provide (or build, using its output). Of course, there
> > are many things you can also do outside the OCR - e.g., if the scans
> > are gray, you can change the white/black threshold and get a dramatic
> > change in accuracy. A few OCRs also read gray directly, but it's not
> > common.
>
> They're b/w. I was thinking of stuff like thresholds for some filters
> and preprocessing algorithms, e.g. despeckle parameters (which you
> could change if a page was noisy), or thresholds for cell/area-detection
> algorithms (if for some reason you see that the algorithm missed some
> cells, you can tweak it).

I don't think FOSS OCRs deal with such things. There are other tools for
this, and you can glue them together with a small script. I personally
scanned in grayscale and only played with the white/black threshold. In
my case there were pages with different blackness, so eventually I wrote
a little script that computes the threshold so that they all have
more-or-less the same coverage (17% in my case).

> Yes - it would probably be uncomfortable to play with (a UI could
> help), but you'd only do this for 'important' pages. It should still be
> faster than typing these pages manually.

> > How large is the database?
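The coverage-equalizing threshold script mentioned above could look roughly like this. It is a sketch only, assuming a flat list of 0-255 gray values; a real script would pull the pixels from an image library, and the original script's details are not in the thread.

```python
# Pick the white/black cutoff so that a fixed fraction of the page
# (e.g. the 17% coverage mentioned in the thread) comes out black.
# Pixels are plain 0-255 grayscale values; a pixel counts as black
# when its value is at or below the returned threshold.

def coverage_threshold(pixels, target=0.17):
    """Return the gray level t for which the fraction of pixels
    at or below t is as close as possible to `target`."""
    n = len(pixels)
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    best_t, best_err, darker = 0, 1.0, 0
    for t in range(256):
        darker += hist[t]
        err = abs(darker / n - target)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

# Toy page: 17 dark pixels out of 100, so the threshold lands on the
# boundary between the dark and light populations.
page = [10] * 17 + [200] * 83
print(coverage_threshold(page))
```

Running each page through this before binarizing gives all pages roughly the same blackness, which is the effect described above.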
> > If it will take only a few hours (maybe even a day or two) to
> > manually type it, I think no free OCR will do better. To beat that
> > you'd need one that needs no training and has very good results, and
> > these two requirements are hard to satisfy together. And numbers are
> > very fast to type, with very few mistakes, compared to text. It's a
> > specific niche in the job market - I know because I know people who
> > worked in this, wanted to move to OCR, and didn't, because it wasn't
> > good enough.
>
> As a very rough guess - 1000 pages (and they're pretty dense with
> digits) - typing it would simply require too many typists for too much
> time. We'd probably need some initial scan anyway - to prioritize the
> pages, and to speed up the typing by providing an initial suggestion to
> start with (e.g., the scan could leave 'unknowns' in places it's not
> sure about, and the typist would fill them in).

1000 pages sounds reasonable for trying an OCR.

--
Didi
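The 'leave unknowns' idea above can be sketched as follows. The (char, certainty) pair format is an assumption for illustration; the real shape would depend on the OCR's output (e.g. hebocr's log file).

```python
# Turn per-digit OCR output into a typist-friendly suggestion: keep
# digits the OCR is confident about, and replace low-confidence ones
# with '?' so the typist only has to fill in the gaps.

def mask_uncertain(digits, min_certainty=0.9):
    """digits: list of (char, certainty) pairs for one cell."""
    return "".join(ch if c >= min_certainty else "?"
                   for ch, c in digits)

cell = [("1", 0.99), ("7", 0.42), ("3", 0.95)]
print(mask_uncertain(cell))  # the typist sees "1?3" and fills in one digit
```

Combined with a per-page count of '?' marks, the same output could also drive the prioritization of pages mentioned above.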