On Wed, Apr 20, 2005 at 01:07:28AM +0300, Amit Aronovitch wrote:
> Hi list,
>
> I'm looking for a free software tool for solving a specific OCR task
> (details below).
> I've googled up several projects, but because of time limits I'll not be
> able to give them a fair try.
> If you have some experience with such tools, I'll be grateful for any
> tips. Specifically - how hard it would be to use them for solving the
> specific task at hand.
>
> Note that since the task is not a "standard" one, I don't expect any
> tool to give good enough results out-of-the-box.
> Instead, I expect it to be modular, configurable, and scriptable enough
> for writing a task-customised solution in short time.

I naturally know well only the one I wrote myself, hebocr. I am not sure
if it's good for your needs, but I _am_ sure it's the smallest one in
lines of code, so if you need to change it, it will probably be the
easiest one to try. I played a bit with clara and it seems quite good.
The latest news about hebocr is that someone (not me) invested some time
in it, and now it works again (even on cygwin) and has a tutorial. I
still consider it dead, unless someone wants to take over.

> Description:
> -------------------
> We have a large database of scans consisting of numerical data, which
> need to be accurately converted to text, with minimal human intervention.
> The scans are low quality (~200dpi, some with "dirt" (random dots), some
> misaligned, some upside-down etc.).

Are they fixed-font?

> However, the data is numbers only, and comes in ordered tables with a
> fixed number of columns (no border/cell lines at all). Numbers are of
> fixed width.
> Most of the cells are full + there are certain rules which can be
> applied (i.e.: if a row (entry) has a number in column A, it must
> contain data in column B too).

I think no OCR has a language for expressing such rules, but some are
script-friendly enough to allow this easily. hebocr has a "log" file in
which it reports almost every important thing (including the position in
pixels of every char), so you can use it instead of the raw text output.

> The data must be placed in the right columns: we prefer having a known
> number of unknown digits (even a few incorrect ones) than 'missing' or
> 'added' numbers.
> Preferably, we should be able to get a "certainty" measure for each
> digit.

I think every OCR can give you that. I know mine does.

> The solution should have some adjustable configuration parameters, such
> that we'll be able to reprocess selected 'important' pages (based on
> initial results) with some human adjustments.

What parameters do you have in mind?
clara has lots of such params, but I find it a bit uncomfortable to use,
because you need the UI to play with them etc. Look at it, though.
hebocr has very few such params. The main way to affect it is the font
you provide (or build, using its output).

Of course there are many things you can also do outside the OCR - e.g.
if the scans are gray, you can change the white/black threshold and get
a dramatic change in accuracy. A few OCRs also read gray directly, but
it's not common.

How large is the database? If typing it manually would take only a few
hours (maybe even a day or two), I think no free OCR will do better. To
beat that you need one that needs no training and gives very good
results, and these two requirements are hard to meet together. Numbers
are also very fast to type, with very few mistakes, compared to text.
It's a specific niche in the job market - I know because I know people
who worked in this, wanted to move to OCR, and didn't because it wasn't
good enough.

> ----
> Sorry for the somewhat OT issue, but I know of no other forum I expect
> to answer this in short time.
> (besides, the preferred target platform IS linux :-) ).

There was a long thread about OCR in whatsup around a year ago. You can
also ask there.

> Amit A.

--
Didi
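
The idea of working from per-character pixel positions instead of the raw
text, and then applying the table rules in a script, can be sketched roughly
as follows. This is a minimal Python sketch, not hebocr's actual interface:
the record layout (x, y, char, confidence), the function names, the column
edges, the fixed row height and the 0.5 confidence cutoff are all assumptions
made for illustration. The real log would first have to be parsed into this
shape, and misaligned scans would need deskewing before rows can be bucketed
this simply.

```python
from bisect import bisect_right

def column_of(x, column_starts):
    """Index of the column whose left edge is the last one at or before x."""
    return max(bisect_right(column_starts, x) - 1, 0)

def build_rows(chars, column_starts, row_height, min_conf=0.5):
    """Group hypothetical (x, y, char, confidence) records into table rows.

    Low-confidence characters become '?', so a cell keeps its digit count
    even when a digit is unknown.
    """
    rows = {}
    # Sort by row bucket, then left to right, so digits append in reading order.
    for x, y, ch, conf in sorted(chars, key=lambda c: (c[1] // row_height, c[0])):
        row = rows.setdefault(y // row_height, [""] * len(column_starts))
        row[column_of(x, column_starts)] += ch if conf >= min_conf else "?"
    return [rows[k] for k in sorted(rows)]

def rule_violations(rows, col_a=0, col_b=1):
    """Indices of rows that have a value in column A but nothing in column B."""
    return [i for i, row in enumerate(rows) if row[col_a] and not row[col_b]]

# Made-up sample data: two columns starting at x=0 and x=50, rows 20 px high.
sample = [(10, 5, "1", 0.9), (18, 6, "2", 0.8), (60, 5, "7", 0.95),
          (10, 25, "3", 0.9)]
table = build_rows(sample, column_starts=[0, 50], row_height=20)
flagged = rule_violations(table)  # rows to send back for human review
```

Rows returned by `rule_violations` are natural candidates for reprocessing
with different settings or for manual checking.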

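
The remark about moving the white/black threshold on gray scans can be
illustrated with a small sketch. It assumes the scan is already loaded as
rows of 0-255 gray values (0 = black); `binarize`, `ink_fraction` and the
sample values are hypothetical, and a real pipeline would read the image
with an imaging library and tune the threshold per batch of scans.

```python
def binarize(gray, threshold):
    """Map a grayscale image (rows of 0-255 ints) to 1 = ink, 0 = paper."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def ink_fraction(bw):
    """Share of pixels classified as ink; a quick check when tuning."""
    total = sum(len(row) for row in bw)
    return sum(map(sum, bw)) / total if total else 0.0

# Made-up 2x3 patch: the mid-gray pixel (120) flips between paper and ink
# depending on where the threshold sits.
patch = [[250, 120, 30],
         [240, 200, 90]]
low = binarize(patch, 100)   # only the clearly dark pixels count as ink
high = binarize(patch, 128)  # the mid-gray pixel now counts as ink too
```

Comparing OCR accuracy across a few threshold values on a small hand-checked
sample is one way to pick a setting before running the whole database.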