On Thu, 30 Jan 2003, Christopher Sawtell wrote:

> On Thu, 30 Jan 2003 17:39, Rex Johnston wrote:
>
> > As for OCR, well last time i looked, it did actually suck.
> > Anyone used clara ?
>
> In a word, hopeless. You have to train it for every font you are
> going to use, and even then it only works more than half reliably if
> the print quality is absolutely exceptional.

Yes, that is exactly what clara is supposed to be, and as such it
works fine, provided you use it for its intended purpose: scanning
whole books, where you use the first two pages for training and then
start production. I have tried this on a 102-year-old book that is
neither printed very sharply nor on smooth, white paper. If you want
to do something quickly, however, clara is not the tool to use. And
you should not economize on scan resolution and file size if you want
to use clara. Not being an OCR expert, I found I had to read the
manual to get things like joining broken characters and tuning the
recognition right, but once it is tuned properly it works all right.

The other approach (was it called unifont?) is used in gocr. No
training is required, but if the font you have is far from what gocr
knows, it does not work so well. It is always worth a try, though,
because it is easy to use and fast, and you see quickly whether it
works with your scans. I once compared gocr results with those from
the OCR program that came with an HP scanner (was that OmniPage?),
and there was not much difference. Some pages had Times and Helvetica
in different sizes, all on one page, and neither gocr nor the
commercial program got everything right. I did not try clara on these
documents, because clara is made for long documents in a single font.
You can, of course, train clara to recognize more than one font on a
page, but that means more training, and the results will not
necessarily get better.

> To write an OCR app which works properly is, imho, beyond the means
> of the usual kind of Linux project. The road is absolutely littered
> with failed attempts. I looked into the subject fairly thoroughly a
What is the "usual kind of Linux project"? With all the variety of
open source software, ranging from three-line Perl scripts, through
medium-scale applications like the GIMP, to professional-quality
simulation packages like Ptolemy (which is, IMHO, technically better
than its commercial competitors, who sell a single-user license at
prices far in excess of several tens of thousands of dollars), there
does not seem to be a "typical" project.

And the experience and education levels of open source programmers
vary just as widely. The fact that software is freely available says
nothing about its quality. Neither does paying dearly for commercial
software guarantee quality: I have seen commercial packages costing
between 200 k$ and 500 k$ per single-user license simply fail to do
what the offer / technical spec said. The bad thing about software is
that it can be too complex even for experts to judge before buying,
even after some testing; and in many cases the license disclaimers
would at most allow you to return the software and reclaim your
money, but they will never cover the damage caused!

> year or two ago because I have a friend who is rather severely
> handicapped visually, and I wanted to make him a book reader. I
> ended up reading the book aloud to him.

Good on you! At least he got it read in much better quality than any
machine could possibly have managed. BTW, I just returned to the
public library a 5-CD set of Douglas Adams' "The Restaurant at the
End of the Universe", read by the author himself.

And one more point concerning e-texts and open software: before you
scan a book, always check first at Project Gutenberg whether someone
has already done the scanning and OCR. But that is certainly nothing
new to you :-)

Cheers,

Helmut.

+----------------+
| Helmut Walle   |
| [EMAIL PROTECTED] |
+----------------+
