On Tue, Mar 20, 2007 at 03:51:50PM +0200, Hetz Ben Hamo wrote:

> * can use Sane to scan a document
> * can save it to PDF
> * The PDF shouldn't be a dumb TIFF/JPG file page/collection, but a
> "real" PDF (so I can search/grep for words in the scanned doc)
> * Should have some basic hebrew OCR (optionally)
> 
> Any suggestions?

Windows. Seriously, OCR is not a new technology. What makes some
programs better than others is the large library of fonts that they
know how to handle. Commercial programs include lots of code to handle
many fonts of different types and sizes.

For Linux, there is an open source program that will do OCR, but it is
designed for books. It comes with a very limited library of font
information, if any at all. When you use it, you need to train it to
understand the font. Since books are usually printed with a handful
of fonts (or fewer), it's very good for them.

Another problem is that OCR is inherently error-prone. Most OCR programs
claim 95% accuracy, some as high as 99%. If an average line of text
is 60 characters, then even at 99% you should expect about one error on
every other line, and at 95% about three errors per line. If your text
lines are longer, or the page contains mixed numbers and letters, the
accuracy goes down significantly.
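
To make the arithmetic concrete, here is a rough sketch in Python. It
assumes that every character is misread independently with the same
probability, which real OCR output does not strictly obey, and the
function names are only illustrative.

```python
def expected_errors_per_line(accuracy: float, chars_per_line: int = 60) -> float:
    """Average number of misread characters on one line."""
    return (1.0 - accuracy) * chars_per_line


def prob_line_has_error(accuracy: float, chars_per_line: int = 60) -> float:
    """Chance that a line contains at least one misread character,
    assuming independent per-character errors."""
    return 1.0 - accuracy ** chars_per_line


# Compare the two accuracy figures that OCR programs typically claim.
for accuracy in (0.95, 0.99):
    print(
        f"{accuracy:.0%} accuracy: "
        f"{expected_errors_per_line(accuracy):.1f} errors per line on average, "
        f"{prob_line_has_error(accuracy):.0%} of lines have at least one error"
    )
```

Under that assumption, 99% accuracy still leaves a little under half of
all 60-character lines with at least one error, and at 95% nearly every
line has one.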

A program that can take a random page of text and produce something
usable is very rare, depending upon your definition of usable.
If you just want to be able to search your library based upon keywords,
then it will probably work. If you want 100% accuracy in search, or
in text, then you will need to do a lot more than OCR.

If you are OCRing a library then you have a better chance. You will still
have confusion between a zero, the letter O in Latin alphabets, and a
samech. Ones and sevens also get confused, and so on.

Geoff.

-- 
Geoffrey S. Mendelson, Jerusalem, Israel [EMAIL PROTECTED]  N3OWJ/4X1GM
IL Voice: (07)-7424-1667  Fax ONLY: 972-2-648-1443 U.S. Voice: 1-215-821-1838 
Visit my 'blog at http://geoffstechno.livejournal.com/
