On Tue, Mar 20, 2007 at 03:51:50PM +0200, Hetz Ben Hamo wrote:
> * can use Sane to scan a document
> * can save it to PDF
> * The PDF shouldn't be a dumb TIFF/JPG file page/collection, but a
>   "real" PDF (so I can search/grep for words in the scanned doc)
> * Should have some basic hebrew OCR (optionally)
>
> Any suggestions?

Windows.

Seriously, OCR is not a new technology. What makes one program better than another is the large library of fonts it knows how to handle. Commercial programs include a lot of code to handle many fonts, of different types and sizes.

For Linux, there is an open source program that will do OCR, but it is designed for books. It comes with a very limited library of font information, if any at all. When you use it, you need to train it to understand the font. Since books are usually printed with a handful of fonts (or fewer), it's very good for them.

Another problem is that OCR is inherently error-prone. Most OCR programs claim 95% accuracy, some as high as 99%. If an average line of text is 60 characters, then even at 99% you WILL have about one error on every other line, and at 95% about three errors per line. If your text lines are longer, or the page mixes numbers and letters, the accuracy goes down significantly.

A program that can take a random page of text and produce something usable is very rare, depending upon your definition of usable. If you just want to be able to search your library based upon keywords, then it will probably work. If you want 100% accuracy in search, or in text, then you will need to do a lot more than OCR.

If you are OCRing a library then you have a better chance. You will still have confusion between a zero, the letter O in Latin alphabets, and a samech. Ones and sevens also get confused, and so on.

Geoff.
-- 
Geoffrey S. Mendelson, Jerusalem, Israel [EMAIL PROTECTED] N3OWJ/4X1GM
IL Voice: (07)-7424-1667 Fax ONLY: 972-2-648-1443 U.S. Voice: 1-215-821-1838
Visit my 'blog at http://geoffstechno.livejournal.com/
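
For anyone who wants to check the accuracy arithmetic in the message above, here is a rough sketch in Python. It assumes every character is misread independently with the same probability, which is a simplification (real OCR errors tend to cluster on particular glyphs such as 0/O or 1/7). The function names and the 60-character default are illustrative only and do not come from any OCR package.

```python
def expected_errors_per_line(accuracy: float, chars_per_line: int = 60) -> float:
    """Average number of misread characters on one line.

    Assumes each character is wrong with probability (1 - accuracy),
    independently of the others.
    """
    return (1.0 - accuracy) * chars_per_line


def prob_line_has_error(accuracy: float, chars_per_line: int = 60) -> float:
    """Probability that a line contains at least one misread character,
    under the same independence assumption."""
    return 1.0 - accuracy ** chars_per_line


if __name__ == "__main__":
    # Compare the two accuracy figures quoted in the message.
    for acc in (0.95, 0.99):
        avg = expected_errors_per_line(acc)
        p = prob_line_has_error(acc)
        print(f"accuracy {acc:.0%}: {avg:.1f} errors/line on average, "
              f"{p:.0%} of lines contain an error")
```

Under these assumptions, a claimed 99% works out to a little under half of all 60-character lines containing an error, and a claimed 95% leaves almost no line clean.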
