Re: [opensuse] OCR and batches

Anders Norrbring Sun, 18 Mar 2007 07:08:20 -0800

Kai Ponte wrote:

On Sunday 18 March 2007 02:57:49 am Anders Norrbring wrote:
Does anybody know of a way to scan several thousand pictures on disk
with an OCR application to look for a specific text, and then list the
images where that text was found?
I've been doing apps like that for the better part of twelve years. However, I've yet to see an OCR app in Linux. That doesn't mean they don't exist, however, because I'm stuck in a predominantly Windows world at work. I know we currently either use OCR For Anydocs or Kofax Ascent. In fact, we're looking at replacing our current systems in the next few years.

What you will need to do is probably write some program to take the imaged documents - done so with whatever scanner you've got - and then process the documents through the OCR engine supplied by the manufacturer. Typically this is a library like the AVI or MP3 libraries used by your most commonly requested SUSE applications.

Keep in mind, that you'll need to also have a retrieval program of some sort, to actually get the documents and view them - along with the OCR data - in some manner. This is one I wrote in 2003, which combined OCR from barcode and an imaging application based on FileNet:

http://www.filesite.org/viewtopic.php?t=173

You didn't mention whether you're doing spot or forms recognition or full-text OCR. You might also look at barcode recognition, because those are VERY reliable, even over fax.

Try these links for the OCR software:

Google apparently has an OCR engine that is now OSS..

http://google-code-updates.blogspot.com/2006/08/announcing-tesseract-ocr.html

http://sourceforge.net/projects/tesseract-ocr
..never heard of it before this morning. Should be interesting to look at though. Apparently it is an old HP-based software that had been shelved for twelve years and now resurrected.

ABBYY is a well-known industrial-strength app for OCR. I've never personally used them (mostly stick with Caere) but have heard great things....AND...they have an SDK for *nix and/or TheCultOfMac.

http://www.abbyy.com/sdk/?param=59956

I also saw this one...

http://www.linux-ocr.ekitap.gen.tr/

Keep in mind that we process over 3M documents/year - that comes out to roughly 15,000 every day, including weekends. We currently have eight high-speed scanners, and are evaluating whether to purchase some new Kodak i860 models at $75,000 each. I just state this so you know our volume.

Thanks. I'm not locked to Linux for this adventure; the images are stored on a Linux system but can be accessed from Windows.

Maybe I should explain more about what I'm trying to do; it's not really a full-text scan.

The images are from a dozen or so different photographers, who all put their copyright notice in text on every image. What I want to accomplish is to categorize them all according to who took the picture, in other words, sort them by photographer name. So, the OCR should only read one or two words out of a maximum of 4-6 words somewhere in the image.

It's also a one-time thing to do, so I cannot justify a license cost for a fully fledged OCR suite.
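
For what it's worth, the sorting described above could probably be scripted along these lines. This is only a rough sketch, not a tested solution: it assumes a `tesseract` binary on the PATH that accepts `stdout` as its output target (true of recent releases, not checked against the 2006 one), and the names `tesseract_ocr`, `normalize` and `sort_by_photographer` are made up for illustration. The OCR step is passed in as a function, so any other engine can be swapped in.

```python
import subprocess
from collections import defaultdict
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def tesseract_ocr(image_path: Path) -> str:
    """Return the text tesseract recognizes in one image.

    Assumption: a `tesseract` executable is installed and supports
    writing its result to standard output via the `stdout` argument.
    """
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout"],
        capture_output=True,
        text=True,
        check=False,  # a failed image just yields whatever text was produced
    )
    return result.stdout


def normalize(text: str) -> str:
    """Lower-case the text and collapse all whitespace runs to one space."""
    return " ".join(text.lower().split())


def sort_by_photographer(
    images: Iterable,
    photographers: Iterable[str],
    ocr: Callable[[Path], str] = tesseract_ocr,
) -> Dict[str, List[Path]]:
    """Group image paths by the first photographer name found in their text.

    Images where no known name is recognized are collected under "unknown".
    """
    wanted = {name: normalize(name) for name in photographers}
    groups: Dict[str, List[Path]] = defaultdict(list)
    for image in images:
        path = Path(image)
        text = normalize(ocr(path))
        # Plain substring match; OCR misreads will land in "unknown".
        match = next(
            (name for name, key in wanted.items() if key in text),
            "unknown",
        )
        groups[match].append(path)
    return dict(groups)
```

It would be called with something like `sort_by_photographer(Path("photos").rglob("*.jpg"), ["Jane Doe", "John Smith"])`, where the directory and the names are placeholders. Small copyright text laid over a photograph may well be misread, so the "unknown" pile would need a manual pass, and cropping to the area where the notice usually sits might improve the hit rate.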


I'll take a look at the links you provided, thanks!

--

Anders Norrbring
Norrbring Consulting
