Re: [MacGroup] A New Discovery

Lee Larson Wed, 02 Jan 2013 08:27:01 -0800

On Jan 1, 2013, at 11:37 PM, John Robinson <[email protected]> wrote:


> You really have me wondering.  On EVERY scan there wasn't a single word 
> missed, and when I would do a search on even the smallest of print in the 
> front of the magazine it would find the word every time.  When I would choose 
> a person or company in Spotlight it would find them; often six or seven of 
> the 16 I have now scanned would have info on the question I had asked. 

We tested by starting with the LaTeX source for several complicated papers in 
different languages (English, German and French). Then we compiled and printed 
the results. These were fed through a pretty high-end Xerox scanner. After 
running OCR, we compared the text layer of the PDF+OCR to the text we started 
with and worked out the error percentage. After training, we tried it again to 
see how much improvement was evident.
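
For anyone who wants to repeat this kind of comparison, here is a minimal sketch in Python. It assumes the error percentage is the character-level edit distance between the original text and the OCR text, divided by the length of the original; that is only one reasonable definition, not necessarily the one used in the test described above. The function names are made up for illustration.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions, or substitutions that turn reference into hypothesis."""
    # previous[j] holds the distance between the reference prefix handled
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def error_percentage(reference: str, ocr_text: str) -> float:
    """Character errors as a percentage of the reference length
    (an assumed definition of the error rate)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 100.0 * edit_distance(reference, ocr_text) / len(reference)
```

Running error_percentage on the same pages before and after training gives two numbers that can be compared directly. Note that the plain edit distance is quadratic in the text length, so for a long paper it is more practical to compare page by page.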

All of the programs were confused by mathematical formulae, but we were mostly 
interested in the text for searching, so that didn't bother us. None of them 
consistently scored above 98% accuracy. They were somewhat sensitive to fonts, 
with "Times-like" serif fonts seeming to be the best and small sans-serif 
fonts the worst.

Since the journal has been using the same fonts for years (CM and Lucida 
families), training made a lot of difference in figuring out individual glyphs.

All of these programs use dictionaries to aid in recognizing words. If the 
program can figure out, say, five of the six letters in a word, then it can 
make a pretty good guess about the sixth letter using its dictionary. Our text 
has a lot of technical words that don’t appear in the standard dictionaries 
bundled with the programs. Training has a great effect here as well.
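
The dictionary idea can be illustrated with a toy sketch in Python. This is not how any of the commercial engines actually work (they presumably use statistical models and per-glyph confidence scores); it only shows the principle of filling in an unrecognized letter from a word list. The "?" placeholder, the function name, and the word lists are all invented for the example, and treating "training" as adding words to the list is a simplification.

```python
from typing import Iterable, List


def candidates(pattern: str, dictionary: Iterable[str],
               unknown: str = "?") -> List[str]:
    """Return the dictionary words that fit a partially recognized word.

    Each `unknown` character in `pattern` stands for a glyph the OCR
    could not identify; every other character must match exactly.
    """
    matches = []
    for word in dictionary:
        if len(word) != len(pattern):
            continue
        if all(p == unknown or p == c for p, c in zip(pattern, word)):
            matches.append(word)
    return sorted(matches)


# A stand-in for a bundled general-purpose dictionary (hypothetical).
standard_words = ["scanner", "scatter", "spanner", "lemon", "llama"]

# The same list after adding a technical term (hypothetical).
trained_words = standard_words + ["lemma"]
```

With these lists, candidates("sc?nner", standard_words) narrows the guess to "scanner", while candidates("lem?a", standard_words) finds nothing until the technical word is added to the list.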


> Is there something more I should be looking for?  My needs with prospectuses, 
> annual reports, Edgar 10k & 2k's.  I will have Barrons (once they release a 
> PDF ver., can't scan in that large a paper), Investor's Business Daily, 
> Forbes, Fortune and a few others.  Text will be my main data but the filings 
> with the SEC will have numbers and tables.  

If what you're using works, great! But all of these programs have their own 
strengths and weaknesses.

> What am I missing, what do I need that Acrobat may not be giving?

Maybe nothing. It seems that you're scanning English text with a Times-like 
font. (Don't really know because I haven't looked at a Forbes in … well … 
perhaps not this millennium.)

Another thing to keep in mind is that there aren't really too many different 
OCR "engines" floating around. There are tons of programs that do OCR, but most 
of them are using software licensed from Readiris, OmniPage or ABBYY. For 
example, PDFpen uses OmniPage and many of those low-end programs bundled with 
scanners use Readiris. You can’t tell because they slap their own front ends 
onto the engine.

There are even a few free engines out there. The only one I've tried is 
OCRopus. It's pretty fussy, but it does work.

I usually use Readiris, but I do use PDFpen quite a lot because I can annotate 
the PDF pretty easily.
