On Jan 1, 2013, at 11:37 PM, John Robinson <[email protected]> wrote:
> You really have me wondering. On EVERY scan there wasn't a single word
> missed, and when I would do a search on even the smallest of print in the
> front of the magazine it would find the word every time. When I would choose
> a person or company in Spotlight it would find them, often there would be six
> or seven of the now 16 I have now scanned that would have info. on the
> question I had ask.

We tested by starting with the LaTeX source for several complicated papers in different languages (English, German and French). Then we compiled and printed the results. These were fed through a pretty high-end Xerox scanner. After running OCR, we compared the text layer of the PDF+OCR to the text we started with and worked out the error percentage. After training, we tried it again to see how much improvement was evident.

All of them were confused by mathematical formulae, but we were mostly interested in the text for searching, so that didn't bother us. None of them consistently scored above 98%. They were somewhat sensitive to fonts, with "Times-like" fonts with serifs seeming to be the best and small sans-serif fonts the worst. Since the journal has been using the same fonts for years (CM and Lucida families), training made a lot of difference in figuring out individual glyphs.

All of these programs use dictionaries to aid in recognizing words. If the program can figure out, say, five of the six letters in a word, then it can make a pretty good guess about the sixth letter using its dictionary. Our text has a lot of technical words that don’t come in the standard dictionaries bundled with the programs. Training has a great effect here as well.

> Is there something more I should be looking for? My needs with prospectuses,
> annual reports, Edgar 10k & 2k's. I will have Barrons (once they release a
> PDF ver., can't scan in that large a paper), Investor's Business Daily,
> Forbes, Fortune and a few others. Text will be my main data but the filings
> with the SEC will have numbers and tables.

If what you're using works, great! But all of these programs have their own strengths and weaknesses.

> What am I missing, what do I need that Acrobat may not be giving?

Maybe nothing. It seems that you're scanning English text with a Times-like font. (I don't really know, because I haven't looked at a Forbes in … well … perhaps not this millennium.)

Another thing to keep in mind is that there aren't really too many different OCR "engines" floating around. There are tons of programs that do OCR, but most of them are using software licensed from Readiris, Omnipage or ABBYY. For example, PDFpen uses Omnipage and many of those low-end programs bundled with scanners use Readiris. You can’t tell because they slap their own front ends onto the engine.

There are even a few free engines out there. The only one I've tried is OCRopus. It's pretty fussy, but it does work.

I usually use Readiris, but I do use PDFpen quite a lot because I can annotate the PDF pretty easily.
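
P.S. If you want to run the same kind of check on your own scans, here is a rough Python sketch of the two ideas above: working out an error percentage against a known text, and using a word list to fill in a letter the OCR couldn't read. It is only an illustration, not the script we actually used. The function names (ocr_error_percentage, guess_word) and the "?" placeholder for an unread letter are made up for this example, and it assumes you have already extracted the PDF's text layer into a plain string.

```python
import difflib
import re


def ocr_error_percentage(reference: str, ocr_text: str) -> float:
    """Percentage of reference words with no match in the OCR output.

    Hypothetical helper: aligns the two word sequences and counts the
    reference words that could not be matched.
    """
    ref_words = reference.split()
    ocr_words = ocr_text.split()
    if not ref_words:
        return 0.0
    # autojunk=False so common words are not discarded on long texts.
    matcher = difflib.SequenceMatcher(a=ref_words, b=ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * (len(ref_words) - matched) / len(ref_words)


def guess_word(partial: str, dictionary: list[str], unknown: str = "?") -> list[str]:
    """Dictionary words that fit a partly recognized word.

    Hypothetical helper: `unknown` marks each letter the OCR could not read.
    """
    pattern = re.compile(
        "".join("." if ch == unknown else re.escape(ch) for ch in partial)
    )
    return [word for word in dictionary if pattern.fullmatch(word)]


if __name__ == "__main__":
    source = "the quick brown fox jumps"
    scanned = "the quick brovvn fox jumps"
    print(ocr_error_percentage(source, scanned))  # 20.0

    words = ["scanner", "spanner", "scan"]
    print(guess_word("sc?nner", words))  # ['scanner']
```

Counting whole words is a deliberate simplification: one wrong letter costs the whole word, which is roughly what matters for searching. A character-level comparison would give a lower error figure for the same scan.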
_______________________________________________ MacGroup mailing list [email protected] http://www.math.louisville.edu/mailman/listinfo/macgroup
