Re: If you OCR, always archive the bitmaps too - Re: Regarding Manuals

Paul Koning Sun, 27 Sep 2015 14:34:16 -0700

> On Sep 26, 2015, at 5:42 PM, Toby Thain <[email protected]> wrote:
> ...
> Software which "recreates" the typography of a document from OCR does not 
> produce an acceptable substitute, I've yet to see a book that wasn't ruined 
> by it.



True.  But that's not the biggest problem with OCR.  The biggest problem is 
that even professional-grade OCR programs have rather low accuracy.  Maybe they 
do acceptably well on really high-grade scans of very clean new documents, but 
on books, typewritten documents, etc., even after you use the "train" feature 
you need to spend a long time cleaning up.  If you're lucky, that is still 
faster than retyping things.  If you're not, it isn't: two of us recently 
retyped 300 pages of line printer listing because that was faster and more 
accurate than OCR on that particular printout.

Given that OCR can only do, at best, a just barely acceptable recognition of 
the letters of the alphabet, it follows that recognizing the actual font used 
will be vastly less accurate.  And indeed you can see that clearly.

I wonder if there are OCR programs that can be told to choose among 2 or 3 
fonts, as opposed to guessing from the entire inventory of the machine.  If so, 
and if those fonts are sufficiently distinct, then maybe you'd stand a chance.  
Especially if it also added heuristics like "never change fonts in mid-word" -- 
an obvious rule but not one I have seen implemented.
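
A minimal sketch of that heuristic, combined with a short candidate list, in 
Python.  It assumes a hypothetical OCR front end that reports, for each 
recognized character, a score per font; no real OCR package's API is implied, 
and the function name, the input format, and the font names are made up for 
illustration.

```python
def assign_word_fonts(glyphs, allowed_fonts):
    """Pick one font per word from a short list of candidate fonts.

    glyphs: list of (char, scores) pairs, where scores maps a font name to
            a confidence for that character (hypothetical OCR output).
    allowed_fonts: the 2 or 3 fonts the document is known to use.
    Returns a list of (word, font) pairs.
    """
    # Split the character stream into words at whitespace.
    words, current = [], []
    for ch, scores in glyphs:
        if ch.isspace():
            if current:
                words.append(current)
                current = []
        else:
            current.append((ch, scores))
    if current:
        words.append(current)

    result = []
    for word in words:
        # Sum the scores over the whole word, ignoring fonts not on the list.
        totals = {font: 0.0 for font in allowed_fonts}
        for _, scores in word:
            for font in allowed_fonts:
                totals[font] += scores.get(font, 0.0)
        # One font for the whole word: never change fonts in mid-word.
        best = max(allowed_fonts, key=lambda f: totals[f])
        result.append(("".join(ch for ch, _ in word), best))
    return result


# Example: the "O" alone looks more like Times, but the word as a whole
# is assigned Courier.  Helvetica is ignored because it is not a candidate.
glyphs = [
    ("M", {"Courier": 0.9, "Times": 0.1}),
    ("O", {"Courier": 0.4, "Times": 0.6}),
    ("V", {"Courier": 0.8, "Times": 0.2}),
    (" ", {}),
    ("A", {"Courier": 0.3, "Times": 0.7, "Helvetica": 0.9}),
]
print(assign_word_fonts(glyphs, ["Courier", "Times"]))
# prints [('MOV', 'Courier'), ('A', 'Times')]
```

Voting over the whole word means one badly scored character cannot flip the 
font by itself, which is the point of the rule.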

        paul
