[cctalk] Re: OCR and line printers

Paul Koning via cctalk Fri, 28 Nov 2025 10:35:46 -0800

I too have used FineReader (paid for, but the price is not all that high) to 
OCR various listings.  It's way better than the Acrobat OCR, at least the one I 
tried ages ago on the Ethernet spec (back before Bitsavers, or at least before 
I found it).  The learning feature is very good, and important when dealing 
with low quality input.  You can also tell it only to recognize what you taught 
it, i.e., not try to match any builtin patterns.  If you're doing line printer 
listings with 64 character sets, that's helpful, otherwise it may mistake 
something blurry for a pound-sterling sign.
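A related check can be run after the fact on any OCR output. The Python below is only a sketch of that idea, not the FineReader feature itself: it flags every character that falls outside an allowed repertoire. The ALLOWED set here is an assumption made for illustration (upper case letters, digits, space and a few punctuation marks); the real 64-character set depends on the printer.

```python
import string

# Assumed repertoire, for illustration only; substitute the actual
# character set of the printer that produced the listing.
ALLOWED = set(string.ascii_uppercase + string.digits + " +-*/=().,$'")

def find_suspects(text):
    """Return (line, column, character) for each character not in ALLOWED.

    Line and column numbers are 1-based.
    """
    suspects = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch not in ALLOWED:
                suspects.append((lineno, col, ch))
    return suspects
```

Anything it reports (a stray pound-sterling sign, a lower case letter) is a spot to check against the scan by hand.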

In some cases OCR just can't hack what it is given and the only option is to 
type it all in again.  I've done that with some listings that were ugly enough 
I sometimes had to zoom in to have one letter take up most of the screen, just 
to figure out what it was.  In that particular case, it also used a lot of 
overprinting to deal with mixed case text: upper case letters represented lower 
case, upper case with dot overprint for actual upper case.  You'd think that 
OCR training could handle that, but the difference is subtle enough it doesn't 
really work.

Supposedly current versions of the open source program Tesseract are pretty 
good, but I haven't tried it.  Looking into how to train it left me confused; 
it didn't seem at all convenient, unlike the interactive training feature of 
FineReader.

OCR likely will not handle fixed layout well (not unless you can treat it as 
tables).  If that's important, some Python or Emacs post-processing can clean 
up a lot.  Similarly if there are common recognition errors you can spot by 
pattern matching.  Scanning the CDC 6600 wire lists goes well this way, because 
the data have a very consistent pattern.  For example, the OCR might mix up 
zero and oh, but an edit pass can fix those 100%.
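As a rough illustration of that kind of edit pass, here is a Python sketch. The record layout is made up for the example (one letter, two digits, whitespace, three digits) and is not the actual wire list format; the point is only that once each position is known to be a letter or a digit, look-alike characters can be corrected mechanically.

```python
import re

# Look-alike substitutions, applied according to what a field must contain.
TO_DIGIT = str.maketrans("OoIl", "0011")
TO_LETTER = str.maketrans("01", "OI")

# Hypothetical layout: letter, 2 digits, whitespace, 3 digits ("Q07 152").
# The character classes accept the usual look-alikes in each position.
RECORD = re.compile(r"^([A-Z0-9])([0-9OoIl]{2})(\s+)([0-9OoIl]{3})$")

def fix_record(line):
    """Return the corrected record, or None if the line does not fit
    the assumed layout and needs a manual look."""
    m = RECORD.match(line.strip())
    if m is None:
        return None
    letter, two, gap, three = m.groups()
    return (letter.translate(TO_LETTER)
            + two.translate(TO_DIGIT)
            + gap
            + three.translate(TO_DIGIT))
```

With this layout, fix_record("QO7 15O") gives "Q07 150", and lines that do not match come back as None so they can be listed for manual review rather than silently changed.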

        paul

> On Nov 28, 2025, at 10:56 AM, David Wade via cctalk wrote:
> 
> I have a copy of ABBYY FineReader Pro which I got free on a magazine many 
> years ago.
> If it reads a character incorrectly you can add to the image <=> character 
> map so it can adapt for example to a damaged slug on a line printer train or 
> other type element.
> It's not 100% but I used it to scan the IBM1130 CSMP from the manual....
> 
> Dave
> 
> On 28/11/2025 14:57, Guy Fedorkow via cctalk wrote:
>> Greetings Restorers,
>>   I think a number of us have wanted to restore software that's only 
>> available as a scanned listing from a line printer.  The original printout 
>> probably wasn't the best typographic quality, and scanning doesn't improve 
>> it.
>>   As a first pass, OCR with tools like Adobe Acrobat can easily produce a 
>> rough draft of the content in text form, but it takes almost as much work to 
>> correct the many "typos" as it does to simply re-type the listing.
>>   It seems like, with all this high-tech AI processing around, it should be 
>> possible to take advantage of the limited character set, fixed fonts, and 
>> restricted grammar that one might find in a listing to resolve more of the 
>> ambiguities in character recognition.
>>   Does anyone have an approach that's more efficient than generic OCR and a 
>> long process of correcting typos on every line of code or comment?
>>   Thanks
>> /guy
>> 
>> 
>
