[cctalk] Re: Are there any useful OCR programs for scanning old listings and producing text with proper formatting

Paul Koning via cctalk Thu, 11 May 2023 09:24:25 -0700

> On May 11, 2023, at 12:12 PM, Marc Howard via cctalk <[email protected]> 
> wrote:
> 
> Marc Howard <[email protected]>
> May 10, 2023, 8:58 PM
> to cctalk-owner
> I have some listings I want to convert to ASCII.  They're line printer
> output from a computer that existed from the mid-sixties to the early 70's
> (Adage AGT series).
> 
> I can't find any OCR package that can take scanner output (either PDF or
> JPEG) and convert it to text with roughly the same number of spaces between
> words as was there originally.
> 
> Seems like it would be an easy task.  The input is non-proportional text
> from line printer output (actually it might have been printed on a Diablo
> HyType).  And yet all I get is most of the characters with either no or
> single spacing between words.  And it misses quite a bit of scanned
> characters at that.

Tesseract supposedly can do this.  There's a Tesseract fork, I don't remember 
the name, that was tweaked specifically for listings.  I believe it was a 
Japanese project.
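
For stock Tesseract, the setting I know of for this is the 
preserve_interword_spaces config variable, usually combined with page 
segmentation mode 6 (a single uniform block of text).  Here is a minimal sketch 
of driving it from Python.  The file names are hypothetical, and I have not 
checked how well it does on worn line printer output, so treat it as a starting 
point rather than a known-good recipe.

```python
import shutil
import subprocess


def tesseract_command(image_path, output_base):
    """Build a Tesseract command line that tries to keep column spacing."""
    return [
        "tesseract", image_path, output_base,
        "--psm", "6",                         # assume one uniform block of text
        "-c", "preserve_interword_spaces=1",  # keep runs of spaces between words
    ]


def ocr_listing(image_path, output_base):
    """Run Tesseract if it is installed; it writes output_base + '.txt'."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    subprocess.run(tesseract_command(image_path, output_base), check=True)


if __name__ == "__main__":
    # "page001.png" is a hypothetical scan; show the command without running it.
    print(" ".join(tesseract_command("page001.png", "page001")))
```

Scanning to a lossless image format rather than JPEG, at a resolution high 
enough that the characters are clean, may also help with the missed characters.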

I often use ABBYY FineReader, which does a good job with tough source material 
and has a good training feature.  It will not lose spaces entirely, but as you 
said, it does collapse multiple spaces.  For dealing with listings of 
structured material, like assembler output listings, I found that telling the 
program to interpret the page as tabular material works well.  That (usually) 
preserves line endings, which is also important, and it breaks the material up 
into columns so at least you can do a "pretty printer" type of cleanup on the 
rows of "table fields" that result.
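
The "pretty printer" step can be sketched roughly as below.  This assumes the 
table was exported as tab-separated text, one listing line per row, and that 
you know the start column of each field in the original listing.  The column 
positions and the sample row are made up for illustration; they are not from 
any particular assembler or from FineReader itself.

```python
def realign(fields, starts):
    """Place each field at its fixed start column (0-based) on one line."""
    line = ""
    for text, start in zip(fields, starts):
        if line and len(line) >= start:
            # The previous field overran this column: keep a single space.
            line += " "
        else:
            # Pad with spaces out to the column where this field begins.
            line = line.ljust(start)
        line += text.strip()
    return line.rstrip()


def realign_text(tsv_text, starts):
    """Re-space every tab-separated row of an exported table."""
    return "\n".join(
        realign(row.split("\t"), starts) for row in tsv_text.splitlines()
    )


if __name__ == "__main__":
    # Hypothetical layout: address at column 0, opcode at 8, operand at 16.
    print(realign_text("0100\tLDA\tX\n0101\tSTA\tY", [0, 8, 16]))
```

A field that is wider than its column is kept whole and followed by one space, 
so nothing is truncated, at the cost of pushing that one line out of alignment.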

        paul
