> On May 11, 2023, at 12:12 PM, Marc Howard via cctalk <[email protected]>
> wrote:
>
> Marc Howard <[email protected]>
> May 10, 2023, 8:58 PM
> to cctalk-owner
> I have some listings I want to convert to ASCII. They're line printer
> output from a computer that existed from the mid-sixties to the early 70's
> (Adage AGT series).
>
> I can't find any OCR package that can take scanner output (either PDF or
> JPEG) and convert it to text with roughly the same number of spaces between
> words as was there originally.
>
> Seems like it would be an easy task. The input is non-proportional text
> from line printer output (actually it might have been printed on a Diablo
> hytype). And yet all I get is most of the characters with either no or
> single spacing between words. And it misses quite a bit of scanned
> characters at that.

Tesseract can supposedly do this. There is also a Tesseract fork (I don't
remember the name) that was tweaked specifically for listings; I believe it
was a Japanese project.
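
For what it's worth, stock Tesseract has a config variable called
preserve_interword_spaces that is meant for this kind of fixed-pitch
material. Below is a minimal sketch of driving the command-line tool from
Python. It assumes the tesseract binary is on your PATH and that the scan
is an image file (a PDF would have to be rasterized first); the function
names and file names are made up for illustration, and I can't promise how
well it copes with your particular scans.

```python
import subprocess


def tesseract_command(image_path, output_base, psm=6):
    """Build a Tesseract CLI call that tries to keep runs of spaces.

    --psm 6 asks Tesseract to treat the page as one uniform block of text,
    which is an assumption that suits a plain listing page.
    """
    return [
        "tesseract", image_path, output_base,
        "--psm", str(psm),
        "-c", "preserve_interword_spaces=1",
    ]


def ocr_listing(image_path, output_base):
    """Run Tesseract; it writes its result to <output_base>.txt."""
    subprocess.run(tesseract_command(image_path, output_base), check=True)
    return output_base + ".txt"
```

Calling ocr_listing("page01.png", "page01") should leave the text in
page01.txt. The spacing it produces is estimated from the image, so expect
to touch up column alignment afterwards.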

I often use ABBYY FineReader, which does a good job with tough source material
and has a good training feature. It will not lose spaces entirely, but as you
said, it does collapse multiple spaces. For listings of structured material,
such as assembler output, I found that telling the program to interpret the
page as tabular material works well. That (usually) preserves line endings,
which is also important, and it breaks the material up into columns, so at
least you can do a "pretty printer" type of cleanup on the resulting rows of
"table fields".
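
The "pretty printer" cleanup could look something like the sketch below.
It assumes the table was exported as tab-separated text, one listing line
per row, and that you know the starting column of each field in the
original listing. The column positions and function names here are
invented for the example, not taken from any real listing format.

```python
def reflow_row(fields, column_starts):
    """Place each field at its fixed starting column, padding with spaces."""
    line = ""
    for text, start in zip(fields, column_starts):
        if line and len(line) >= start:
            # Previous field overran this column: keep a single space.
            line += " "
        else:
            line = line.ljust(start)
        line += text.strip()
    return line.rstrip()


def reflow_listing(tsv_text, column_starts):
    """Rebuild fixed-column lines from tab-separated rows."""
    return "\n".join(
        reflow_row(row.split("\t"), column_starts)
        for row in tsv_text.splitlines()
    )
```

For example, reflow_listing(exported_text, [0, 8, 16, 24]) would put an
address field at column 0, a label at 8, an opcode at 16 and an operand at
24, leaving blank any field that was empty in the table.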
paul