[cctalk] Re: OCR and line printers

Marc Howard via cctalk Fri, 05 Dec 2025 06:19:21 -0800

I've been trying for years to get usable scans of old computer listings.
Look at the attached file.  It was printed with a Diablo HyType and new
ribbon.  There are only 64 unique characters (6 bit character set) in the
entire listing.  Courier (non-proportional) font.  And yet the OCR results
are miserable, even ChatGPT's feeble effort.


Marc



On Wed, Dec 3, 2025 at 2:55 PM Paul Koning via cctalk <[email protected]>
wrote:

>
>
> > On Dec 3, 2025, at 10:55 AM, Adrian Godwin via cctalk <
> [email protected]> wrote:
> >
> > I don't think it's the general quality of the patent print that's poor,
> > it's the line-printer listing section from
> > https://www.hp9845.net/9845/downloads/patents/US4089059.pdf starting at
> > about page 213 of the pdf, possibly section 26 of the patent.
> >
> > The print in that section is much paler than the rest - typical of a worn
> > line-printer ribbon. I doubt the printed copy is any better.  I'm only
> > trying to OCR the listing, not the rest of the patent.
>
> That's quite a clean listing, actually, cleaner than most I have worked
> with and dramatically better than some.  The sort of slightly-damaged
> characters that appear should be no problem at all for the "training"
> feature of ABBYY FineReader to deal with.  What you'd have to do is run a
> number of pages through it in training mode, so it sees a number of
> variations of the individual characters.  And as I mentioned, you'd do all
> the scanning in the mode where it only accepts what it was trained with, no
> "builtin" patterns.  That way it won't make up stuff that isn't part of the
> character set but happens to match something built-in, like a
> pound-sterling sign.
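
Whatever OCR engine is used, the same "nothing outside the character set"
rule can be checked after the fact.  A minimal Python sketch follows.  It
assumes the 64-character repertoire is ASCII 0x20 through 0x5F, which is
only a guess for illustration and not necessarily the set used by any
particular listing; the name find_out_of_set is made up for this example.

```python
# Assumed repertoire: ASCII 0x20..0x5F (space, digits, upper case, and
# common punctuation).  This is an illustrative guess, not the actual
# character set of any specific printer or listing.
ALLOWED = {chr(code) for code in range(0x20, 0x60)}


def find_out_of_set(text, allowed=ALLOWED):
    """Return (line_no, column, char) for each character not in `allowed`.

    Line and column numbers are 1-based so they match what an editor shows.
    """
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for column, char in enumerate(line, start=1):
            if char not in allowed:
                hits.append((line_no, column, char))
    return hits


if __name__ == "__main__":
    sample = "LDA A,B\nSTA £5\n"
    for line_no, column, char in find_out_of_set(sample):
        print(f"line {line_no}, column {column}: unexpected {char!r}")
```

Anything it reports is a place where the OCR invented a character, which
narrows down what has to be proofread by hand.
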
>
> It may be that scanning the listing as a table (with the various columns
> as table columns) will work well, and give you the layout explicitly.  Or
> it can be scanned as plain text, but in that case the spacing will mostly
> turn into individual spaces and you'd need post-processing to insert tabs
> etc. to make it look right again.  Given the simple assembler syntax
> involved that sort of post-processing would not be hard.
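
As a rough idea of what that post-processing might look like, here is a
Python sketch.  The field layout is an assumption (a label starts in
column 1, then opcode, operand, and a trailing comment, with "*" marking
full-line comments); the real syntax of the listing may differ, and the
function name recolumn is invented for this example.

```python
def recolumn(line, comment_char="*"):
    """Rebuild tab-separated label/opcode/operand/comment fields.

    Assumes a label, if present, starts in column 1, and that a line with
    no label still has at least one leading space after OCR.
    """
    stripped = line.rstrip()
    if not stripped.strip():
        return ""                      # blank line stays blank
    if stripped.lstrip().startswith(comment_char):
        return stripped.lstrip()       # full-line comment, left as is
    has_label = not stripped[0].isspace()
    # Split off at most label, opcode, operand; the remainder is the comment.
    parts = stripped.split(None, 3 if has_label else 2)
    if not has_label:
        parts.insert(0, "")            # empty label field
    return "\t".join(parts)


if __name__ == "__main__":
    for raw in ["LOOP LDA COUNT GET THE COUNT", " STA TOTAL", "* HEADER"]:
        print(recolumn(raw))
```

A known weakness: an instruction with no operand but with a comment will
have the first word of the comment treated as the operand, so a table of
operand-less opcodes would be needed to handle that case.
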
>
>         paul
>
>
>
