[cctalk] Re: OCR and line printers

Paul Koning via cctalk Wed, 03 Dec 2025 11:55:52 -0800

> On Dec 3, 2025, at 10:55 AM, Adrian Godwin via cctalk <[email protected]> 
> wrote:
> 
> I don't think it's the general quality of the patent print that's poor,
> it's the line-printer listing section from
> https://www.hp9845.net/9845/downloads/patents/US4089059.pdf starting at
> about page 213 of the pdf, possibly section 26 of the patent.
> 
> The print in that section is much paler than the rest - typical of a worn
> line-printer ribbon. I doubt the printed copy is any better.  I'm only
> trying to OCR the listing, not the rest of the patent.

That's quite a clean listing, actually, cleaner than most I have worked with 
and dramatically better than some.  The sort of slightly-damaged characters 
that appear should be no problem at all for the "training" feature of ABBYY 
FineReader to deal with.  What you'd have to do is run a number of pages 
through it in training mode, so it sees a number of variations of the 
individual characters.  And as I mentioned, you'd do all the scanning in the 
mode where it only accepts what it was trained with, no "builtin" patterns.  
That way it won't make up stuff that isn't part of the character set but 
happens to match something built-in, like a pound-sterling sign.
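
As a cross-check on the OCR output, a small script can flag any character 
that falls outside the printer's character set.  This is only a sketch: the 
ALLOWED set below is an assumption (upper-case letters, digits, and some 
common punctuation), not the actual character set of the printer that 
produced this listing, and find_strays is a made-up name.

```python
import string

# Assumed character set; adjust to match the real print chain or drum.
ALLOWED = set(string.ascii_uppercase + string.digits + " .,+-*/=()$'#:;<>")

def find_strays(text, allowed=ALLOWED):
    """Return (line, column, char) for each character not in `allowed`.

    Line and column numbers are 1-based.
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch not in allowed:
                hits.append((lineno, col, ch))
    return hits

if __name__ == "__main__":
    sample = "LDA X\nSTA \u00a3Y"   # second line contains a pound-sterling sign
    for lineno, col, ch in find_strays(sample):
        print(f"line {lineno}, column {col}: unexpected {ch!r}")
```

Anything it reports is a spot where the recognizer matched a pattern that 
cannot have come from the printer, so those positions are worth checking 
against the scan by hand.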

It may be that scanning the listing as a table (with the various columns as 
table columns) will work well, and give you the layout explicitly.  Or it can 
be scanned as plain text, but in that case the spacing will mostly turn into 
individual spaces and you'd need post-processing to insert tabs etc. to make it 
look right again.  Given the simple assembler syntax involved that sort of 
post-processing would not be hard.
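
A minimal sketch of that sort of post-processing, under some assumptions 
that are not taken from the listing itself: each line has at most a label, 
an opcode, one operand field without embedded spaces, and a trailing 
comment, and a label can be told apart from an opcode by checking the first 
token against a list of mnemonics.  The OPCODES set is a placeholder, not 
the real instruction set, and realign is a made-up name.

```python
# Placeholder mnemonic list; replace with the real instruction set.
OPCODES = {"LDA", "STA", "ADA", "JMP", "JSM", "RET"}

def realign(line, opcodes=OPCODES):
    """Rebuild label / opcode / operand / comment columns, tab-separated.

    Assumes OCR collapsed the original column spacing into single spaces.
    """
    tokens = line.split()
    if not tokens:
        return ""
    label = ""
    # If the first token is not a known mnemonic, treat it as a label.
    if tokens[0].upper() not in opcodes:
        label = tokens.pop(0)
    opcode = tokens.pop(0) if tokens else ""
    operand = tokens.pop(0) if tokens else ""
    comment = " ".join(tokens)          # whatever is left is the comment
    return "\t".join([label, opcode, operand, comment]).rstrip()

if __name__ == "__main__":
    for raw in ["LOOP LDA COUNT GET COUNTER", "STA TEMP", "RET"]:
        print(realign(raw))
```

A real listing will probably have leading address and object-code columns 
as well; those would need to be split off first, for instance by fixed 
token count, before the source fields are realigned.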

        paul